Research · Edge · 2026-05-14 · 11 min read

Edge artifact sizing: M3 Pro down to Pi 5.

The base-and-quant decision determines whether a .kolm artifact runs at all on the device the operations team is buying for the plant floor. Pick the base from the hardware you have, not from the model leaderboard you read. The ladder, the trade space, and a worked example with illustrative timings.

By kolm · Tags: edge · quantization · sizing

The hardware ladder.

Edge is a four-rung ladder. Each rung is an order of magnitude smaller than the one above it in compute, memory, and power budget. The same kolm artifact format runs on all four; the base model and the quantization tier change.

Laptop class. M3 Pro / M3 Max, RTX 4070 to 4090, recent Ryzen with discrete GPU. 16-128 GB of unified or discrete memory, 30-80 watts under load, fan cooling, ample fast disk. The compile happens here for most teams. The runtime can hold a 7B base in int4 with room for context, recall, and a comfortable batch.

Edge-box class. Jetson Orin Nano, Intel NUC with iGPU, mini-PC with an N100 or N305 CPU. 8-32 GB of memory, 15-30 watts, often fanless or low-fan, often ruggedized. This is the rung most plant-floor and retail-shelf deployments land on. A 3B base in int4 fits with room for context; a 7B base fits with care.

SBC class. Raspberry Pi 5 (8 GB), Coral Edge TPU, BeagleBone AI-64. 4-8 GB of memory, 5-15 watts, passive cooling, microSD or eMMC storage. A 1B base in int4 fits; a 3B base is on the edge of feasible but eats most of the RAM. Latency budgets stretch because there is no GPU; the CPU does all the math.

Microcontroller class. Cortex-M4/M7, ESP32-S3/C6, Nordic nRF52. Kilobytes to a few megabytes of memory. Out of scope for an LLM-backed artifact; the recipe-pack-only path (deterministic patterns, no decoder) is the kolm story at this rung. We're not covering it here because the artifact composition is different.

The single most useful observation about the ladder is that the gap between rungs is roughly 10x in compute, while the price per unit drops by a large factor at each step down. A plant has 200 production lines; an M3 Max per line is on the order of $640k of hardware at list prices. The same problem on a Pi 5 per line is on the order of $16k. The artifact has to fit at the rung the unit economics allow.
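
The arithmetic is worth making explicit. A minimal sketch: the list prices (roughly $3,200 for an M3 Max laptop, $620 for the N100 mini-PC from the worked example below, $80 for a Pi 5) are assumptions, not quotes.

# back-of-envelope fleet cost per rung; unit prices are assumed
# list prices, not quotes
LINES = 200
unit_price = {
    "laptop (M3 Max)": 3200,
    "edge box (N100)": 620,
    "SBC (Pi 5 8 GB)": 80,
}
for rung, price in unit_price.items():
    print(f"{rung:<17} x {LINES} lines = ${LINES * price:>9,}")
# laptop (M3 Max)   x 200 lines = $  640,000
# edge box (N100)   x 200 lines = $  124,000
# SBC (Pi 5 8 GB)   x 200 lines = $   16,000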

The base-size landscape.

Five size buckets, roughly. The numbers below are approximate; pick the closest bucket and ignore the half-bucket subtleties.

bucket  example bases                                        fp16 size  int4 size  fits on
0.5B    Qwen2.5-0.5B-Instruct                                ~1 GB      ~0.3 GB    Pi 5, Coral, mid SBC
1B      Llama-3.2-1B-Instruct, Qwen2.5-1.5B, TinyLlama-1.1B  ~2 GB      ~0.6 GB    Pi 5 (snug), Jetson Nano
3B      Qwen2.5-3B-Instruct, Phi-3-mini                      ~6 GB      ~1.8 GB    Jetson Orin, NUC, laptop
7-8B    Mistral-7B, Llama-3.1-8B, Qwen2.5-7B                 ~14-16 GB  ~4-5 GB    Jetson AGX, full laptop
70B+    Llama-3.1-70B, Qwen2.5-72B                           ~140 GB    ~40 GB     cloud or workstation only

The 70B row is in the table to be honest about where the line is: 70B does not fit on any device we'd call edge. It belongs in the cloud or in a buyer-owned datacenter rack. The 7-8B row is the upper end of edge feasibility, and only on the laptop and Jetson AGX rungs.

For most edge use cases, the right starting point is 1B or 3B. The reason is not the leaderboard score; it's that the use cases that survive at the edge are narrow tasks (classification, extraction, structured generation), not open-ended chat. A narrow task on a 1B base with a well-fitted LoRA and a focused recall index produces K-scores in the 0.85-0.95 range for the tasks we've shipped; the marginal gain from going to 7B does not justify the 4x increase in cost and latency for the device.
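
The size columns generalize to a one-line formula: parameters times bits-per-weight over eight, plus overhead. A minimal estimator, assuming a flat ~10% overhead for embeddings, tokenizer, and metadata; the bits-per-weight values are nominal (real k-quants average somewhat above their nominal width, which the int4 figure here approximates).

# rough artifact size: params x bits-per-weight / 8, plus ~10%
# overhead for embeddings, tokenizer, and metadata (assumption)
BITS = {"fp16": 16, "int8": 8, "int4": 4.5, "int3": 3.4}

def est_size_gb(params_b, tier, overhead=1.10):
    return params_b * BITS[tier] / 8 * overhead

for b in (0.5, 1, 3, 8, 70):
    print(f"{b:>4}B  fp16 ~{est_size_gb(b, 'fp16'):6.1f} GB   int4 ~{est_size_gb(b, 'int4'):5.2f} GB")
# 3B at int4 comes out at ~1.86 GB, close to the 1.84 GB manifest below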

Quantization tiers.

Four tiers we ship and one we do not. The quantization is applied at compile time; the artifact's manifest records the tier and the resulting K-score on the embedded eval pack.

fp16. Half-precision floats, the format the base typically ships in. Reference accuracy. The K-score at fp16 is the upper bound; every other tier loses some fraction of it. 16 bits per parameter. Fits on laptop class easily, edge-box class for 3B and below, SBC class only for 0.5B.

int8. 8-bit integer weights. Typical K-score loss is small (single-digit percent on most tasks we've measured), and the size halves. The compute is faster on hardware with int8 fused multiply-add (most modern CPUs with AVX-512_VNNI, most GPUs with Tensor Cores, the Apple Neural Engine). This is a good default for edge-box class.

int4 (q4_k_m gguf). 4-bit grouped quantization, the format that the gguf ecosystem has converged on. The "k_m" variant uses different bit widths for different weight groups; the empirical result is closer to int6 quality at int4 size. Typical K-score loss versus fp16 is moderate (somewhere in the 2-8% range for the tasks we've shipped, illustrative). 4 bits per parameter. This is the default for SBC class and a reasonable choice for edge-box class.

int3 (q3_k_s gguf). 3-bit grouped quantization. The K-score loss starts to be observable, particularly on tasks with long-context dependencies or precise extraction. We ship this only when the bigger artifact has no chance of fitting; on most Pi 5 deployments, q4_k_m is preferred.

int2 and below. The K-score cliff. We do not ship below 3-bit by default. Tasks degrade unpredictably: classification accuracy can survive int2 on simple tasks, but anything requiring multi-step reasoning or numerical fidelity collapses. The compile gate is set to refuse below 3-bit unless the operator explicitly overrides it.
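
The tier ladder plus the gate can be stated in a few lines. A sketch of the gate logic as described above, not the actual kolm implementation; the 0.85 floor and the 3-bit refusal are the documented behaviors, the function shape is hypothetical.

# sketch of the compile gate: refuse below 3-bit without an explicit
# override, fail the build when the quantized K-score misses the floor
TIER_BITS = {"fp16": 16, "int8": 8, "q4_k_m": 4, "q3_k_s": 3, "q2_k": 2}
K_FLOOR = 0.85

def gate(tier, k_score, fp16_ref, override=False):
    if TIER_BITS[tier] < 3 and not override:
        raise ValueError(f"{tier} is below 3-bit; refused without override")
    if k_score < K_FLOOR:
        raise ValueError(f"K-score {k_score:.3f} is below the {K_FLOOR} gate")
    return {"tier": tier, "loss_vs_fp16": round(1 - k_score / fp16_ref, 3)}

print(gate("q4_k_m", 0.917, 0.943))   # {'tier': 'q4_k_m', 'loss_vs_fp16': 0.028}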

# the compile records the quantization tier in the manifest
$ kolm inspect vibration_anomaly-1.0.0.kolm | grep -E 'base|quant|K-score'
base:     Qwen/Qwen2.5-3B-Instruct
quant:    int4 (q4_k_m)
size:     1.84 GB on disk
K-score:  0.917 (fp16 reference: 0.943; loss: 2.8%)

The K-score loss line is the part that matters. The artifact records the fp16 reference at compile time, so the deployer sees the cost of quantization explicitly. A 2.8% loss against a 0.943 fp16 reference, landing at 0.917 on the gate, is the trade we accept; a 15% loss landing below the 0.85 gate would fail the build.

Latency budgets.

Three buckets that buyers actually use.

Below 100 ms. Consumer interactive feel. Typing-completion, voice-trigger, button-press response. The user perceives "instant". At this budget, the artifact has to fit fully in fast memory (GPU VRAM or unified memory), the base has to be small (1B or below), and the recall index has to live in RAM rather than on disk. This bucket is uncommon at the edge; most edge tasks are batch or near-real-time rather than typing-fast.

Below 500 ms. Acceptable interactive. A button press, a form submission, a sensor reading that triggers a classification. Most edge tasks land here. 3B int4 on a Jetson Orin Nano typically delivers a first-token in the 50-200 ms range, full short response in 300-450 ms (illustrative; depends heavily on prompt length and decode length).

Below 2 seconds. Batch tolerable. A scheduled scan, a periodic audit, an end-of-shift summary. Pi 5 territory. 1B int4 on a Pi 5 runs short tasks in the 500-1500 ms range. Slower decode is acceptable because there is no human waiting in real time.

The artifact-sizing rule that follows: pick the latency bucket first, then pick the rung that hits it, then pick the largest base that fits at the rung with a quantization tier above q3_k_s. If the math doesn't close, the answer is "this is not an edge task" and the deploy goes back to the cloud or the on-prem rack.
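
The rule compresses into a few lines of selection logic. A sketch with assumed per-rung latency floors and unit costs; the values are simplifications of the ladder and buckets above, not product data.

# the sizing rule: cheapest rung that hits the latency budget and the
# unit economics, then the largest base that fits there at q4_k_m or better
RUNGS = [            # (name, achievable short-task p50 in ms, assumed unit cost)
    ("sbc",      2000, 100),
    ("edge-box",  500, 700),
    ("laptop",    100, 2500),
]
LARGEST_BASE = {"sbc": "1B", "edge-box": "3B", "laptop": "7-8B"}

def size_artifact(latency_budget_ms, unit_budget_usd):
    for rung, achievable_ms, cost in RUNGS:           # cheapest rung first
        if achievable_ms <= latency_budget_ms and cost <= unit_budget_usd:
            return rung, LARGEST_BASE[rung]
    return None                                       # not an edge task

print(size_artifact(500, 800))   # ('edge-box', '3B') -- the worked example below
print(size_artifact(100, 150))   # None: sub-100 ms needs laptop class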

A worked example.

Vibration-anomaly classifier on a plant floor. The task is: given a 1-second window of accelerometer data plus a maintenance-log excerpt, classify the window as "nominal" or one of six fault categories (bearing wear, imbalance, misalignment, looseness, electrical fault, lubrication issue).

Constraints. Hardware budget per line: $800 for an industrial mini-PC. Latency budget: under 500 ms per classification (the line speed allows that). Network: an OT LAN with no internet; the artifact has to run airgapped (see the air-gapped deployment article). Refresh cadence: quarterly.

Hardware choice. Intel N100-based mini-PC with 16 GB RAM, $620 at list. Edge-box rung. No discrete GPU; the iGPU has limited LLM acceleration, so CPU inference is the realistic path.

Base and quant. Qwen2.5-3B-Instruct, int4 (q4_k_m). 1.84 GB on disk, fits in RAM with room for context and the recall index. The 3B size is large enough to handle the structured generation (a JSON-formatted classification with a per-category confidence) but small enough that CPU decoding hits the latency budget.
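
A quick RAM-headroom check supports the "fits with room" claim. A sketch, not a measurement: the KV-cache arithmetic assumes Qwen2.5-3B's grouped-query layout (36 layers, 2 KV heads, head dim 128) with an fp16 cache, and the recall-index and OS figures are assumptions.

# rough RAM budget on the 16 GB N100 box; estimates, not measurements
GB = 1024**3
weights      = 1.84 * GB                 # from the manifest
kv_per_token = 2 * 36 * 2 * 128 * 2      # K+V x layers x kv_heads x dim x fp16
kv_cache     = 4096 * kv_per_token       # ~144 MB at a 4k context window
recall_index = 0.5 * GB                  # assumed index size
os_and_rest  = 2.0 * GB                  # OS, runtime, PLC bridge (assumption)

headroom = 16 * GB - weights - kv_cache - recall_index - os_and_rest
print(f"headroom: {headroom / GB:.1f} GB")   # ~11.5 GB to spare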

# the compile, run on the engineer's laptop, dispatched to local-cuda
$ kolm compile -t vibration_anomaly -d ./vibration_corpus.jsonl \
              --base Qwen/Qwen2.5-3B-Instruct --quant q4_k_m \
              --target-device n100
[trainer]     using local-cuda (laptop rtx 4070)
[trainer]     LoRA rank 16 on q_proj, v_proj, o_proj
[trainer]     eval pack: 200 windows, deterministic
[K-score]     0.917 (fp16 reference 0.943; loss 2.8%)
[K-score]     gate 0.85 passed
[artifact]    ./vibration_anomaly-1.0.0.kolm  1.84 GB

Latency. On the target N100 box, the artifact's p50 for a full classification (35 tokens of JSON output) lands around 280-340 ms after warmup, with first-token a small fraction of that. Example numbers, illustrative; actual numbers depend on prompt length, the specific N100 SKU, and thermal state. The point is that the budget closes with margin.

Deployment. The artifact bundles with the offline base weights and ships on a signed image to each of the 200 lines. The offline switches (KOLM_AIRGAP=1, HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1) are baked into the image. The runtime starts at boot, opens a local UDS, and answers classification calls from the PLC bridge.

The sizing decision is not "what's the best model for vibration anomaly". It is "what fits on the box that fits in the unit economics, at the latency the line tolerates, with the K-score the safety officer signs off on."

Capture, distill, refresh.

Edge devices have small disks and intermittent operator visits. The capture-distill-refresh cycle has to respect those constraints.

Capture. The runtime writes a rolling local cache of inputs, outputs, and confidence scores. Disk budget is a hard cap (typically 1-4 GB per device); old entries are evicted FIFO. The cache is encrypted at rest with a per-device key.
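
The cache behavior is simple enough to sketch. In-memory here for brevity, and encryption is elided; the real cache persists to disk and encrypts with the per-device key.

# sketch of the rolling capture cache: hard byte cap, FIFO eviction
import collections, json

class CaptureCache:
    def __init__(self, cap_bytes=2 * 1024**3):       # 2 GB, inside the 1-4 GB range
        self.cap, self.used = cap_bytes, 0
        self.entries = collections.deque()

    def append(self, record):
        blob = json.dumps(record).encode()
        self.entries.append(blob)
        self.used += len(blob)
        while self.used > self.cap:                  # evict oldest first
            self.used -= len(self.entries.popleft())

cache = CaptureCache()
cache.append({"input": "...", "output": "nominal", "confidence": 0.97})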

Sync. On a scheduled cadence (nightly or weekly, depending on the OT network's policy), a one-way sync exports the cache to the IT side. The OT-to-IT direction is allowed; the reverse is not (model updates land via signed-image sneakernet, see the air-gapped article).

Distill. The IT-side pipeline reads the captured traffic, runs it through the next-tier model for ground truth, builds a fresh recipe pack and a fresh eval set, and recompiles. The new artifact gets a new CID, a new receipt chain, and a new K-score on the updated eval pack.

Refresh. The signed image goes back to the OT side. The operator visits the box, swaps the image, and verifies (the airgap verify path). The old artifact is retained for 90 days for rollback. The audit log records the swap.

The disk-budget rule: an edge device should not be expected to hold more than one production artifact, one rollback artifact, and a rolling capture cache. Everything else (intermediate datasets, training logs, eval set history) lives on the IT side.
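
The rule as arithmetic, with an assumed OS-image footprint; the artifact and cache figures mirror the worked example and the capture cap above.

# disk budget per device: production artifact + rollback + capture cache
artifact_gb, cache_gb, image_gb = 1.84, 2.0, 8.0   # image_gb is an assumption
need = 2 * artifact_gb + cache_gb + image_gb       # two artifacts: prod + rollback
print(f"~{need:.1f} GB before logs and headroom")  # ~13.7 GB: a 32 GB eMMC closes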

Honest limits.

Three places the artifact-sizing story breaks.

Quantization below int4 has a K-score cliff. q3_k_s sometimes works; q2 reliably does not. The cliff is task-dependent: classification on a small label set survives lower bits than open-ended generation. The compile gate enforces a floor of 0.85 K-score, which in practice means q4_k_m is the typical bottom of the ladder. Going lower means accepting a build failure and re-planning the deploy.

Long-context tasks are hostile to small bases. A 1B int4 base with a 16k context window does not produce 16k-quality output; context capability scales roughly with parameter count. Tasks that depend on reasoning over thousands of tokens (long contracts, sprawling support threads, multi-document reasoning) push the base-size requirement up faster than their relaxed latency budgets pull the hardware requirement down. If the task needs long context, the rung is laptop class or above, not SBC.

Cold-start is the operator's worst experience. A signed image swap on an edge box takes seconds to minutes depending on the disk speed and the verifier's work. The runtime cold-start (loading 1.8 GB of weights into RAM) is the dominant term. Most edge deployments are designed around "the runtime is always up", with health checks and watchdog timers; the cold-start is amortized at the moment of swap, not at the moment of first request.
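
The dominant term is easy to estimate from storage bandwidth. A sketch with rough class-typical bandwidths; the figures are assumptions, not measurements.

# cold-start estimate: weight-load time = artifact size / storage bandwidth
weights_gb = 1.84
for medium, gb_per_s in [("NVMe", 3.0), ("SATA SSD", 0.5),
                         ("eMMC", 0.25), ("microSD", 0.09)]:
    print(f"{medium:<8} ~{weights_gb / gb_per_s:5.1f} s to load weights")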

For most edge problems, the sizing decision is a one-line answer once you have the hardware budget and the latency budget. The compile records the choice in the manifest; the receipt records the K-score; the audit log records the deploy. If a future operator wants to know why this artifact at this size is on this box, the receipt chain answers.