A self-contained walkthrough for the Pi 5. Install llama.cpp, transfer a .kolm export, run inference. Realistic about the 4GB vs 8GB tradeoff: 1B and 3B int4 are comfortable; a 7B-to-8B int4 only fits on the 8GB board, and only with a tight memory budget.
Raspberry Pi OS (Bookworm or later) has the build tools in apt. Build llama.cpp from source for the latest GGUF format support.
$ sudo apt update
$ sudo apt install -y build-essential cmake git
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ cmake -B build
$ cmake --build build --config Release -j4
# the build produces ./build/bin/llama-cli, ./build/bin/llama-server, ./build/bin/llama-quantize, etc.
$ ./build/bin/llama-cli --version
This takes about 6 minutes on the Pi 5. NEON SIMD is on by default and is what makes int4 decode usable at this scale.
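If you want to confirm the CPU actually exposes NEON, check the kernel feature flags; on 64-bit Arm the flag is asimd, the AArch64 name for NEON:

$ grep -m1 -o asimd /proc/cpuinfo
# prints "asimd" on the Pi 5's Cortex-A76 cores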
Export the .kolm on the source machine. Run the export on whatever box you compiled the .kolm on, not the Pi. Use --quant int4 for the 4GB board; use --quant int8 or int4 for the 8GB board, depending on the base model size.
$ kolm export your-artifact.kolm \
    --backend gguf \
    --device "Raspberry Pi 5 (8GB)" \
    --quant int4 \
    --out ./exports/
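Before transferring, sanity-check that the export fits the Pi's memory. A rough rule of thumb for 4-bit GGUF quants is about 0.6 GB of weights per billion parameters, plus headroom for the KV cache and the OS:

$ ls -lh ./exports/your-artifact-int4.gguf
# ~1.8G for a 3B model: comfortable on the 4GB board, trivial on the 8GB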
The Pi 5 runs Raspberry Pi OS with SSH enabled if you switched it on in Raspberry Pi Imager's OS customization when you flashed the card. Use scp for one-shot transfers, rsync if the connection is flaky.
$ ssh pi@raspberrypi.local 'mkdir -p ~/models'
$ scp ./exports/your-artifact-int4.gguf pi@raspberrypi.local:~/models/
# resumable alternative:
$ rsync -avP ./exports/your-artifact-int4.gguf pi@raspberrypi.local:~/models/
If the Pi cannot reach the LAN (air-gapped install), write the file to a USB key and copy it over locally.
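Whichever route the file takes, verify it arrived intact before debugging anything downstream. The digests must match on both ends:

# on the source machine
$ sha256sum ./exports/your-artifact-int4.gguf
# on the Pi
$ sha256sum ~/models/your-artifact-int4.gguf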
SSH into the Pi and point llama-cli at the file. NEON-only decode is enough for a 3B int4 model. 8B int4 will run on the 8GB board but at 1 to 2 tok/s, which is fine for batch jobs but not for an interactive UI.
$ ssh pi@raspberrypi.local
$ cd ~/llama.cpp
$ ./build/bin/llama-cli -m ~/models/your-artifact-int4.gguf \
    -p "Summarize this paragraph in one sentence." \
    -n 256 \
    --temp 0.2 \
    -t 4
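On the 4GB board the KV cache is usually what pushes you over the limit; capping the context window with -c keeps the footprint predictable. A sketch, assuming 2048 tokens is enough for your task:

$ ./build/bin/llama-cli -m ~/models/your-artifact-int4.gguf \
    -c 2048 \
    -p "Summarize this paragraph in one sentence." \
    -n 256 -t 4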
For an OpenAI-compatible HTTP server, use llama-server instead:
$ ./build/bin/llama-server -m ~/models/your-artifact-int4.gguf --host 0.0.0.0 --port 8080 -t 4
With --host 0.0.0.0 (llama-server binds to localhost by default) this serves http://raspberrypi.local:8080/v1/chat/completions on the LAN. Drop in an OpenAI client and you have a private endpoint on an 80-dollar board.
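A quick smoke test from any machine on the LAN needs nothing but curl; for a single-model server the model field can be omitted:

$ curl http://raspberrypi.local:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Say hello in five words."}],
          "temperature": 0.2,
          "max_tokens": 64
        }'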
Quantization can cost up to 0.05 K-points. Confirm the drop is within your task tolerance before procurement signs off. Run kolm verify on the source .kolm to regenerate the binder, and run a small eval set on the Pi to compare.
# on the source machine
$ kolm verify your-artifact.kolm --binder report.html

# on the Pi, after pulling a small eval set
$ ./build/bin/llama-cli -m ~/models/your-artifact-int4.gguf -f ~/evals/sample.txt
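For a number rather than a gut check, the llama-perplexity binary built alongside llama-cli reports perplexity over a text file. Compare the score against an int8 export of the same artifact if the board has room for it; a rise of more than a few percent suggests the int4 quant is costing you:

$ ./build/bin/llama-perplexity -m ~/models/your-artifact-int4.gguf -f ~/evals/sample.txt -t 4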
For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.