A self-contained walkthrough for the Pi 5. Install llama.cpp, transfer a .kolm export, run inference. Realistic about the 4GB vs 8GB tradeoff: 1B and 3B int4 are comfortable; a 7B-to-8B int4 only fits on the 8GB board, and only with a tight memory budget.
Raspberry Pi OS (Bookworm or later) has the build tools in apt. Build llama.cpp from source for the latest GGUF format support.
$ sudo apt update
$ sudo apt install -y build-essential cmake git
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ cmake -B build
$ cmake --build build --config Release -j4
# the build produces ./build/bin/llama-cli, ./build/bin/llama-server, ./build/bin/llama-quantize, etc.
$ ./build/bin/llama-cli --version
This takes about 6 minutes on the Pi 5. NEON SIMD is on by default and is what makes int4 decode usable at this scale.
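If you want to confirm the CPU actually exposes NEON, check the kernel feature flags; on 64-bit Arm the flag is asimd, the AArch64 name for NEON:

$ grep -m1 -o asimd /proc/cpuinfo
# prints "asimd" on the Pi 5's Cortex-A76 cores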
Export the .kolm on the source machine. Run the export on whatever box you compiled the .kolm on, not the Pi. Use --quant int4 for the 4GB board; use --quant int8 or int4 for the 8GB board, depending on the base model size.
$ kolm export your-artifact.kolm \
    --backend gguf \
    --device "Raspberry Pi 5 (8GB)" \
    --quant int4 \
    --out ./exports/
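Before transferring, sanity-check that the export fits the Pi's memory. A rough rule of thumb for 4-bit GGUF quants is about 0.6 GB of weights per billion parameters, plus headroom for the KV cache and the OS:

$ ls -lh ./exports/your-artifact-int4.gguf
# ~1.8G for a 3B model: comfortable on the 4GB board, trivial on the 8GB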
The Pi 5 runs Raspberry Pi OS with SSH enabled if you switched it on in Raspberry Pi Imager's OS customization when you flashed the card. Use scp for one-shot transfers, rsync if the connection is flaky.
$ ssh pi@raspberrypi.local 'mkdir -p ~/models'
$ scp ./exports/your-artifact-int4.gguf pi@raspberrypi.local:~/models/
# resumable alternative:
$ rsync -avP ./exports/your-artifact-int4.gguf pi@raspberrypi.local:~/models/
If the Pi cannot reach the LAN (air-gapped install), write the file to a USB key and copy it over locally.
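Whichever route the file takes, verify it arrived intact before debugging anything downstream. The digests must match on both ends:

# on the source machine
$ sha256sum ./exports/your-artifact-int4.gguf
# on the Pi
$ sha256sum ~/models/your-artifact-int4.gguf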
SSH into the Pi and point llama-cli at the file. NEON-only decode is enough for a 3B int4 model. 8B int4 will run on the 8GB board but at 1 to 2 tok/s, which is fine for batch jobs but not for an interactive UI.
$ ssh pi@raspberrypi.local
$ cd ~/llama.cpp
$ ./build/bin/llama-cli -m ~/models/your-artifact-int4.gguf \
    -p "Summarize this paragraph in one sentence." \
    -n 256 \
    --temp 0.2 \
    -t 4
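On the 4GB board the KV cache is usually what pushes you over the limit; capping the context window with -c keeps the footprint predictable. A sketch, assuming 2048 tokens is enough for your task:

$ ./build/bin/llama-cli -m ~/models/your-artifact-int4.gguf \
    -c 2048 \
    -p "Summarize this paragraph in one sentence." \
    -n 256 -t 4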
For an OpenAI-compatible HTTP server, use llama-server instead:
$ ./build/bin/llama-server -m ~/models/your-artifact-int4.gguf --host 0.0.0.0 --port 8080 -t 4
With --host 0.0.0.0 (llama-server binds to localhost by default) this serves http://raspberrypi.local:8080/v1/chat/completions on the LAN. Drop in an OpenAI client and you have a private endpoint on an 80-dollar board.
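A quick smoke test from any machine on the LAN needs nothing but curl; for a single-model server the model field can be omitted:

$ curl http://raspberrypi.local:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Say hello in five words."}],
          "temperature": 0.2,
          "max_tokens": 64
        }'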
Quantization can cost up to 0.05 K-points. Confirm the drop is within your task tolerance before procurement signs off. Run kolm verify on the source .kolm to regenerate the binder, and run a small eval set on the Pi to compare.
# on the source machine
$ kolm verify your-artifact.kolm --binder report.html

# on the Pi, after pulling a small eval set
$ ./build/bin/llama-cli -m ~/models/your-artifact-int4.gguf -f ~/evals/sample.txt
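For a number rather than a gut check, the llama-perplexity binary built alongside llama-cli reports perplexity over a text file. Compare the score against an int8 export of the same artifact if the board has room for it; a rise of more than a few percent suggests the int4 quant is costing you:

$ ./build/bin/llama-perplexity -m ~/models/your-artifact-int4.gguf -f ~/evals/sample.txt -t 4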
For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.