The Orin Nano 8GB is a 25-watt edge box that runs a 7B int4 artifact at about 38 tok/s once the TensorRT engine is built. The first run pays a JIT cost of roughly 90 seconds for engine compilation; the built engine is then cached, so every run after skips the build. ONNX Runtime CUDA is the fallback path if TensorRT-LLM is not yet available for your JetPack release.
| Output artifact | phi-redactor.jetson-orin-nano.trt-engine |
|---|---|
| Engine size | about 1.4 GB |
| Throughput | 38 tok/s (7B INT4, public benchmarks) |
| K-score (estimated) | 0.88 (drift from int4 quant) |
| Fits in memory | yes (1.4 GB engine + KV cache fits under 8 GB) |
| First-run JIT | about 90 seconds (TensorRT engine build, cached after) |
JetPack 6 ships CUDA 12, cuDNN 9, and TensorRT 10 out of the box. Flash with NVIDIA SDK Manager from a host Linux box, or write the L4T image to a fresh microSD card. After boot, confirm the CUDA toolchain is on PATH.
$ sudo apt update
$ sudo apt install -y python3-pip python3-venv git
$ nvcc --version     # expected: release 12.x, build cuda_12.x
$ nvidia-smi         # expected: Orin (nvgpu) device listed
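It is also worth confirming the rest of the JetPack stack matches what this guide assumes. The package names below are the ones JetPack 6 ships; adjust the grep if your image differs:

$ dpkg -l | grep -E 'nvidia-jetpack|tensorrt|cudnn' | head
# expected: nvidia-jetpack 6.x, tensorrt 10.x, and cudnn 9.x packages listed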
TensorRT-LLM is the NVIDIA-blessed path for LLM inference on Jetson. It compiles a static engine that runs at near-peak Ampere throughput. ONNX Runtime CUDA is the portable fallback if a JetPack release lags behind TensorRT-LLM upstream.
# TensorRT-LLM (preferred)
$ python3 -m venv ~/kolm-trt
$ source ~/kolm-trt/bin/activate
$ pip install --upgrade pip
$ pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# ONNX Runtime CUDA (fallback)
$ pip install onnxruntime-gpu optimum[exporters]
If tensorrt_llm import fails, the JetPack version is older than the wheel. Either upgrade JetPack or use the ONNX fallback. The kolm export command handles both.
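A one-line sanity check per backend tells you which path you are on. These use the standard module names for the two packages; exact versions depend on your JetPack:

$ python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# succeeds only when the wheel matches your JetPack; otherwise use the fallback
$ python3 -c "import onnxruntime; print(onnxruntime.get_device())"
# expected: GPU on a working onnxruntime-gpu install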
Export the .kolm on the source machine. Run this on whatever box you compiled the .kolm on, not the Jetson. The export produces a serialized TensorRT engine targeted at Orin Nano (sm_87). Use --quant int4 for the 8 GB module.
$ kolm export phi-redactor.kolm --device jetson-orin-nano --backend trt-llm --quant int4
# output: phi-redactor.jetson-orin-nano.trt-engine (about 1.4 GB)
The export writes a single engine file, phi-redactor.jetson-orin-nano.trt-engine, plus a small kolm.manifest.json used by verify. Ship both together.
The Jetson Orin Nano DevKit ships with Ubuntu and SSH ready out of the box. Use scp for one-shot transfers, or rsync if the LAN is flaky.
$ scp phi-redactor.jetson-orin-nano.trt-engine nvidia@jetson.local:~/engines/
$ scp kolm.manifest.json nvidia@jetson.local:~/engines/
# resumable alternative:
$ rsync -avP phi-redactor.jetson-orin-nano.trt-engine nvidia@jetson.local:~/engines/
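If you want an integrity check on top of the copy, plain coreutils checksums are enough. This is generic shell, not a kolm feature:

# on the source machine
$ sha256sum phi-redactor.jetson-orin-nano.trt-engine kolm.manifest.json > engines.sha256
$ scp engines.sha256 nvidia@jetson.local:~/engines/
# on the Jetson
$ cd ~/engines && sha256sum -c engines.sha256
# expected: both files report OK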
SSH into the Jetson and invoke kolm run against the engine file. The first run JIT-builds the runtime context, which takes about 90 seconds. After that the engine is cached in ~/.kolm/trt-cache and subsequent runs start instantly.
$ ssh nvidia@jetson.local
$ kolm run phi-redactor.jetson-orin-nano.trt-engine 'MRN 123456 SSN 555-44-3333'
# first run: about 90s for engine warmup, then 38 tok/s steady state
# subsequent runs: instant warmup
For an OpenAI-compatible HTTP server, use the TensorRT-LLM serve harness directly:
$ trtllm-serve --engine_dir ~/engines/phi-redactor.jetson-orin-nano.trt-engine --port 8080
# OpenAI-compatible endpoint at http://jetson.local:8080/v1/chat/completions
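A quick smoke test from any machine on the LAN, assuming the usual OpenAI-style request shape. The model name the server registers for the engine is not fixed here, so list /v1/models first if the request is rejected:

$ curl -s http://jetson.local:8080/v1/models
$ curl -s http://jetson.local:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "phi-redactor", "messages": [{"role": "user", "content": "MRN 123456 SSN 555-44-3333"}]}'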
- Run sudo nvpmodel -m 0 and sudo jetson_clocks to get the 38 tok/s figure above. At 7W (battery mode), expect about 18 tok/s on the same workload.
- Going below int4 (--quant int3) brings K-est to about 0.84, which may fail your ship gate. Verify against the eval pack before going lower than int4.
- The built engine is cached in ~/.kolm/trt-cache, and every subsequent run is instant.
- Watch tegrastats for the GPU clock. If the clock drops below 800 MHz the module is thermal-throttling. Add a fan, or run sudo jetson_clocks to pin the clock at max.
- If the tensorrt_llm wheel does not import, fall back to --backend onnxruntime-cuda. Throughput is about 60 percent of TensorRT-LLM; K-est is identical.

Quantization can cost up to 0.05 K-points. Confirm the drop is within your task tolerance before procurement signs off. Run kolm verify on the source .kolm to regenerate the binder, and run kolm bench against a small eval set on the Jetson to compare.
# on the source machine
$ kolm verify phi-redactor.kolm --binder report.html

# on the Jetson, after pulling a small eval set
$ kolm bench phi-redactor.jetson-orin-nano.trt-engine --evals ~/evals/sample.jsonl
For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.