Jetson quickstart · Ampere GPU · TensorRT-LLM

Jetson Orin Nano: build engine, scp, run.

The Orin Nano 8GB is a 25-watt edge box that runs a 7B int4 artifact at about 38 tok/s once the TensorRT engine is built. The first run pays a JIT cost of roughly 90 seconds for engine compilation, then every run after is cached. ONNX Runtime CUDA is the fallback path if TensorRT-LLM is not yet on your JetPack.

Module: Jetson Orin Nano 8GB
GPU: Ampere, 1024 CUDA cores, 32 Tensor Cores
Compute: 67 TOPS (sparse INT8)
Memory: 8 GB LPDDR5, unified
Backend: TensorRT-LLM (fallback: ONNX Runtime CUDA)
Power: 7 W to 25 W (selectable)

Forecast for a 7B int4 artifact.

Output artifact: phi-redactor.jetson-orin-nano.trt-engine
Engine size: about 1.4 GB
Throughput: 38 tok/s (7B INT4, public benchmarks)
K-score (estimated): 0.88 (drift from int4 quant)
Fits in memory: yes (1.4 GB engine + KV cache fits under 8 GB)
First-run JIT: about 90 seconds (TensorRT engine build, cached after)
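The fits-in-memory line can be sanity-checked with rough arithmetic. A minimal sketch, assuming Llama-style 7B geometry (32 layers, 32 KV heads, head dim 128) and an fp16 KV cache at 4K context; the numbers are illustrative, not measured on device:

```python
def kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                   context_len=4096, bytes_per_elem=2):
    # One K and one V tensor per layer, per cached token
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

engine_gb = 1.4  # forecast engine size from the table above
kv_gb = kv_cache_bytes() / 1024**3
print(f"KV cache: {kv_gb:.1f} GiB, total: {engine_gb + kv_gb:.1f} GiB")
```

At these assumptions the KV cache is 2.0 GiB, so engine plus cache stays around 3.4 GiB, comfortably inside the 8 GB module even after the OS takes its share.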

Step 1. Flash the Jetson with JetPack 6.

JetPack 6 ships CUDA 12, cuDNN 9, and TensorRT 10 out of the box. Use NVIDIA SDK Manager from a host Linux box, or flash a fresh microSD via the L4T image. After boot, confirm the CUDA toolchain is on PATH.

$ sudo apt update
$ sudo apt install -y python3-pip python3-venv git
$ nvcc --version
# expected: release 12.x, build cuda_12.x
$ nvidia-smi
# expected: Orin (nvgpu) device listed

Step 2. Install TensorRT-LLM (preferred) or ONNX Runtime CUDA (fallback).

TensorRT-LLM is the NVIDIA-blessed path for LLM inference on Jetson. It compiles a static engine that runs at near-peak Ampere throughput. ONNX Runtime CUDA is the portable fallback if a JetPack release lags behind TensorRT-LLM upstream.

# TensorRT-LLM (preferred)
$ python3 -m venv ~/kolm-trt
$ source ~/kolm-trt/bin/activate
$ pip install --upgrade pip
$ pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# ONNX Runtime CUDA (fallback)
$ pip install onnxruntime-gpu optimum[exporters]

If importing tensorrt_llm fails, the installed JetPack is older than the wheel requires. Either upgrade JetPack or fall back to ONNX Runtime CUDA. The kolm export command supports both backends.
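The preferred-then-fallback selection can be sketched generically. This is not the kolm implementation; pick_backend and the candidate order are illustrative:

```python
import importlib.util

def pick_backend(candidates=("tensorrt_llm", "onnxruntime")):
    """Return the first importable backend, in preferred-to-fallback order."""
    for name in candidates:
        # find_spec checks importability without actually importing the module
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError("no inference backend installed")
```

The same probe works for any ordered list of optional dependencies, which is why a single export command can serve both paths.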

Step 3. Export from your .kolm on the source machine.

Run this on whatever box you compiled the .kolm on, not the Jetson. The export produces a serialized TensorRT engine targeted at Orin Nano (sm_87). Use --quant int4 for the 8 GB module.

$ kolm export phi-redactor.kolm --device jetson-orin-nano --backend trt-llm --quant int4
# output: phi-redactor.jetson-orin-nano.trt-engine (about 1.4 GB)

The output is a single file at phi-redactor.jetson-orin-nano.trt-engine plus a small kolm.manifest.json for verify. Both ship together.
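The manifest exists so the engine can be checked after transfer. A minimal sketch of that check, assuming the manifest carries a sha256 field for the engine (the field name is an assumption, not the documented kolm.manifest.json schema):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk=1 << 20):
    # Stream the file so a 1.4 GB engine never sits in memory whole
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def matches_manifest(engine_path, manifest_path):
    manifest = json.loads(Path(manifest_path).read_text())
    return sha256_of(engine_path) == manifest.get("sha256")
```

Run it on the Jetson after Step 4 and a flaky transfer shows up immediately instead of as a cryptic engine-load failure.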

Step 4. Move it to the Jetson.

Jetson Orin Nano DevKit ships with Ubuntu and SSH ready out of the box. Use scp for one-shot transfers, rsync if the LAN is flaky.

$ scp phi-redactor.jetson-orin-nano.trt-engine nvidia@jetson.local:~/engines/
$ scp kolm.manifest.json nvidia@jetson.local:~/engines/
# resumable alternative:
$ rsync -avP phi-redactor.jetson-orin-nano.trt-engine nvidia@jetson.local:~/engines/

Step 5. Run on the Jetson.

SSH into the Jetson and invoke kolm run against the engine file. The first run JIT-builds the runtime context, which takes about 90 seconds. After that the engine is cached in ~/.kolm/trt-cache and subsequent runs start near-instantly.

$ ssh nvidia@jetson.local
$ kolm run phi-redactor.jetson-orin-nano.trt-engine 'MRN 123456 SSN 555-44-3333'
# first run: about 90s for engine warmup, then 38 tok/s steady state
# subsequent runs: instant warmup

For an OpenAI-compatible HTTP server, use the TensorRT-LLM serve harness directly:

$ trtllm-serve --engine_dir ~/engines/phi-redactor.jetson-orin-nano.trt-engine --port 8080
# OpenAI-compatible endpoint at http://jetson.local:8080/v1/chat/completions
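Any OpenAI-style client can talk to that endpoint. A stdlib-only sketch; the model name "phi-redactor" is an assumption (use whatever name the serve harness registers), and the request body follows the standard chat completions schema:

```python
import json
import urllib.request

def build_payload(text, model="phi-redactor", max_tokens=256):
    # Standard OpenAI chat-completions request body
    return {
        "model": model,
        "messages": [{"role": "user", "content": text}],
        "max_tokens": max_tokens,
    }

def redact(text, host="jetson.local", port=8080):
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Point it at jetson.local:8080 and the same request shape works against any other OpenAI-compatible server, which keeps client code portable off the Jetson.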

Throughput envelope on Orin Nano.

Power mode matters. The Orin Nano defaults to a 15W power profile out of the box. Push it to 25W with sudo nvpmodel -m 0 and sudo jetson_clocks to get the 38 tok/s figure above. At 7W (battery mode), expect about 18 tok/s on the same workload.
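The power tradeoff in concrete terms, using the estimated rates above (38 tok/s at 25 W, 18 tok/s at 7 W) rather than measured figures:

```python
def gen_seconds(n_tokens, tok_per_s, warmup_s=0.0):
    # Wall-clock time to decode n_tokens at a steady rate, plus any one-time warmup
    return warmup_s + n_tokens / tok_per_s

# A 512-token redaction pass with a cached engine (no JIT cost)
at_25w = gen_seconds(512, 38)                # about 13.5 s
at_7w = gen_seconds(512, 18)                 # about 28.4 s
cold_start = gen_seconds(512, 38, warmup_s=90)  # about 103.5 s on the very first run
```

Dropping to battery mode roughly doubles latency per request; the 90-second JIT dwarfs both on the first run, which is why caching it matters.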


Verify the artifact stayed honest.

Quantization can cost up to 0.05 K-points. Confirm the drop is within your task tolerance before procurement signs off. Run kolm verify on the source .kolm to regenerate the binder, and run kolm bench against a small eval set on the Jetson to compare.

# on the source machine
$ kolm verify phi-redactor.kolm --binder report.html

# on the Jetson, after pulling a small eval set
$ kolm bench phi-redactor.jetson-orin-nano.trt-engine --evals ~/evals/sample.jsonl

For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.
