The Orin Nano 8GB is a 25-watt edge box that runs a 7B int4 artifact at about 38 tok/s once the TensorRT engine is built. The first run pays a JIT cost of roughly 90 seconds for engine compilation; the built engine is then cached, so every run after skips the build. ONNX Runtime CUDA is the fallback path if TensorRT-LLM is not yet available for your JetPack release.
| Output artifact | phi-redactor.jetson-orin-nano.trt-engine |
|---|---|
| Engine size | about 1.4 GB |
| Throughput | 38 tok/s (7B INT4, public benchmarks) |
| K-score (estimated) | 0.88 (drift from int4 quant) |
| Fits in memory | yes (1.4 GB engine + KV cache fits under 8 GB) |
| First-run JIT | about 90 seconds (TensorRT engine build, cached after) |
JetPack 6 ships CUDA 12, cuDNN 9, and TensorRT 10 out of the box. Flash with NVIDIA SDK Manager from a host Linux box, or write the L4T image to a fresh microSD card. After boot, confirm the CUDA toolchain is on PATH.
$ sudo apt update
$ sudo apt install -y python3-pip python3-venv git
$ nvcc --version     # expected: release 12.x, build cuda_12.x
$ nvidia-smi         # expected: Orin (nvgpu) device listed
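It is also worth confirming the rest of the JetPack stack matches what this guide assumes. The package names below are the ones JetPack 6 ships; adjust the grep if your image differs:

$ dpkg -l | grep -E 'nvidia-jetpack|tensorrt|cudnn' | head
# expected: nvidia-jetpack 6.x, tensorrt 10.x, and cudnn 9.x packages listed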
TensorRT-LLM is the NVIDIA-blessed path for LLM inference on Jetson. It compiles a static engine that runs at near-peak Ampere throughput. ONNX Runtime CUDA is the portable fallback if a JetPack release lags behind TensorRT-LLM upstream.
# TensorRT-LLM (preferred)
$ python3 -m venv ~/kolm-trt
$ source ~/kolm-trt/bin/activate
$ pip install --upgrade pip
$ pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# ONNX Runtime CUDA (fallback)
$ pip install onnxruntime-gpu optimum[exporters]
If tensorrt_llm import fails, the JetPack version is older than the wheel. Either upgrade JetPack or use the ONNX fallback. The kolm export command handles both.
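A one-line sanity check per backend tells you which path you are on. These use the standard module names for the two packages; exact versions depend on your JetPack:

$ python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# succeeds only when the wheel matches your JetPack; otherwise use the fallback
$ python3 -c "import onnxruntime; print(onnxruntime.get_device())"
# expected: GPU on a working onnxruntime-gpu install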
Export the .kolm on the source machine. Run this on whatever box you compiled the .kolm on, not the Jetson. The export produces a serialized TensorRT engine targeted at Orin Nano (sm_87). Use --quant int4 for the 8 GB module.
$ kolm export phi-redactor.kolm --device jetson-orin-nano --backend trt-llm --quant int4
# output: phi-redactor.jetson-orin-nano.trt-engine (about 1.4 GB)
The export writes a single engine file, phi-redactor.jetson-orin-nano.trt-engine, plus a small kolm.manifest.json used by verify. Ship both together.
The Jetson Orin Nano DevKit ships with Ubuntu and SSH ready out of the box. Use scp for one-shot transfers, or rsync if the LAN is flaky.
$ scp phi-redactor.jetson-orin-nano.trt-engine nvidia@jetson.local:~/engines/
$ scp kolm.manifest.json nvidia@jetson.local:~/engines/
# resumable alternative:
$ rsync -avP phi-redactor.jetson-orin-nano.trt-engine nvidia@jetson.local:~/engines/
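If you want an integrity check on top of the copy, plain coreutils checksums are enough. This is generic shell, not a kolm feature:

# on the source machine
$ sha256sum phi-redactor.jetson-orin-nano.trt-engine kolm.manifest.json > engines.sha256
$ scp engines.sha256 nvidia@jetson.local:~/engines/
# on the Jetson
$ cd ~/engines && sha256sum -c engines.sha256
# expected: both files report OK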
SSH into the Jetson and invoke kolm run against the engine file. The first run JIT-builds the runtime context, which takes about 90 seconds. After that the engine is cached in ~/.kolm/trt-cache and subsequent runs start instantly.
$ ssh nvidia@jetson.local
$ kolm run phi-redactor.jetson-orin-nano.trt-engine 'MRN 123456 SSN 555-44-3333'
# first run: about 90s for engine warmup, then 38 tok/s steady state
# subsequent runs: instant warmup
For an OpenAI-compatible HTTP server, use the TensorRT-LLM serve harness directly:
$ trtllm-serve --engine_dir ~/engines/phi-redactor.jetson-orin-nano.trt-engine --port 8080
# OpenAI-compatible endpoint at http://jetson.local:8080/v1/chat/completions
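A quick smoke test from any machine on the LAN, assuming the usual OpenAI-style request shape. The model name the server registers for the engine is not fixed here, so list /v1/models first if the request is rejected:

$ curl -s http://jetson.local:8080/v1/models
$ curl -s http://jetson.local:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "phi-redactor", "messages": [{"role": "user", "content": "MRN 123456 SSN 555-44-3333"}]}'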
- Run sudo nvpmodel -m 0 and sudo jetson_clocks to get the 38 tok/s figure above. At 7W (battery mode), expect about 18 tok/s on the same workload.
- Going below int4 (--quant int3) brings K-est to about 0.84, which may fail your ship gate. Verify against the eval pack before going lower than int4.
- The built engine is cached in ~/.kolm/trt-cache, and every subsequent run is instant.
- Watch tegrastats for the GPU clock. If the clock drops below 800 MHz the module is thermal-throttling. Add a fan, or run sudo jetson_clocks to pin the clock at max.
- If the tensorrt_llm wheel does not import, fall back to --backend onnxruntime-cuda. Throughput is about 60 percent of TensorRT-LLM; K-est is identical.

Quantization can cost up to 0.05 K-points. Confirm the drop is within your task tolerance before procurement signs off. Run kolm verify on the source .kolm to regenerate the binder, and run kolm bench against a small eval set on the Jetson to compare.
# on the source machine
$ kolm verify phi-redactor.kolm --binder report.html

# on the Jetson, after pulling a small eval set
$ kolm bench phi-redactor.jetson-orin-nano.trt-engine --evals ~/evals/sample.jsonl
For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.