Apple Silicon quickstart · MLX · unified memory

M3 MacBook Pro: install MLX, export, run.

Apple Silicon runs MLX natively. There is no transfer step in the SSH sense: you export and run on the same machine. An M3 Pro with 18GB of unified memory comfortably hosts a 7B int4 model. An M3 Max with 36GB or 64GB has room for 8B int8 or 13B int4 with a generous context window.

Chip                   M3 / M3 Pro / M3 Max
Unified memory         18 / 36 / 64 / 128 GB
Backend                MLX (mlx_lm)
Throughput (7B int4)   25 to 70 tok/s
Python                 3.10 or later
OS                     macOS 14 (Sonoma) or later

Step 1. Install MLX and mlx-lm.

MLX is Apple's array framework for Apple Silicon. mlx-lm is the LLM-specific layer on top. Both install via pip. Use a venv to avoid clobbering the system Python.

$ python3 -m venv ~/kolm-mlx
$ source ~/kolm-mlx/bin/activate
$ pip install --upgrade pip
$ pip install mlx mlx-lm
# sanity check
$ python -c "import mlx.core as mx; print(mx.default_device())"
# expected: Device(gpu, 0)

If the device check prints cpu, your Python is an x86_64 build running under Rosetta. Reinstall with a native arm64 Python from python.org or Homebrew, then recreate the venv.
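To confirm which build you are running before reinstalling anything, platform.machine() from the standard library reports the interpreter's architecture: arm64 for a native build, x86_64 under Rosetta.

$ python3 -c "import platform; print(platform.machine())"
# arm64  -> native Apple Silicon build
# x86_64 -> running under Rosetta; replace this Python first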

Step 2. Export from your .kolm.

The MLX export packs the base weights, the tokenizer, and the manifest into a directory layout that mlx_lm.generate can load directly. There is nothing to copy to another machine afterwards; the Mac runs against local disk.

$ kolm export your-artifact.kolm \
              --backend mlx \
              --device "M3 Pro MacBook Pro (18GB)" \
              --quant int4 \
              --out ./exports/

The output is a directory at ./exports/your-artifact-int4/ containing the MLX weight shards, tokenizer config, and the kolm manifest. Move it to wherever your project lives.

$ mv ./exports/your-artifact-int4 ~/models/

Step 3. Run on the Mac.

Single-shot generation:

$ python -m mlx_lm.generate \
              --model ~/models/your-artifact-int4 \
              --prompt "Summarize this paragraph in one sentence." \
              --max-tokens 256 \
              --temp 0.2
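If you would rather call the model from a script than the CLI, mlx_lm also exposes a Python API. A minimal sketch using its load and generate helpers; the exact keyword arguments (for instance, how sampling temperature is passed) differ between mlx-lm releases, so check the version you installed:

from pathlib import Path
from mlx_lm import load, generate

# Load the exported weights and tokenizer from the local directory.
model_dir = Path("~/models/your-artifact-int4").expanduser()
model, tokenizer = load(str(model_dir))

# Single-shot generation, mirroring the CLI call above.
text = generate(
    model,
    tokenizer,
    prompt="Summarize this paragraph in one sentence.",
    max_tokens=256,
)
print(text)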

For an OpenAI-compatible HTTP server (handy for hooking into Cursor, an internal app, or any OpenAI client):

$ python -m mlx_lm.server \
              --model ~/models/your-artifact-int4 \
              --port 8080

The server binds to http://127.0.0.1:8080 and exposes /v1/chat/completions. Point your client at that URL and you have a private, on-device endpoint whose traffic never leaves the laptop.
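Any OpenAI-compatible client can hit that endpoint. A sketch with the openai Python package, pointing base_url at the local server; the model name is a placeholder (the server serves whatever it was started with) and the API key only needs to be a non-empty string:

from openai import OpenAI

# Talk to the local mlx_lm server instead of api.openai.com.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="your-artifact-int4",  # placeholder; the local server uses the model it loaded at startup
    messages=[{"role": "user", "content": "Summarize this paragraph in one sentence."}],
    max_tokens=256,
    temperature=0.2,
)
print(resp.choices[0].message.content)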

What fits on each M3 tier.

Unified memory is the unlock. Unlike a discrete GPU, where weights have to cross PCIe into separate VRAM, MLX reads directly from the same memory pool the CPU uses. That is why a 36GB M3 Max can host a model that would need a 24GB-VRAM discrete GPU. The tradeoff is bandwidth: up to 400 GB/s on the M3 Max versus roughly 1 TB/s on an RTX 4090, so steady-state tok/s is lower, but the memory ceiling for the money is much higher.
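A rough rule of thumb for what fits: quantized weights take about parameters × bits / 8 bytes, and the KV cache grows linearly with context length. The sketch below assumes a generic 7B-class decoder (32 layers, 4096 hidden dim, full multi-head attention, fp16 KV cache); models with grouped-query attention need only a fraction of the KV figure:

def model_footprint_gb(params_b, bits, n_layers=32, hidden=4096, kv_bits=16, context=8192):
    """Rough estimate: quantized weights plus fp16 KV cache.

    Ignores activations and framework buffers, which typically add another 1-2 GB.
    """
    weights = params_b * 1e9 * bits / 8                  # weight bytes
    kv_per_token = 2 * n_layers * hidden * kv_bits / 8   # K and V bytes per token
    kv_cache = kv_per_token * context
    return (weights + kv_cache) / 1e9

# 7B at int4 with an 8k context: ~3.5 GB weights + ~4.3 GB KV cache ≈ 7.8 GB,
# which is why it sits comfortably inside 18 GB of unified memory.
print(f"{model_footprint_gb(7, 4):.1f} GB")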

Verify the artifact stayed honest.

Regenerate the binder on the same Mac. Compare K-score on a small eval set against the fp16 reference to confirm quantization loss is acceptable for your task.

$ kolm verify your-artifact.kolm --binder report.html
$ kolm bench your-artifact.kolm --device "M3 Pro" --evals ./your-evals.jsonl

For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.

Next.