Apple Silicon quickstart · MLX · unified memory

M3 MacBook Pro: install MLX, export, run.

Apple Silicon runs MLX natively. There is no transfer step in the SSH sense: you export and run on the same machine. An M3 Pro with 18GB of unified memory comfortably hosts a 7B int4 model. An M3 Max with 36GB or 64GB has room for 8B int8 or 13B int4 with a generous context window.

Chip                   M3 / M3 Pro / M3 Max
Unified memory         18 / 36 / 64 / 128 GB
Backend                MLX (mlx_lm)
Throughput (7B int4)   25 to 70 tok/s
Python                 3.10 or later
OS                     macOS 14 (Sonoma) or later

Step 1. Install MLX and mlx-lm.

MLX is Apple's array framework for Apple Silicon. mlx-lm is the LLM-specific layer on top. Both install via pip. Use a venv to avoid clobbering the system Python.

$ python3 -m venv ~/kolm-mlx
$ source ~/kolm-mlx/bin/activate
$ pip install --upgrade pip
$ pip install mlx mlx-lm
# sanity check
$ python -c "import mlx.core as mx; print(mx.default_device())"
# expected: Device(gpu, 0)

If the device check prints cpu, your Python is an x86_64 build running under Rosetta. Reinstall with a native arm64 Python from python.org or Homebrew, then recreate the venv.
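To confirm which build you are running before reinstalling anything, platform.machine() from the standard library reports the interpreter's architecture: arm64 for a native build, x86_64 under Rosetta.

$ python3 -c "import platform; print(platform.machine())"
# arm64  -> native Apple Silicon build
# x86_64 -> running under Rosetta; replace this Python first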

Step 2. Export from your .kolm.

The MLX export packs the base weights, the tokenizer, and the manifest into a directory layout that mlx_lm.generate can load directly. There is nothing to copy to another machine afterwards; the Mac runs against local disk.

$ kolm export your-artifact.kolm \
              --backend mlx \
              --device "M3 Pro MacBook Pro (18GB)" \
              --quant int4 \
              --out ./exports/

The output is a directory at ./exports/your-artifact-int4/ containing the MLX weight shards, tokenizer config, and the kolm manifest. Move it to wherever your project lives.

$ mv ./exports/your-artifact-int4 ~/models/

Step 3. Run on the Mac.

Single-shot generation:

$ python -m mlx_lm.generate \
              --model ~/models/your-artifact-int4 \
              --prompt "Summarize this paragraph in one sentence." \
              --max-tokens 256 \
              --temp 0.2
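If you would rather call the model from a script than the CLI, mlx_lm also exposes a Python API. A minimal sketch using its load and generate helpers; the exact keyword arguments (for instance, how sampling temperature is passed) differ between mlx-lm releases, so check the version you installed:

from pathlib import Path
from mlx_lm import load, generate

# Load the exported weights and tokenizer from the local directory.
model_dir = Path("~/models/your-artifact-int4").expanduser()
model, tokenizer = load(str(model_dir))

# Single-shot generation, mirroring the CLI call above.
text = generate(
    model,
    tokenizer,
    prompt="Summarize this paragraph in one sentence.",
    max_tokens=256,
)
print(text)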

For an OpenAI-compatible HTTP server (handy for hooking into Cursor, an internal app, or any OpenAI client):

$ python -m mlx_lm.server \
              --model ~/models/your-artifact-int4 \
              --port 8080

The server binds to http://127.0.0.1:8080 and exposes /v1/chat/completions. Point your client at that URL and you have a private, on-device endpoint whose traffic never leaves the laptop.
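Any OpenAI-compatible client can hit that endpoint. A sketch with the openai Python package, pointing base_url at the local server; the model name is a placeholder (the server serves whatever it was started with) and the API key only needs to be a non-empty string:

from openai import OpenAI

# Talk to the local mlx_lm server instead of api.openai.com.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="your-artifact-int4",  # placeholder; the local server uses the model it loaded at startup
    messages=[{"role": "user", "content": "Summarize this paragraph in one sentence."}],
    max_tokens=256,
    temperature=0.2,
)
print(resp.choices[0].message.content)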

What fits on each M3 tier.

Unified memory is the unlock. Unlike a discrete GPU, where weights have to cross PCIe into separate VRAM, MLX reads directly from the same memory pool the CPU uses. That is why a 36GB M3 Max can host a model that would need a 24GB-VRAM discrete GPU. The tradeoff is bandwidth: up to 400 GB/s on the M3 Max versus roughly 1 TB/s on an RTX 4090, so steady-state tok/s is lower, but the memory ceiling for the money is much higher.
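A rough rule of thumb for what fits: quantized weights take about parameters × bits / 8 bytes, and the KV cache grows linearly with context length. The sketch below assumes a generic 7B-class decoder (32 layers, 4096 hidden dim, full multi-head attention, fp16 KV cache); models with grouped-query attention need only a fraction of the KV figure:

def model_footprint_gb(params_b, bits, n_layers=32, hidden=4096, kv_bits=16, context=8192):
    """Rough estimate: quantized weights plus fp16 KV cache.

    Ignores activations and framework buffers, which typically add another 1-2 GB.
    """
    weights = params_b * 1e9 * bits / 8                  # weight bytes
    kv_per_token = 2 * n_layers * hidden * kv_bits / 8   # K and V bytes per token
    kv_cache = kv_per_token * context
    return (weights + kv_cache) / 1e9

# 7B at int4 with an 8k context: ~3.5 GB weights + ~4.3 GB KV cache ≈ 7.8 GB,
# which is why it sits comfortably inside 18 GB of unified memory.
print(f"{model_footprint_gb(7, 4):.1f} GB")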

Verify the artifact stayed honest.

Regenerate the binder on the same Mac. Compare K-score on a small eval set against the fp16 reference to confirm quantization loss is acceptable for your task.

$ kolm verify your-artifact.kolm --binder report.html
$ kolm bench your-artifact.kolm --device "M3 Pro" --evals ./your-evals.jsonl

For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.

Next.