Apple Silicon runs MLX natively. There is no transfer step in the SSH sense: you export and run on the same machine. An M3 Pro with 18GB of unified memory comfortably hosts a 7B model at int4. An M3 Max at 36GB or 64GB has room for 8B int8 or 13B int4 with a generous context window.
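As a rough sanity check on those figures, weight memory scales linearly with parameter count and bit width. A minimal sketch, counting weights only; the KV cache, activations, and framework overhead come on top, which is why the context window eats into the remaining headroom:

```python
# Back-of-the-envelope weight memory for a quantized model.
# Weights only: KV cache and activations add more, so treat
# these as lower bounds on what the machine must hold.
def weight_gb(params_billion: float, bits: int) -> float:
    # 1e9 params * (bits / 8) bytes each = GB (decimal)
    return params_billion * bits / 8

for params, bits in [(7, 4), (8, 8), (13, 4)]:
    print(f"{params}B at int{bits}: ~{weight_gb(params, bits):.1f} GB of weights")
```

On an 18GB machine, 3.5 GB of int4 weights leaves plenty of room for the OS plus a long context; 8.0 GB of int8 weights is why the larger configurations want the 36GB or 64GB M3 Max.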
MLX is Apple's array framework for Apple Silicon. mlx-lm is the LLM-specific layer on top. Both install via pip. Use a venv to avoid clobbering the system Python.
$ python3 -m venv ~/kolm-mlx
$ source ~/kolm-mlx/bin/activate
$ pip install --upgrade pip
$ pip install mlx mlx-lm

# sanity check
$ python -c "import mlx.core as mx; print(mx.default_device())"
# expected: Device(gpu, 0)
If the device check prints cpu, your Python is an x86 build running under Rosetta. Reinstall with a native arm64 Python from python.org or Homebrew.
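A quick way to confirm which architecture your interpreter is before reinstalling anything is to ask Python itself:

```python
import platform

# "arm64" means a native Apple Silicon build; "x86_64" means an Intel
# build running under Rosetta, in which case MLX reports the cpu device.
print(platform.machine())
```

Homebrew installed under /usr/local (rather than /opt/homebrew) is a common source of x86 Pythons on Apple Silicon Macs.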
The MLX export packs the base weights, the tokenizer, and the manifest into a directory layout that mlx_lm.generate can load directly. No copy step is needed afterwards; on Apple Silicon everything runs from local disk on the same machine.
$ kolm export your-artifact.kolm \
    --backend mlx \
    --device "M3 Pro MacBook Pro (18GB)" \
    --quant int4 \
    --out ./exports/
The output is a directory at ./exports/your-artifact-int4/ containing the MLX weight shards, tokenizer config, and the kolm manifest. Move it to wherever your project lives.
$ mv ./exports/your-artifact-int4 ~/models/
Single-shot generation:
$ python -m mlx_lm.generate \
    --model ~/models/your-artifact-int4 \
    --prompt "Summarize this paragraph in one sentence." \
    --max-tokens 256 \
    --temp 0.2
For an OpenAI-compatible HTTP server (handy for hooking into Cursor, an internal app, or any OpenAI client):
$ python -m mlx_lm.server \
    --model ~/models/your-artifact-int4 \
    --port 8080
The server binds to http://127.0.0.1:8080/v1/chat/completions. Point any OpenAI-compatible client at that URL and you have a private, on-device endpoint whose traffic never leaves the laptop.
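As an illustration, a minimal client using only the standard library. The payload follows the OpenAI chat-completions shape the server expects; the model name is illustrative, and the actual HTTP call is left commented out so the snippet stands on its own without a running server:

```python
import json

# OpenAI-style chat-completions payload for the local mlx_lm server.
url = "http://127.0.0.1:8080/v1/chat/completions"
payload = {
    "model": "your-artifact-int4",  # illustrative; a single-model server serves whatever it loaded
    "messages": [
        {"role": "user", "content": "Summarize this paragraph in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}
body = json.dumps(payload).encode()

# With the server running:
# import urllib.request
# req = urllib.request.Request(
#     url, data=body, headers={"Content-Type": "application/json"})
# reply = json.load(urllib.request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
print(url)
```

Because the endpoint speaks the OpenAI wire format, the official openai Python package also works by setting its base_url to http://127.0.0.1:8080/v1.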
Regenerate the binder on the same Mac. Compare K-score on a small eval set against the fp16 reference to confirm quantization loss is acceptable for your task.
$ kolm verify your-artifact.kolm --binder report.html
$ kolm bench your-artifact.kolm --device "M3 Pro" --evals ./your-evals.jsonl
For reviewer-grade evidence, /verify-prod accepts the same .kolm in the browser and runs the same six checks.