A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.
No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.
- OpenAI-compatible endpoints (`/v1/chat/completions`, streaming, etc.)
- GPU layer control (`--gpu-layers`) and Wisdom Auto-Calibration for squeezing massive models into RAM

SwiftLM implements a hybrid V2+V3 TurboQuant architecture for on-the-fly KV cache compression. At roughly ~3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs FP16 with near-zero accuracy loss.
Recent reproductions of the TurboQuant algorithm (e.g., turboquant-mlx) revealed two distinct paths: a fast V2 path and a higher-quality V3 path built on non-linear codebooks.

We built the "Holy Grail" hybrid: we ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path, and run dequantization natively in fused Metal (ggml-metal) shaders. This achieves V3 quality at V2 speeds, with zero Python overhead.
- K-Cache (3-bit PolarQuant + 1-bit QJL) = 4.25 bits/dim. PolarQuant first projects each key onto the unit sphere (x̂ = x / ‖x‖) before quantizing.
- V-Cache (3-bit PolarQuant, no QJL) = 3.125 bits/dim. Because the V-cache matrix is not used for inner-product attention scoring, the QJL error correction provides no benefit. We cleanly disable QJL for the V-cache, extracting an additional ~25% memory savings without sacrificing quality.
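One consistent way to arrive at those bit budgets is sketched below in plain Python. This is a toy illustration, not the fused Metal kernels: a uniform 8-level codebook stands in for the real Lloyd-Max centroids, and the FP16-scale block sizes (64 dims for K, 128 for V) are assumptions chosen so the arithmetic reproduces the figures quoted above.

```python
import math
import random

# Toy sketch of the two-stage K-cache scheme (NOT the fused Metal path):
# a 3-bit per-coordinate codebook plus a 1-bit QJL-style sign residual.
# The real implementation fits Lloyd-Max centroids to N(0, 1/d); a uniform
# 8-level codebook in [-1, 1] stands in here purely for illustration.
LEVELS = [i / 3.5 - 1.0 for i in range(8)]

def quantize_k(x):
    norm = math.sqrt(sum(v * v for v in x))
    xhat = [v / norm for v in x]  # x̂ = x / ‖x‖ (unit-sphere projection)
    idx = [min(range(8), key=lambda i: abs(v - LEVELS[i])) for v in xhat]
    residual = [v >= LEVELS[i] for v, i in zip(xhat, idx)]  # 1-bit sign correction
    return idx, residual

random.seed(0)
x = [random.gauss(0, 1) for _ in range(128)]
idx, res = quantize_k(x)

# Bit budget per coordinate (block sizes are assumptions, see lead-in):
k_bits = 3 + 1 + 16 / 64   # 3-bit index + 1-bit residual + FP16 scale per 64 dims
v_bits = 3 + 16 / 128      # 3-bit index only + FP16 scale per 128 dims
print(k_bits, v_bits)      # -> 4.25 3.125
```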
Reference implementations: turboquant-mlx | turboquant_plus | Paper: TurboQuant, arXiv:2504.19874
To reliably run massive 122B-parameter MoE models via SSD streaming, SwiftLM was designed and benchmarked natively on the following hardware:
⚠️ Quantization Disclaimer: While heavier quantization shrinks the required memory footprint, 4-bit quantization remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars, routinely producing broken keys like `\name\` instead of `"name"`, which systematically breaks OpenAI-compatible tool calling.
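The failure mode is easy to reproduce with a strict parser: OpenAI-style clients feed tool-call arguments straight into a JSON decoder, so a single corrupted key rejects the whole call. A minimal demonstration (the `unlock_door` tool name is a made-up example):

```python
import json

# Why 2-bit-induced key corruption breaks tool calling: clients parse the
# model's arguments with a strict JSON decoder, which rejects unquoted keys.
well_formed = '{"name": "unlock_door", "confirm": true}'
corrupted = r'{\name\: "unlock_door", "confirm": true}'  # the corruption described above

print(json.loads(well_formed)["name"])  # -> unlock_door

try:
    json.loads(corrupted)
except json.JSONDecodeError:
    print("tool call rejected")         # -> tool call rejected
```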
A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.
Download any mlx-community model by name.

```shell
cd SwiftLMChat
python3 generate_xcodeproj.py   # Generates SwiftLMChat.xcodeproj
open SwiftLMChat.xcodeproj
```
Then in Xcode:
Note for contributors: The `.xcodeproj` is git-ignored (it contains your personal Team ID). Run `generate_xcodeproj.py` after cloning to regenerate it locally. Your Team ID is never committed.
Download the latest release tarball from the Releases page. The archive is self-contained — default.metallib is bundled alongside the binary.
```shell
tar -xzf SwiftLM-<version>-macos-arm64.tar.gz
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
```
⚠️ Metal GPU Error? If you see `Failed to load the default metallib`, it means `default.metallib` is missing from the directory you are running `SwiftLM` from. Make sure you run the binary from the extracted folder, and do not move the binary without also moving `default.metallib` alongside it.
```shell
git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
swift build -c release
```
default.metallib is a pre-built artifact inside the mlx-swift submodule, version-matched to the Swift binary. Copy it next to the binary before running:
```shell
cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
   .build/release/
```
```shell
.build/release/SwiftLM \
  --model mlx-community/Qwen3.5-122B-A10B-4bit \
  --stream-experts \
  --port 5413
```
⚠️ Do NOT use Python's `mlx-metal` package as a source for `mlx.metallib`. While `uv run --with mlx-metal python -c "...shutil.copy(metallib, ...)"` will get the server to start, the pip `mlx-metal` package is a different version of MLX than what this binary was compiled against. The version mismatch causes GPU kernel ABI corruption during inference, producing a `freed pointer was not the last allocation` crash. Always use the metallib from `LocalPackages/mlx-swift/`; it is the only version-matched artifact for this build.
(Add `--stream-experts` when running oversized MoE models like Qwen3.5 122B to bypass macOS virtual-memory swapping and stream expert layers directly from NVMe.)
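The idea behind expert streaming can be illustrated with a small sketch. This is not SwiftLM's code (the real path is native Swift over NVMe): it memory-maps a fake expert weight file and touches only the byte ranges of the experts the router selects, so the OS pages in just those blocks instead of the whole MoE tensor.

```python
import mmap
import os
import tempfile

# Illustration of the --stream-experts idea (NOT SwiftLM's implementation):
# keep expert weights on disk and mmap only the routed experts per token.
EXPERTS, EXPERT_BYTES = 8, 16

# Build a fake weight file: 8 experts, 16 bytes each, expert e filled with byte e.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
with open(path, "wb") as f:
    for e in range(EXPERTS):
        f.write(bytes([e]) * EXPERT_BYTES)

def load_expert(mm, e):
    # Slicing the mmap touches only this expert's byte range on disk.
    return mm[e * EXPERT_BYTES:(e + 1) * EXPERT_BYTES]

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    routed = [2, 5]  # experts chosen by the router for this token (example)
    blocks = [load_expert(mm, e) for e in routed]
    print([b[0] for b in blocks])  # -> [2, 5]
```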
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Server health + loaded model capabilities |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (LLM and VLM support, multi-turn, system prompts) |
Drop-in compatible with standard OpenAI HTTP consumers:
```shell
curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-122B-A10B-4bit",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are Aegis-AI, a local home security agent. Output strictly in JSON format."},
      {"role": "user", "content": "Clip 1: Delivery person drops package at 14:02. Clip 2: Delivery person walks away down driveway at 14:03. Do these clips represent the same security event? Output a JSON object with a `duplicate` boolean and a `reason` string."}
    ]
  }'
```
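With `"stream": true`, the response arrives as server-sent events whose payloads follow the OpenAI `chat.completion.chunk` schema. A minimal client-side sketch of reassembling the streamed content (the sample lines are synthetic, not captured server output):

```python
import json

# Reassemble streamed content from OpenAI-style SSE lines.
def collect_stream(lines):
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":          # stream terminator
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Synthetic sample of what a streamed JSON answer could look like on the wire.
sample = [
    'data: {"choices":[{"delta":{"content":"{\\"duplicate\\": true"}}]}',
    'data: {"choices":[{"delta":{"content":", \\"reason\\": \\"same event\\"}"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # -> {"duplicate": true, "reason": "same event"}
```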
| Option | Default | Description |
|---|---|---|
| `--model` | (required) | HuggingFace model ID or local path |
| `--port` | `5413` | Port to listen on |
| `--host` | `127.0.0.1` | Host to bind |
| `--max-tokens` | `2048` | Max tokens limit per generation |
| `--gpu-layers` | model default | Restrict the number of layers allocated to the GPU |
| `--stream-experts` | `false` | Enable experimental SSD streaming for MoE model expert matrices |
(If the Metal toolchain is missing, install it with `xcodebuild -downloadComponent MetalToolchain`.)

Built entirely on the hard work of the Apple MLX community.
The TurboQuant KV cache compression implemented in SwiftLM is directly based on the following open-source work and research:
TheTom/llama-cpp-turboquant — The primary reference for the C and Metal GPU implementation. The turbo-wht.h Fast Walsh-Hadamard kernel, WHT sign arrays (seed=42), Lloyd-Max centroid tables, and the ggml-turbo-quant.c quantize/dequantize logic were ported directly from this repository into our MLX C++ and Metal backend.
TheTom/turboquant_plus — Python reference implementation used to validate the algorithm math, codebook construction (Lloyd's algorithm for N(0, 1/d)), and KV cache integration design.
TurboQuant Paper — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", Zandieh et al., AISTATS/ICLR 2026. The two-stage PolarQuant + QJL algorithm described in Section 3 and Appendix A is the mathematical foundation of this implementation.
amirzandieh/QJL — Original Quantized Johnson-Lindenstrauss (QJL) 1-bit residual correction implementation by the paper authors.
MIT License