vllm-swift: High-performance LLM inference on Apple Silicon
vLLM Metal plugin powered by mlx-swift
A native Swift/Metal backend for vLLM on Apple Silicon. No Python in the inference hot path.
Run vLLM workloads on Apple Silicon with a native Swift/Metal hot path. OpenAI-compatible API. Up to 2.6× faster short-context decode.
Quick Start
1. Install
brew tap TheTom/tap && brew install vllm-swift
Or from source:
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH (generated by install.sh)
2. Run
vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096 # increase as needed, max 40960
Homebrew users don't need `activate.sh`; `vllm-swift serve` handles everything.
Server running at `http://localhost:8000` (OpenAI-compatible API).
Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.
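Once the server is up, any OpenAI-compatible client can talk to it. Here is a minimal sketch using the official `openai` Python package; the base URL assumes vLLM's default bind address, and the model name follows the Quick Start download above:

```python
# Minimal client sketch against the vllm-swift server using the official
# OpenAI Python SDK. base_url assumes vLLM's default address.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-4B-4bit",  # name as served; adjust to your model
    messages=[{"role": "user", "content": "Say hello in five words."}],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```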
Performance (M5 Max 128GB)
Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.
Qwen3-0.6B
| | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|:---:|:---:|:---:|:---:|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |
Qwen3-4B
| | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|:---:|:---:|:---:|:---:|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |
Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
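For context on the methodology, the vllm-metal numbers go through vLLM's offline `LLM` API. A rough sketch of what such a measurement looks like follows; this is not the project's actual benchmark harness, and it folds prefill into the timing, a small error at an 18-token prompt:

```python
# Rough decode-throughput sketch via vLLM's offline LLM API (the path the
# README says the vllm-metal numbers were measured through). NOT the
# project's benchmark harness; warmup and batching details will differ.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="mlx-community/Qwen3-4B-4bit")
params = SamplingParams(temperature=0.0, max_tokens=50)  # greedy, 50 new tokens
prompts = ["Explain KV caching in one short paragraph."] * 8  # 8 concurrent

llm.generate(prompts, params)  # warmup pass
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tok/s aggregate decode throughput")
```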
TurboQuant+ KV Cache Compression
TurboQuant+ compresses the KV cache so longer contexts fit in memory, at a modest throughput cost.
Qwen3.5 2B (4-bit weights)
| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|----------|:-----------:|:----------:|:----------:|:----------:|:----------:|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |
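As a back-of-the-envelope illustration of what those ratios buy, here is a sketch of KV cache sizing; the layer/head dimensions are placeholders, not the model's real configuration:

```python
# Back-of-the-envelope KV cache sizing. Every model dimension here is an
# illustrative placeholder, NOT the real Qwen3.5 2B configuration.
def kv_cache_gib(seq_len, layers=28, kv_heads=8, head_dim=128, compression=1.0):
    # K and V tensors per layer; FP16 baseline = 2 bytes per element
    raw_bytes = 2 * layers * kv_heads * head_dim * seq_len * 2
    return raw_bytes / compression / 1024**3

for name, ratio in [("FP16", 1.0), ("turbo4v2", 3.0), ("turbo3", 4.6)]:
    print(f"{name:9s} 32K-token cache: "
          f"{kv_cache_gib(32_768, compression=ratio):.2f} GiB")
```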
Architecture
The entire forward pass runs in Swift/Metal. Python is used only for orchestration.
Python (vLLM API, tokenization, scheduling) ← github.com/vllm-project/vllm
↓ ctypes FFI
C bridge (bridge.h)
↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
↓
Metal GPU
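The Python side of that boundary can be pictured with a ctypes sketch. The dylib name and the `vs_decode_step` symbol below are hypothetical stand-ins for whatever bridge.h actually exports; they show the shape of the FFI, not the real API:

```python
# Hypothetical sketch of the ctypes FFI layer. The dylib name and the
# vs_decode_step symbol are invented for illustration; the real exported
# functions live in bridge.h and are implemented in Swift via @_cdecl.
import ctypes

lib = ctypes.CDLL("libvllm_swift.dylib")  # hypothetical library name

# int32_t vs_decode_step(const int32_t *tokens, int32_t n, int32_t *next_token);
lib.vs_decode_step.argtypes = [
    ctypes.POINTER(ctypes.c_int32), ctypes.c_int32,
    ctypes.POINTER(ctypes.c_int32),
]
lib.vs_decode_step.restype = ctypes.c_int32

tokens = (ctypes.c_int32 * 3)(1, 2, 3)      # token ids for one request
next_token = ctypes.c_int32()               # out-parameter for the next token
status = lib.vs_decode_step(tokens, 3, ctypes.byref(next_token))
```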
Features
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`)
- Streaming (SSE) responses (see the client sketch after this list)
- Chat templates (applied by vLLM, model-specific)
- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)
- Per-request temperature sampling in the batched path
- Auto model download from HuggingFace Hub
- TurboQuant+ KV cache compression
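As an example of the streaming path, SSE responses look like any other OpenAI-compatible stream from the client side (same assumed address and model as the Quick Start):

```python
# Streaming sketch against the OpenAI-compatible endpoint. base_url assumes
# vLLM's default bind address; the model name matches the Quick Start.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen3-4B-4bit",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,  # server responds with server-sent events (SSE)
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```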