
vllm-swift: High-performance LLM inference on Apple Silicon

TheTom/vllm-swift

vLLM Metal plugin powered by mlx-swift — high-performance LLM inference on Apple Silicon

AI Analysis

A native Swift/Metal backend for vLLM that removes Python from the inference hot path.

Built for engineers building local LLM applications on macOS who need to maximize inference throughput.

From the README

A native Swift/Metal backend for vLLM on Apple Silicon. No Python in the inference hot path.

Run vLLM workloads on Apple Silicon with a native Swift/Metal hot path. OpenAI-compatible API. Up to 2.6× faster short-context decode.

Quick Start

1. Install

brew tap TheTom/tap && brew install vllm-swift

Or from source:

git clone  && cd vllm-swift
./scripts/install.sh       # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)

2. Run

vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096  # increase as needed, max 40960

Homebrew users don't need activate.sh; vllm-swift serve handles everything.

The server is now running and exposes an OpenAI-compatible API.

Drop-in replacement for vLLM on Apple Silicon. All vllm serve flags work unchanged.
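
Because the API is OpenAI-compatible, any OpenAI client can talk to it. Below is a minimal sketch using the official `openai` Python package; it assumes vLLM's default port of 8000 (adjust `base_url` if you pass `--port`) and the Qwen3-4B model from the Quick Start.

```python
# pip install openai
from openai import OpenAI

# base_url assumes vLLM's default port 8000; adjust if you pass --port to vllm-swift serve.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-4B-4bit",  # served model name; query /v1/models if unsure
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
    temperature=0.0,
    max_tokens=64,
)
print(resp.choices[0].message.content)

# Streaming goes over the same SSE path:
stream = client.chat.completions.create(
    model="Qwen3-4B-4bit",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```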

Performance (M5 Max 128GB)

Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.

Qwen3-0.6B

| | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|:---:|:---:|:---:|:---:|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |

Qwen3-4B

| | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|:---:|:---:|:---:|:---:|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |

Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
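
A measurement along the lines described above can be reproduced with vLLM's offline `LLM` API. The sketch below is an assumed harness, not the project's benchmark script: prompt text, request count, and model path are placeholders, and it assumes the plugin is already installed so vLLM runs on the Metal backend.

```python
# Rough offline decode-throughput sketch using vLLM's offline LLM API
# (assumed harness; prompt, request count, and model path are placeholders).
import os
import time

from vllm import LLM, SamplingParams

llm = LLM(model=os.path.expanduser("~/models/Qwen3-4B-4bit"), max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=50)  # greedy, 50 generated tokens

prompts = ["Write a short haiku about Metal shaders on Apple Silicon."] * 64  # 64 concurrent
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"decode throughput: {generated / elapsed:.0f} tok/s")
```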

TurboQuant+ KV Cache Compression

TurboQuant+ compresses the KV cache so longer contexts fit in memory, at a modest throughput cost.

Qwen3.5 2B (4-bit weights)

| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|----------|:-----------:|:-----------:|:----------:|:-----------:|:----------:|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |
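
For intuition on where compression ratios like those in the table come from (the TurboQuant+ algorithm itself is not described in this README), here is a generic per-group 4-bit KV quantization sketch and the memory it saves.

```python
# Generic per-group 4-bit KV quantization (NOT the TurboQuant+ algorithm, which
# this README does not describe) - just an illustration of the memory savings.
import numpy as np

def quantize_int4(x: np.ndarray, group_size: int = 32):
    """4-bit codes plus one fp16 scale/zero-point per group along the last axis."""
    g = x.astype(np.float32).reshape(*x.shape[:-1], -1, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8
    codes = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale.astype(np.float16), lo.astype(np.float16)

# One layer's K cache: (batch, heads, seq_len, head_dim), stored in fp16.
kv = np.random.randn(1, 8, 4096, 128).astype(np.float16)
codes, scale, zero = quantize_int4(kv)

fp16_bytes = kv.nbytes
packed_bytes = codes.size // 2 + scale.nbytes + zero.nbytes  # two 4-bit codes per byte
print(f"compression ~{fp16_bytes / packed_bytes:.1f}x")  # ~3.2x for these shapes
```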

Architecture

The entire forward pass runs in Swift/Metal. Python is used only for orchestration.

Python (vLLM API, tokenization, scheduling)  ← github.com/vllm-project/vllm
  ↓ ctypes FFI
C bridge (bridge.h)
  ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
  ↓
Metal GPU
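
From the Python side, that boundary is a plain C ABI loaded with ctypes. The sketch below is schematic only: the dylib name and exported symbols are illustrative placeholders, not the plugin's actual bridge.h interface.

```python
# Schematic view of the ctypes boundary from the Python side. The dylib name and
# exported symbols are illustrative placeholders, not the real bridge.h interface.
import ctypes

# Located via DYLD_LIBRARY_PATH (set by activate.sh for source installs).
bridge = ctypes.CDLL("libvllm_swift_bridge.dylib")

# Hypothetical C signatures that Swift would expose via @_cdecl:
#   void* engine_create(const char* model_path);
#   int   engine_decode_step(void* engine, const int32_t* tokens, int32_t n, int32_t* next);
bridge.engine_create.restype = ctypes.c_void_p
bridge.engine_create.argtypes = [ctypes.c_char_p]
bridge.engine_decode_step.restype = ctypes.c_int
bridge.engine_decode_step.argtypes = [
    ctypes.c_void_p,
    ctypes.POINTER(ctypes.c_int32),
    ctypes.c_int32,
    ctypes.POINTER(ctypes.c_int32),
]

engine = bridge.engine_create(b"/path/to/Qwen3-4B-4bit")
tokens = (ctypes.c_int32 * 3)(1, 2, 3)  # placeholder token ids
next_token = ctypes.c_int32()
bridge.engine_decode_step(engine, tokens, 3, ctypes.byref(next_token))
print(next_token.value)
```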

Features

  • OpenAI-compatible API (/v1/completions, /v1/chat/completions)
  • Streaming (SSE) responses
  • Chat templates (applied by vLLM, model-specific)
  • Batched concurrent decode with BatchedKVCache (fully batched projections + attention)
  • Per-request temperature sampling in batched path
  • Auto model download from HuggingFace Hub
  • TurboQuant+ KV cache compression