# rvLLM: High-performance LLM inference in Rust
A from-scratch Rust rewrite of vLLM -- the most popular open-source LLM serving engine. A drop-in replacement that serves the same OpenAI-compatible API with dramatically better resource efficiency.

50 CUDA kernels. A Rust PTX compiler whose fused JIT kernels run 2-7.5x faster than our hand-written nvcc-compiled CUDA. cuBLAS autotuning. CUDA graph replay. FP8 inference. 20x faster startup. 31x smaller binary.
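Because the server speaks the same OpenAI-compatible API, existing clients can point at it unchanged. A minimal sketch of a completion request in Rust; the port, route, model id, and the `reqwest`/`serde_json` dependencies are illustrative assumptions about a local setup, not part of rvLLM itself (`reqwest` needs its `blocking` and `json` features):

```rust
// Hit a local rvLLM server over the OpenAI-style /v1/completions route.
// Assumes the server listens on localhost:8000; adjust to your launch flags.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let body = json!({
        "model": "Qwen/Qwen2.5-7B",   // whatever model the server loaded
        "prompt": "Explain paged KV caching in one sentence.",
        "max_tokens": 64,
        "temperature": 0.0
    });

    // Same request shape an OpenAI or vLLM client would send.
    let resp: serde_json::Value = client
        .post("http://localhost:8000/v1/completions")
        .json(&body)
        .send()?
        .json()?;

    // The response follows the OpenAI schema: choices[0].text is the completion.
    println!("{}", resp["choices"][0]["text"]);
    Ok(())
}
```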
## rvLLM vs Python vLLM -- Head-to-Head
All measurements on H100 SXM 80GB, Qwen2.5-7B f16, separate GPU instances per engine. No cherry-picking -- same model, same hardware, same prompts.
### Throughput
| Metric | rvLLM | Python vLLM 0.18 | Ratio |
|---|---:|---:|---|
| Direct engine tok/s (N=128) | 12,607 | 14,962 | 0.84x |
| Direct engine tok/s (N=64) | 7,280 | 8,807 | 0.83x |
| Direct engine tok/s (N=16) | 2,058 | 2,524 | 0.82x |
| Direct engine tok/s (N=1) | 108 | 169 | 0.64x |
## JIT Compiler: Our Fused Kernels vs Hand-Written CUDA
rvLLM includes a Rust-native PTX compiler that generates fused GPU kernels at model load time. These JIT kernels are 2-7.5x faster than our hand-written nvcc-compiled CUDA on H100:
| Fused Kernel | JIT (µs) | Hand-written (µs) | Speedup |
|---|---:|---:|---|
| Add+RMSNorm+QKV GEMV [1,4608,3584] | 5.5 | 10.6 | 1.92x |
| Add+RMSNorm+GateUp GEMV [1,37888,3584] | 19.3 | 98.6 | 5.12x |
| SiLU*Mul+Down GEMV [1,3584,18944] | 9.5 | 70.7 | 7.48x |
| RMSNorm+QKV GEMV [1,4608,3584] | 5.3 | 10.8 | 2.03x |
The JIT compiler (`crates/rvllm-fusion/src/ptx_emit.rs`) emits PTX directly from Rust -- no nvcc, no Python, no Triton dependency. It generates shape-specialized kernels with vectorized loads, warp shuffle reductions, and shared memory tiling tuned for the specific model dimensions.
Per-step savings at N=1 (28 layers): 4.2 ms per decode step, for an estimated 1.8x single-sequence speedup. The figure follows from the table above: the three fused ops save roughly 146 µs per layer, and 28 layers of that is about 4.1 ms.
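To make "emits PTX directly from Rust" concrete, here is a deliberately tiny sketch of the pattern, assuming nothing from `ptx_emit.rs` itself: the kernel dimensions are baked into the PTX text as immediates rather than passed as arguments, and the resulting string is what would be handed to the CUDA driver as a module at model load time. None of the real emitter's vectorized loads, shuffle reductions, or shared-memory tiling appears here; this just scales a fixed-length vector:

```rust
// Sketch of shape-specialized PTX emission from Rust (illustrative only; the
// fused kernels measured above are far more involved). The element count is
// substituted into the PTX as an immediate, so the bounds check needs no
// kernel argument.

/// Emit PTX for `out[i] = in[i] * scale`, specialized for a fixed length `n`.
fn emit_scale_ptx(n: u32) -> String {
    const TEMPLATE: &str = r#"
.version 8.0
.target sm_90
.address_size 64

.visible .entry scale_f32(
    .param .u64 p_out,
    .param .u64 p_in,
    .param .f32 p_scale
)
{
    .reg .pred %p<2>;
    .reg .f32  %f<4>;
    .reg .b32  %r<6>;
    .reg .b64  %rd<8>;

    ld.param.u64 %rd1, [p_out];
    ld.param.u64 %rd2, [p_in];
    ld.param.f32 %f1,  [p_scale];

    // global thread index = ctaid.x * ntid.x + tid.x
    mov.u32      %r1, %ctaid.x;
    mov.u32      %r2, %ntid.x;
    mov.u32      %r3, %tid.x;
    mad.lo.s32   %r4, %r1, %r2, %r3;

    // Shape-specialized bound: the count is an immediate, not an argument.
    setp.ge.s32  %p1, %r4, N_ELEMS;
    @%p1 bra     DONE;

    cvta.to.global.u64 %rd3, %rd2;
    cvta.to.global.u64 %rd4, %rd1;
    mul.wide.s32 %rd5, %r4, 4;
    add.s64      %rd6, %rd3, %rd5;
    add.s64      %rd7, %rd4, %rd5;

    ld.global.f32 %f2, [%rd6];
    mul.f32       %f3, %f2, %f1;
    st.global.f32 [%rd7], %f3;

DONE:
    ret;
}
"#;
    TEMPLATE.replace("N_ELEMS", &n.to_string())
}

fn main() {
    // In the engine the string would be loaded as a CUDA module at startup;
    // here we just print it.
    println!("{}", emit_scale_ptx(3584));
}
```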
## Efficiency
| Metric | rvLLM | Python vLLM 0.18 | Winner |
|---|---:|---:|---|
| Cold start to first token | 6 sec | ~120 sec | rvLLM 20x |
| Binary size | 16 MB | ~500 MB | rvLLM 31x |
| CPU memory at steady state | 348 MB | ~1 GB | rvLLM 3x |
| Dependencies | 0 (static binary) | PyTorch + 500MB | rvLLM |
| P95 latency spread | 34 ms (1.4%) | 190 ms (12%) | rvLLM 5.6x tighter |
| CUDA graph capture | 1.7 sec (35 sizes) | ~60 sec (torch.compile) | rvLLM 35x |
| cuBLAS autotuning | 170 ms (6 shapes) | ~60 sec (torch.compile) | rvLLM 350x |
No Python interpreter, no GIL, no garbage collector, no PyTorch tensor allocation. rvLLM's P95 tail is 5.6x tighter than vLLM's because there are no GC pauses, no JIT recompilations, no Python object churn.
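The cuBLAS autotuning row above amounts to timing candidate GEMM algorithms for each decode shape once at startup and caching the winner. A generic, CPU-only sketch of that shape-keyed pattern; rvLLM's actual autotuner works against cublasLt on the GPU, so the closures, shapes, and best-of-three timing policy below are illustrative assumptions:

```rust
use std::collections::HashMap;
use std::time::Instant;

/// GEMM problem shape used as the cache key: (m, n, k).
type Shape = (usize, usize, usize);

/// Run every candidate a few times for one shape and return the fastest.
/// In the real engine the candidates would be cublasLt algorithm configs;
/// wall-clock timing of plain functions keeps the sketch self-contained.
fn autotune(shape: Shape, candidates: &[(&'static str, fn(Shape))]) -> &'static str {
    let mut best = (candidates[0].0, f64::INFINITY);
    for (name, run) in candidates {
        run(shape); // warm-up
        let mut fastest = f64::INFINITY;
        for _ in 0..3 {
            let start = Instant::now();
            run(shape);
            fastest = fastest.min(start.elapsed().as_secs_f64());
        }
        if fastest < best.1 {
            best = (*name, fastest);
        }
    }
    best.0
}

fn main() {
    // Hypothetical stand-ins for two GEMM algorithm variants.
    let candidates: [(&'static str, fn(Shape)); 2] = [
        ("algo_a", |(m, n, k): Shape| { let _ = m * n * k; }),
        ("algo_b", |(m, n, k): Shape| { let _ = (m + n) * k; }),
    ];

    // One tuning pass per distinct decode shape, cached for the engine lifetime.
    let mut best_algo: HashMap<Shape, &'static str> = HashMap::new();
    for shape in [(1, 4608, 3584), (1, 37888, 3584), (1, 3584, 18944)] {
        best_algo.insert(shape, autotune(shape, &candidates));
    }
    println!("{best_algo:?}");
}
```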
## Resource Usage (Qwen2.5-7B f16, H100 80GB)
| Metric | rvLLM | Python vLLM 0.18 |
|---|---:|---:|
| Model weight VRAM | 14.0 GB | 14.0 GB |
| KV cache VRAM (0.9 util) | 48.5 GB | ~50 GB |
| Peak GPU memory | 66.5 GB | ~72 GB |
| FP8 weight support | Yes (cublasLt) | Y