Optimized LLM serving recipes for RTX 3090 setups
noonghunna/club-3090
Community recipes for serving LLMs on RTX 3090. Multi-engine (vLLM, llama.cpp, SGLang) and model-agnostic. Currently shipping Qwen3.6-27B configs for 1× and 2× cards.
AI Analysis
Validated Docker configs and benchmarks for running LLMs on single or dual RTX 3090 GPUs.
Built for engineers and homelab enthusiasts looking to deploy high-performance LLMs on consumer hardware.
From the README
club-3090
Recipes for serving LLMs locally on RTX 3090s. Multi-engine (vLLM, llama.cpp, SGLang), multi-model, model-agnostic by design.
If you have one or two RTX 3090s and want to run modern LLMs at home, in a homelab, or as a dev backend, this repo collects the working configs, patches, and benchmarks.
TL;DR: what this is
- Two complementary routes; pick by what your workload breaks on:
  - 🟢 vLLM dual = max throughput. Up to 127 TPS code (DFlash) or 4 concurrent streams @ 262K (turbo). Full feature stack (vision · tools · MTP · streaming).
  - 🟡 llama.cpp single = max robustness. Full 262K context on one 3090. Stress-tested clean: no prefill cliffs, 25K-token tool returns work, 90K needle ladder passes. Slower (~21 TPS) but doesn't crash on real-world tool-using agents.
- Validated docker compose configs for both routes: a drop-in OpenAI-compatible API on localhost:8020 (see the client sketch after this list)
- Multi-engine: vLLM (full features), llama.cpp (max ctx + robustness), SGLang (currently blocked, watch list)
- Model-agnostic: today ships configs for Qwen3.6-27B; structure scales as we add models
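
Once one of the compose stacks is up, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package, assuming the localhost:8020 endpoint from the quick starts; the model id shown is a placeholder, so query /v1/models on your deployment for the real name.

```python
# Minimal client sketch for the drop-in OpenAI-compatible endpoint.
# The base URL matches the compose configs; the model id is a placeholder,
# so list the real one with GET /v1/models first.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8020/v1",
    api_key="none",  # local servers usually ignore the key, but the client requires one
)

stream = client.chat.completions.create(
    model="qwen3.6-27b",  # placeholder id; check /v1/models on your stack
    messages=[{"role": "user", "content": "In one paragraph: what does MTP speculative decoding do?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```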
- First time here? → Models: pick yours.
- Already running and want to compare engines? → docs/engines/
- Hardware questions (does this work on a 4090, do I need NVLink)? → docs/HARDWARE.md
- Don't know what TPS / KV / MTP mean? → docs/GLOSSARY.md
Pick your path
| You have | Start here |
|---|---|
| 1× RTX 3090 | docs/SINGLE_CARD.md → workload → config → quick start |
| 2× RTX 3090 (PCIe / no NVLink) | docs/DUAL_CARD.md → workload → config → quick start |
| Considering self-host vs cloud APIs | docs/COMPARISONS.md → cost crossover + when each wins (worked sketch below) |
Each hardware page lists every supported model with the working compose files for that card count, plus measured TPS and per-workload pitfalls. Model-specific deep dives (quants, Genesis patches, engine internals) live under models/<model>/.
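
To make the cost-crossover idea from docs/COMPARISONS.md concrete, here is a back-of-the-envelope sketch. Every constant is an illustrative assumption (electricity price, hardware cost, cloud rate, total draw), not a number from this repo's analysis; only the 89 TPS and 230 W per-card figures echo the tables elsewhere in this README.

```python
# Back-of-the-envelope self-host vs cloud-API crossover. All constants are
# illustrative assumptions, NOT this repo's benchmarks; docs/COMPARISONS.md
# has the real analysis.

WATTS = 460.0          # assumed total draw: 2x 3090 at the 230 W cap
KWH_PRICE = 0.30       # assumed electricity price, $/kWh
TPS = 89.0             # assumed sustained throughput, tokens/s
CLOUD_PER_MTOK = 2.00  # assumed cloud API price, $ per million tokens
HW_COST = 1400.0       # assumed: two used 3090s, $
HW_LIFE_MONTHS = 24.0  # assumed amortization window

# Electricity cost to generate one million tokens at the assumed TPS.
hours_per_mtok = 1_000_000 / TPS / 3600
local_per_mtok = hours_per_mtok * (WATTS / 1000) * KWH_PRICE

# Crossover volume v (Mtok/month): solve cloud*v = hw/month + local*v.
monthly_hw = HW_COST / HW_LIFE_MONTHS
crossover = monthly_hw / (CLOUD_PER_MTOK - local_per_mtok)

print(f"self-host marginal cost: ${local_per_mtok:.2f}/Mtok")
print(f"crossover: ~{crossover:.0f} Mtok/month; above this, self-hosting wins")
```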
Supported models
| Model | Status | Card counts | Engines | Highlights |
|---|---|---|---|---|
| Qwen3.6-27B | Production-ready ⭐ | 1× / 2× 3090 | vLLM ✅ · llama.cpp ✅ · SGLang ❌ blocked | Vision · tools · MTP n=3 · up to 262K ctx · vLLM dual = 89/127 TPS · llama.cpp single = full 262K, no prefill cliffs |
More models coming. The repo structure scales: when we add Qwen3.5-27B / GLM-4.6 / etc., they go under models/<model>/ with the same internal pattern.
Measured TPS at a glance
Bench protocol: 3 warm + 5 measured runs of the canonical narrative + code prompts. Substrate: vLLM nightly 0.20.1rc1.dev16+g7a1eb8ac2 + Genesis v7.65 dev tip (commit d89a089), llama.cpp mainline 0d0764dfd, RTX 3090 sm_86 PCIe-only at 230 W. Per-config details + run-by-run numbers + VRAM + AL/accept rates: [models/qwen3.6-27b/CHANGELOG.md](models/qwen3.6-27b/CHANGELOG.md)
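
The protocol is straightforward to replay against your own stack. A rough harness sketch, assuming the localhost:8020 endpoint from the quick starts; the prompt and model id below are placeholders, not the repo's canonical narrative/code prompts.

```python
# Rough replay of the 3-warm + 5-measured protocol against a local stack.
# Measures end-to-end tokens/s per request (prefill + decode); the prompt
# and model id are placeholders, not the canonical bench prompts.
import statistics
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="none")

def one_run(prompt: str) -> float:
    """Time a single non-streaming completion and return tokens/s."""
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen3.6-27b",  # placeholder id; check /v1/models
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    return resp.usage.completion_tokens / (time.perf_counter() - t0)

PROMPT = "Explain KV-cache paging to a homelab audience."  # placeholder
for _ in range(3):                 # warm-up runs, discarded
    one_run(PROMPT)
runs = [one_run(PROMPT) for _ in range(5)]
print(f"TPS mean={statistics.mean(runs):.1f} stdev={statistics.stdev(runs):.1f}")
```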