Optimized LLM serving recipes for RTX 3090 setups
noonghunna/club-3090
Community recipes for serving LLMs on RTX 3090. Multi-engine (vLLM, llama.cpp, SGLang) and model-agnostic. Currently shipping Qwen3.6-27B configs for 1× and 2× cards.
AI Analysis
Validated Docker configs and benchmarks for running LLMs on single or dual RTX 3090 GPUs.
Built for engineers and homelab enthusiasts looking to deploy high-performance LLMs on consumer hardware.
From the README
club-3090
Recipes for serving LLMs locally on RTX 3090s. Multi-engine (vLLM, llama.cpp, SGLang), multi-model, model-agnostic by design.
If you have one or two RTX 3090s and want to run modern LLMs at home, in a homelab, or as a dev backend, this repo collects the working configs, patches, and benchmarks.
TL;DR: what this is
- Two complementary routes; pick by what your workload breaks on:
  - 🟢 vLLM dual = max throughput. Up to 127 TPS code (DFlash) or 4 concurrent streams @ 262K (turbo). Full feature stack (vision · tools · MTP · streaming).
  - 🟡 llama.cpp single = max robustness. Full 262K context on one 3090. Stress-tested clean: no prefill cliffs, 25K-token tool returns work, 90K needle ladder passes. Slower (~21 TPS) but doesn't crash on real-world tool-using agents.
- Validated docker compose configs for both routes: a drop-in OpenAI-compatible API on localhost:8020 (see the client sketch after this list)
- Multi-engine: vLLM (full features), llama.cpp (max ctx + robustness), SGLang (currently blocked, watch list)
- Model-agnostic: today ships configs for Qwen3.6-27B; structure scales as we add models
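
Once one of the compose stacks is up, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package, assuming the localhost:8020 endpoint from the quick starts; the model id shown is a placeholder, so query /v1/models on your deployment for the real name.

```python
# Minimal client sketch for the drop-in OpenAI-compatible endpoint.
# The base URL matches the compose configs; the model id is a placeholder,
# so list the real one with GET /v1/models first.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8020/v1",
    api_key="none",  # local servers usually ignore the key, but the client requires one
)

stream = client.chat.completions.create(
    model="qwen3.6-27b",  # placeholder id; check /v1/models on your stack
    messages=[{"role": "user", "content": "In one paragraph: what does MTP speculative decoding do?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```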
- First time here? → Models: pick yours.
- Already running and want to compare engines? → docs/engines/
- Hardware questions (does this work on a 4090, do I need NVLink)? → docs/HARDWARE.md
- Don't know what TPS / KV / MTP mean? → docs/GLOSSARY.md
Pick your path
| You have | Start here |
|---|---|
| 1× RTX 3090 | docs/SINGLE_CARD.md → workload → config → quick start |
| 2× RTX 3090 (PCIe / no NVLink) | docs/DUAL_CARD.md → workload → config → quick start |
| Considering self-host vs cloud APIs | docs/COMPARISONS.md → cost crossover + when each wins (worked sketch below) |
Each hardware page lists every supported model with the working compose files for that card count, plus measured TPS and per-workload pitfalls. Model-specific deep dives (quants, Genesis patches, engine internals) live under models/<model>/.
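
To make the cost-crossover idea from docs/COMPARISONS.md concrete, here is a back-of-the-envelope sketch. Every constant is an illustrative assumption (electricity price, hardware cost, cloud rate, total draw), not a number from this repo's analysis; only the 89 TPS and 230 W per-card figures echo the tables elsewhere in this README.

```python
# Back-of-the-envelope self-host vs cloud-API crossover. All constants are
# illustrative assumptions, NOT this repo's benchmarks; docs/COMPARISONS.md
# has the real analysis.

WATTS = 460.0          # assumed total draw: 2x 3090 at the 230 W cap
KWH_PRICE = 0.30       # assumed electricity price, $/kWh
TPS = 89.0             # assumed sustained throughput, tokens/s
CLOUD_PER_MTOK = 2.00  # assumed cloud API price, $ per million tokens
HW_COST = 1400.0       # assumed: two used 3090s, $
HW_LIFE_MONTHS = 24.0  # assumed amortization window

# Electricity cost to generate one million tokens at the assumed TPS.
hours_per_mtok = 1_000_000 / TPS / 3600
local_per_mtok = hours_per_mtok * (WATTS / 1000) * KWH_PRICE

# Crossover volume v (Mtok/month): solve cloud*v = hw/month + local*v.
monthly_hw = HW_COST / HW_LIFE_MONTHS
crossover = monthly_hw / (CLOUD_PER_MTOK - local_per_mtok)

print(f"self-host marginal cost: ${local_per_mtok:.2f}/Mtok")
print(f"crossover: ~{crossover:.0f} Mtok/month; above this, self-hosting wins")
```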
Supported models
| Model | Status | Card counts | Engines | Highlights |
|---|---|---|---|---|
| Qwen3.6-27B | Production-ready ⭐ | 1× / 2× 3090 | vLLM ✅ · llama.cpp ✅ · SGLang ❌ blocked | Vision · tools · MTP n=3 · up to 262K ctx · vLLM dual = 89/127 TPS · llama.cpp single = full 262K, no prefill cliffs |
More models coming. The repo structure scales: when we add Qwen3.5-27B / GLM-4.6 / etc., they go under models/<model>/ with the same internal pattern.
Measured TPS at a glance
Bench protocol: 3 warm + 5 measured runs of the canonical narrative + code prompts. Substrate: vLLM nightly 0.20.1rc1.dev16+g7a1eb8ac2 + Genesis v7.65 dev tip (commit d89a089), llama.cpp mainline 0d0764dfd, RTX 3090 sm_86 PCIe-only at 230 W. Per-config details + run-by-run numbers + VRAM + AL/accept rates: [models/qwen3.6-27b/CHANGELOG.md](models/qwen3.6-27b/CHANGELOG.md)
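
The protocol is straightforward to replay against your own stack. A rough harness sketch, assuming the localhost:8020 endpoint from the quick starts; the prompt and model id below are placeholders, not the repo's canonical narrative/code prompts.

```python
# Rough replay of the 3-warm + 5-measured protocol against a local stack.
# Measures end-to-end tokens/s per request (prefill + decode); the prompt
# and model id are placeholders, not the canonical bench prompts.
import statistics
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="none")

def one_run(prompt: str) -> float:
    """Time a single non-streaming completion and return tokens/s."""
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen3.6-27b",  # placeholder id; check /v1/models
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    return resp.usage.completion_tokens / (time.perf_counter() - t0)

PROMPT = "Explain KV-cache paging to a homelab audience."  # placeholder
for _ in range(3):                 # warm-up runs, discarded
    one_run(PROMPT)
runs = [one_run(PROMPT) for _ in range(5)]
print(f"TPS mean={statistics.mean(runs):.1f} stdev={statistics.stdev(runs):.1f}")
```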