LLM Inference & Serving

vLLM with PagedAttention is the production standard for self-hosted serving. AWQ is the default quantization for GPU serving; GGUF is the default for CPU and Apple Silicon. Self-hosting becomes worth evaluating at roughly $15-20K/month of API spend.

Serving Frameworks

vLLM (Dominant)

The standard for self-hosted LLM serving. Key innovations:

PagedAttention: Manages the KV cache like OS virtual memory, allocating pages on demand with near-zero waste. 2-4x throughput over naive implementations.
Continuous batching: New requests join the running batch without waiting for the longest sequence to finish. Critical for production latency.
Speculative decoding: A draft model generates candidate tokens and the target model verifies them in parallel. Real speedup: 1.5-2.5x for code and structured output.
Tensor parallelism: Splits the model across multiple GPUs, for models too large to fit on one.
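
The paging idea behind PagedAttention can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's implementation: the class name `PagedKVAllocator`, the 16-token page size, and the pool size are all assumptions chosen for illustration. It shows only the bookkeeping: a sequence takes a new page when its current one is full, so unused space never exceeds one partial page per sequence.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVAllocator:
    """Toy allocator: each sequence owns a list of fixed-size pages."""
    num_blocks: int                       # physical pages in the pool (assumed)
    block_size: int = 16                  # tokens per page (illustrative value)
    free: list = field(default_factory=list)
    tables: dict = field(default_factory=dict)    # seq_id -> list of page ids
    lengths: dict = field(default_factory=dict)   # seq_id -> tokens stored

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token; take a new page only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:      # no page yet, or current page is full
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots across all live sequences."""
        return sum(len(pages) * self.block_size - self.lengths[seq]
                   for seq, pages in self.tables.items())
```

A sequence of 20 tokens with 16-token pages holds two pages and leaves 12 slots unused; a scheme that pre-reserves the maximum context length per request would leave far more.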

Production deployment: Load balancer with autoscaling based on queue depth, not CPU/GPU utilization.
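
A likely reason for scaling on queue depth is that a batching server can show high GPU utilization well before requests start to wait, so utilization says little about user-facing load. The rule can be sketched as below. The function name, the per-replica target, and the replica bounds are hypothetical values, not settings from any particular autoscaler.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Replica count that keeps waiting requests per replica near the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the allowed range so the fleet never scales to zero or runs away.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 4 waiting requests per replica, a queue of 17 asks for 5 replicas, and an empty queue falls back to the minimum.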

Text Generation Inference (TGI, HuggingFace)

Alternative to vLLM. Strengths: tighter HuggingFace Hub integration, flash attention, watermarking. Generally similar performance to vLLM; choose based on ecosystem preference.

Ollama

For development and low-QPS internal tools. Not designed for production throughput. See Ollama Local LLM Runner.

Quantization Methods

Method	Bits	Quality Loss	Memory Reduction	Best For
FP16	16	None (baseline)	1x	Maximum quality
AWQ	4	~1-2% on benchmarks	2-3x	GPU production serving
GPTQ	4	~1-3%	2-3x	Batch processing, less latency-sensitive
GGUF	2-8 (mixed)	Varies by quant level	2-6x	CPU/Apple Silicon (llama.cpp/Ollama)

AWQ has emerged as the default for GPU serving. GGUF Q4_K_M (4-bit mixed precision) holds up well in quality, often within 2-3% of FP16 on practical tasks.

Rule of thumb: AWQ for production GPU serving, GGUF for everything else (dev, edge, Mac).
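
For a rough sense of what a bit width means for memory, the weight footprint is approximately parameters × bits / 8 bytes. The helper below is a back-of-the-envelope sketch with hypothetical names. It counts weights only and ignores the KV cache, activations, and quantization metadata such as scales and zero points, so real footprints are somewhat larger and real reductions somewhat smaller than the ideal ratio.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float) -> bool:
    """True if the weights alone fit; leaves no headroom for the KV cache."""
    return weight_memory_gb(params_billions, bits_per_weight) <= vram_gb
```

Under this estimate a 70B model needs about 140 GB at FP16 and about 35 GB at 4 bits, which is why a quantized 70B model is within reach of a single 48 GB card while the FP16 version is not.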

Hardware Guide

GPU	VRAM	Cloud Cost (approx)	Sweet Spot
H100 SXM	80GB	$2-3/hr	Frontier serving, training
A100 80GB	80GB	$1.5-2/hr	Production workhorse, best price/perf
L40S	48GB	$1-1.5/hr	70B quantized, inference-focused
RTX 4090	24GB	$0.3-0.5/hr	7B-13B models, development
Apple M-series	32-192GB unified	One-time	Local dev, viable for 70B with GGUF

CPU inference: Viable for small models (7B quantized) at ~10-30 tok/s via llama.cpp. Not competitive for production throughput but fine for low-QPS internal tools.

Cloud Inference Providers

Provider	Differentiator	Best For
Groq	LPU hardware, very fast token generation (~100-200 tok/s)	Latency-sensitive, interactive apps
Cerebras	Wafer-scale, massive throughput	High-throughput batch processing
Together AI	Wide model selection, competitive pricing	General-purpose open model hosting
Fireworks AI	Fast function calling	Tool-use heavy applications

Self-Hosted vs API Decision

Monthly API spend < $5K    → API (always)
$5K - $15K                 → API (unless latency/privacy requirements)
$15K - $50K                → Evaluate self-hosted (crossover zone)
> $50K                     → Self-hosted likely wins

Hidden costs of self-hosted: 1-2 FTE-months upfront engineering, 0.25 FTE ongoing maintenance, GPU procurement/reservation, monitoring infrastructure.

When to self-host regardless of cost: Data privacy requirements (healthcare, finance), latency requirements (<50ms), model customization needs, or regulatory compliance.
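
The spend tiers and the override conditions can be combined into one small function. This is one reading of the guidance above, not a definitive rule: the function name and return labels are hypothetical, and `hard_requirement` stands for any of the privacy, latency, customization, or compliance conditions, which are treated here as taking priority over spend.

```python
def hosting_recommendation(monthly_api_spend_usd: float,
                           hard_requirement: bool = False) -> str:
    """Map monthly API spend (USD) to a hosting recommendation label."""
    if monthly_api_spend_usd < 0:
        raise ValueError("spend cannot be negative")
    if hard_requirement:
        # Privacy, latency, customization, or regulatory needs override cost.
        return "self-host"
    if monthly_api_spend_usd < 15_000:
        return "api"
    if monthly_api_spend_usd <= 50_000:
        return "evaluate-self-hosted"     # crossover zone
    return "self-host-likely"
```

The sketch leaves out the hidden costs listed above; a fuller comparison would add the upfront and ongoing engineering time to the self-hosted side before comparing against API spend.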

Speculative Decoding

Draft model (small, fast) generates N candidate tokens. Target model (large, accurate) verifies all N in a single forward pass. Accepted tokens cost no extra target-model passes. At the first rejected token, the target's own token is used instead and drafting resumes from that position.

Real speedup: 1.5-2.5x for structured/predictable output (code, JSON). Less benefit for creative/unpredictable text. Works best when the draft model is a good predictor of the target.
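
The accept/reject loop can be sketched with toy models. This is the greedy variant, where a draft token is accepted only if it equals the target's own greedy choice; production systems that sample use a probabilistic acceptance test instead. The models here are plain functions from a token list to the next token, and all names are hypothetical. The verification loop emulates what a real server obtains from one batched target forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]   # toy stand-in for a greedy model


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, n: int) -> List[int]:
    """One round: draft proposes n tokens, target checks them in order."""
    # 1. Draft model proposes n tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(n):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Compare each proposal with the target's choice at that position.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)    # first mismatch: keep target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched: the target pass also yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             n: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many target rounds were used."""
    out, rounds = list(prefix), 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, n))
        rounds += 1
    return out[:len(prefix) + max_new], rounds
```

Each round emits at least one token and at most n + 1, and the output is identical to decoding with the target alone. The speedup comes from needing fewer target rounds than tokens, which is why a draft that predicts the target well matters.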

Serving Architecture Patterns

Single Model

Simplest. One model behind a load balancer. Good for single-purpose applications.

Model Router

Cheap classifier directs requests to different models based on complexity. See AI Cost Optimization for the tiering pattern.
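
A minimal sketch of the routing step follows. The keyword-and-length heuristic stands in for whatever cheap classifier is actually used, and the marker list, the 200-word threshold, and the model names `small-model` and `large-model` are placeholders, not recommendations.

```python
HARD_MARKERS = ("prove", "refactor", "step by step", "analyze")  # assumed cues
ROUTES = {"simple": "small-model", "complex": "large-model"}     # placeholders


def classify_complexity(prompt: str) -> str:
    """Hypothetical cheap classifier: long or keyword-flagged prompts are complex."""
    text = prompt.lower()
    if len(text.split()) > 200 or any(marker in text for marker in HARD_MARKERS):
        return "complex"
    return "simple"


def route(prompt: str) -> str:
    """Return the name of the model that should serve this prompt."""
    return ROUTES[classify_complexity(prompt)]
```

In practice the classifier is often a small fine-tuned model rather than a heuristic; the structure of the router stays the same.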

A/B Serving

Two model versions behind a router with traffic splitting. Essential for eval-driven model upgrades.
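
One common way to split traffic is to hash a stable identifier so each user always lands on the same variant, which keeps evaluation results comparable across requests. The sketch below assumes a string `user_id` and a 10% default treatment share; both the function name and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, treatment_share: float = 0.1) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # First 4 bytes as an integer, scaled into [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "treatment" if bucket < treatment_share else "control"
```

Raising `treatment_share` gradually moves more users to the new model version, and users already in the treatment group stay there because their bucket value does not change.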
