LLM Inference & Serving

vLLM with PagedAttention is the production standard for self-hosted serving. AWQ is the default quantization for GPU; GGUF for CPU/Apple Silicon. The self-hosted crossover point is roughly $15-20K/month API spend.

Serving Frameworks

vLLM (Dominant)

The standard for self-hosted LLM serving. Key innovations:

  • PagedAttention: Manages the KV cache like OS virtual memory. Fixed-size pages are allocated on demand, so almost no memory is wasted on over-reservation. Delivers 2-4x throughput over naive implementations
  • Continuous batching: New requests join without waiting for the longest sequence to finish. Critical for production latency
  • Speculative decoding: Draft model generates candidates, target model verifies in parallel. Real speedup: 1.5-2.5x for code/structured output
  • Tensor parallelism: Split model across GPUs for large models
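To make the PagedAttention idea concrete, here is a minimal sketch of on-demand block allocation. It is a toy illustration, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size of 16 are assumptions chosen for the example, and no attention math is performed.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size blocks (hypothetical API)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))  # physical block ids
        self.block_tables: Dict[str, List[int]] = {}  # sequence id -> physical blocks
        self.lengths: Dict[str, int] = {}  # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token; grab a new block only when the last one is full."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        if length == len(table) * self.block_size:  # no block yet, or all blocks full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the request must wait")
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots: at most block_size - 1 per sequence."""
        return sum(
            len(table) * self.block_size - self.lengths[seq_id]
            for seq_id, table in self.block_tables.items()
        )
```

Because a sequence never holds more than one partially filled block, waste is bounded per sequence rather than scaling with the maximum context length, which is the property the "near-zero waste" claim rests on.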

Production deployment: Put replicas behind a load balancer and autoscale on queue depth, not CPU/GPU utilization.
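A rough sketch of queue-depth scaling follows. The function name, the per-replica target, and the replica bounds are hypothetical; in practice the rule would live in whatever autoscaler the platform provides. The usual motivation is that a batching server tends to show high GPU utilization even at moderate load, so utilization says little about whether requests are waiting.

```python
import math


def desired_replicas(
    queue_depth: int,
    target_queue_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Scale so each replica handles roughly target_queue_per_replica waiting requests.

    All parameter names and defaults are illustrative assumptions.
    """
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    # Clamp to the configured bounds so the fleet never scales to zero or runs away.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, `desired_replicas(45, 10)` asks for 5 replicas, while an empty queue falls back to `min_replicas`.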

Text Generation Inference (TGI, HuggingFace)

Alternative to vLLM. Strengths: tighter HuggingFace Hub integration, FlashAttention, watermarking. Performance is generally similar to vLLM; choose based on ecosystem preference.

Ollama

For development and low-QPS internal tools. Not designed for production throughput. See Ollama Local LLM Runner.

Quantization Methods

Method  Bits         Quality Loss           Memory Reduction  Best For
FP16    16           None (baseline)        1x                Maximum quality
AWQ     4            ~1-2% on benchmarks    2-3x              GPU production serving
GPTQ    4            ~1-3%                  2-3x              Batch processing, less latency-sensitive
GGUF    2-8 (mixed)  Varies by quant level  2-6x              CPU/Apple Silicon (llama.cpp/Ollama)

AWQ has emerged as the default for GPU serving. GGUF Q4_K_M (4-bit mixed precision) gives surprisingly good quality, often within 2-3% of FP16 on practical tasks.

Rule of thumb: AWQ for production GPU serving, GGUF for everything else (dev, edge, Mac).
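As a back-of-the-envelope check on what quantization buys, weight storage is roughly parameter count times bits per weight. The helper below is a sketch under that assumption only: it ignores the KV cache, activations, and the scales and zero-points that quantized formats store alongside the weights, which is one reason real-world savings land below the ideal 4x for 4-bit and closer to the 2-3x listed in the table.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Idealized estimate: params * bits / 8 bytes, with no format overhead.
    """
    return params_billion * bits_per_weight / 8


# Illustrative sizes; the model sizes are examples, not measurements.
for name, params_b, bits in [("7B FP16", 7, 16), ("70B FP16", 70, 16), ("70B 4-bit", 70, 4)]:
    print(f"{name}: ~{weight_memory_gb(params_b, bits):.1f} GB of weights")
```

By this estimate a 70B model needs about 140 GB of weights at FP16 but about 35 GB at 4 bits, which is why quantization decides whether such a model fits on a single card.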

Hardware Guide

GPU             VRAM              Cloud Cost (approx)  Sweet Spot
H100 SXM        80GB              $2-3/hr              Frontier serving, training
A100 80GB       80GB              $1.5-2/hr            Production workhorse, best price/perf
L40S            48GB              $1-1.5/hr            70B quantized, inference-focused
RTX 4090        24GB              $0.3-0.5/hr          7B-13B models, development
Apple M-series  32-192GB unified  One-time             Local dev, viable for 70B with GGUF

CPU inference: Viable for small models (7B quantized) at ~10-30 tok/s via llama.cpp. Not competitive for production throughput but fine for low-QPS internal tools.

Cloud Inference Providers

Provider      Differentiator                                    Best For
Groq          LPU hardware, ultra-low latency (~100-200 tok/s)  Latency-sensitive, interactive apps
Cerebras      Wafer-scale, massive throughput                   High-throughput batch processing
Together AI   Wide model selection, competitive pricing         General-purpose open model hosting
Fireworks AI  Fast function calling                             Tool-use heavy applications

Self-Hosted vs API Decision

Monthly API spend < $5K    → API (always)
$5K - $15K                 → API (unless latency/privacy requirements)
$15K - $50K                → Evaluate self-hosted (crossover zone)
> $50K                     → Self-hosted likely wins

Hidden costs of self-hosted: 1-2 FTE-months upfront engineering, 0.25 FTE ongoing maintenance, GPU procurement/reservation, monitoring infrastructure.
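To compare against an API bill, the recurring part of these costs can be roughed out as GPU rental plus maintenance staffing. The sketch below is an estimate under loud assumptions: `fte_monthly_usd` is a placeholder figure, not a number from this note, and the upfront 1-2 FTE-months, reserved-instance discounts, and monitoring spend are left out.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def self_hosted_monthly_cost(
    num_gpus: int,
    gpu_hourly_usd: float,
    maintenance_fte: float = 0.25,  # ongoing maintenance share from the text above
    fte_monthly_usd: float = 15_000.0,  # ASSUMPTION: fully loaded engineer cost per month
) -> float:
    """Recurring monthly cost of an always-on GPU fleet plus maintenance staffing."""
    gpu_cost = num_gpus * gpu_hourly_usd * HOURS_PER_MONTH
    staff_cost = maintenance_fte * fte_monthly_usd
    return gpu_cost + staff_cost
```

With four GPUs at $2/hr, the estimate is 4 x 2 x 730 = $5,840 in compute plus $3,750 in staffing, or about $9,590/month before any upfront engineering.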

When to self-host regardless of cost: Data privacy requirements (healthcare, finance), latency requirements (<50ms), model customization needs, or regulatory compliance.
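The spend thresholds and the override can be written as one small function. This is only a restatement of the rules above as code: the function name and return strings are made up for the example, a spend of exactly $15K is assumed to fall in the crossover zone, and the hard-requirement override is assumed to win even below $5K.

```python
def serving_recommendation(monthly_api_spend_usd: float, hard_requirement: bool = False) -> str:
    """Map monthly API spend to a recommendation.

    hard_requirement covers privacy, strict latency, customization, or
    regulatory needs that justify self-hosting regardless of cost.
    """
    if hard_requirement:
        return "self-host"
    if monthly_api_spend_usd < 15_000:
        return "api"
    if monthly_api_spend_usd <= 50_000:
        return "evaluate self-hosted"
    return "self-hosted likely wins"
```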

Speculative Decoding

A small, fast draft model generates N candidate tokens. The large, accurate target model verifies all N in a single forward pass. Accepted tokens cost no extra target-model passes; at the first rejected token, the target model supplies the replacement and the remaining candidates are discarded.

Real speedup: 1.5-2.5x for structured/predictable output (code, JSON). Less benefit for creative/unpredictable text. Works best when draft model is a good predictor of the target.
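The draft-then-verify loop can be sketched with toy "models" that map a token context to the next token. This is the greedy variant only, with hypothetical names throughout; production systems verify with a probabilistic acceptance rule and score all draft positions in one batched forward pass, which the inner loop here merely emulates.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # context -> greedy next token


def greedy_decode(next_token: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    n_draft: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target verification passes)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < limit:
        # 1. The draft model proposes n_draft tokens autoregressively.
        draft: List[int] = []
        for _ in range(n_draft):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every position; counted as ONE pass because a
        #    real model would score all positions in a single batched forward.
        target_passes += 1
        accepted: List[int] = []
        for i in range(n_draft + 1):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)  # a match, a correction, or a bonus token
            if i == n_draft or draft[i] != expected:
                break  # stop at the first mismatch or after the bonus token
        tokens.extend(accepted)
    return tokens[:limit], target_passes


# Toy models (assumptions for the demo): the draft agrees with the target
# except at every fourth context length, where it guesses 0.
def toy_target(context: List[int]) -> int:
    return (context[-1] * 3 + 1) % 11


def toy_draft(context: List[int]) -> int:
    return toy_target(context) if len(context) % 4 else 0
```

The output always matches plain greedy decoding with the target model; only the number of target passes changes, and it shrinks as the draft model agrees with the target more often.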

Serving Architecture Patterns

Single Model

Simplest. One model behind a load balancer. Good for single-purpose applications.

Model Router

Cheap classifier directs requests to different models based on complexity. See AI Cost Optimization for the tiering pattern.
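A minimal sketch of the routing step, assuming a two-tier setup. The keyword-and-length heuristic stands in for whatever cheap classifier is actually used, and the tier names `small-model` and `large-model` are placeholders rather than real model identifiers.

```python
import re

# ASSUMPTION: crude markers of a hard request; a real router would use a trained classifier.
HARD_MARKERS = {"prove", "refactor", "analyze", "derive"}
MODEL_TIERS = {"simple": "small-model", "complex": "large-model"}  # placeholder names


def classify_complexity(prompt: str) -> str:
    """Label a prompt 'simple' or 'complex' using length and keyword heuristics."""
    words = re.findall(r"[a-z]+", prompt.lower())
    if len(words) > 200 or HARD_MARKERS.intersection(words):
        return "complex"
    return "simple"


def route(prompt: str) -> str:
    """Pick the model tier that should serve this prompt."""
    return MODEL_TIERS[classify_complexity(prompt)]
```

The classifier has to be much cheaper than the models it routes between, otherwise the routing overhead eats the savings.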

A/B Serving

Two model versions behind a router with traffic splitting. Essential for eval-driven model upgrades.
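One common way to split traffic is to hash a stable request key into a bucket, so the same user always sees the same variant. The sketch below assumes that approach; the function name, the variant labels, and the 10% default are illustrative.

```python
import hashlib


def assign_variant(request_key: str, candidate_fraction: float = 0.1) -> str:
    """Deterministically assign a key (e.g. a user id) to 'baseline' or 'candidate'."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_fraction else "baseline"
```

Hashing rather than random sampling keeps assignments sticky across requests, which makes per-user eval metrics comparable between the two versions.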

Sources