Fine-tuning vs Prompting vs RAG

Start with prompting. Add RAG if knowledge is the gap. Fine-tune only when you have proven demand and eval metrics that show a gap prompting and RAG cannot close. The 2026 consensus: fine-tuning is alive but narrower, used primarily for cost optimization at scale (distillation) and for specialized domains.

Decision Tree

What's the problem?
│
├─ Model lacks specific knowledge
│  (your docs, recent data, proprietary info)
│  → RAG
│
├─ Model knows how but doesn't do it consistently
│  (formatting, tone, following rules, tool use patterns)
│  → Prompt Engineering
│
├─ Model fundamentally can't do the task well
│  (domain-specific language, specialized classification,
│   consistent style that prompting can't capture)
│  → Fine-tuning
│
└─ Need to run cheaply at massive scale
   (proven demand, clear cost pressure)
   → Distillation + Fine-tuning
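
The tree above can be written as a small function. This is a minimal sketch: the flag names and the `Approach` enum are made up for illustration, and checking the branches top to bottom (so the first match wins when several flags are set) is an assumption, not something the tree itself specifies.

```python
from enum import Enum


class Approach(Enum):
    RAG = "RAG"
    PROMPTING = "Prompt Engineering"
    FINE_TUNING = "Fine-tuning"
    DISTILLATION = "Distillation + Fine-tuning"


def choose_approach(
    lacks_knowledge: bool = False,
    inconsistent_behavior: bool = False,
    cannot_do_task: bool = False,
    needs_cheap_scale: bool = False,
) -> Approach:
    """Walk the branches of the tree in order and return the first match."""
    if lacks_knowledge:
        return Approach.RAG
    if inconsistent_behavior:
        return Approach.PROMPTING
    if cannot_do_task:
        return Approach.FINE_TUNING
    if needs_cheap_scale:
        return Approach.DISTILLATION
    # No branch matched: fall back to the note's opening advice (start with prompting).
    return Approach.PROMPTING


print(choose_approach(lacks_knowledge=True).value)
```

In practice several problems often apply at once (for example RAG plus a fine-tuned small model), so treat the single return value as a starting point rather than a verdict.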

When Fine-tuning Is Worth It (2026)

| Use Case | Why Fine-tune | Example |
|---|---|---|
| Classification at scale | A fine-tuned 8B model can match GPT-4 at roughly 1/100th the cost | Sentiment, intent, toxicity detection |
| Domain-specific generation | Strict terminology requirements | Medical/legal/financial text |
| Consistent style/format | Near-identical output format across millions of calls | Brand voice, structured reports |
| Latency-critical | Fine-tuned small model: 10-50ms vs 500ms+ API | Real-time classification in request path |
| Distillation | The dominant cost optimization pattern | Large model → training data → small model |

When Prompting Is Enough

  • Task is well-defined and the model “gets it” with good instructions
  • Output format can be enforced via structured outputs / JSON mode
  • Volume is low enough that API costs are acceptable
  • You need flexibility to iterate quickly (prompt changes deploy instantly; fine-tuning takes hours)
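
Even with structured outputs or JSON mode turned on, it is cheap to check the reply on the client side before using it. The sketch below uses only the standard library; `parse_structured_reply` and the example keys are hypothetical names, and the hard-coded `reply` string stands in for whatever your provider returns.

```python
import json


def parse_structured_reply(raw: str, required_keys: set[str]) -> dict:
    """Parse a JSON reply and check that every required key is present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Stand-in for a model reply; a real call would return this string from the API.
reply = '{"sentiment": "positive", "confidence": 0.92}'
parsed = parse_structured_reply(reply, {"sentiment", "confidence"})
print(parsed["sentiment"])
```

If replies keep failing a check like this despite good instructions, that is one signal that prompting alone may not be enough for the task.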

When RAG Is the Answer

  • Model needs access to proprietary/recent information it wasn’t trained on
  • Knowledge changes frequently (docs, product catalog, policies)
  • You need citations / source attribution
  • See RAG Architecture Patterns 2026 for implementation details

When RAG Is Overkill

  • The knowledge fits in a system prompt (<10K tokens)
  • The knowledge is static and rarely changes → embed in prompt, cache it
  • You only need a few facts → just put them in the prompt
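
A quick way to apply the 10K-token rule above is to estimate the size of the knowledge before building any retrieval infrastructure. This is a rough sketch: the 4-characters-per-token ratio is an assumed average for English text, and the function names are illustrative. Use your model's real tokenizer for an exact count.

```python
PROMPT_BUDGET_TOKENS = 10_000  # threshold quoted in the list above
CHARS_PER_TOKEN = 4            # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_prompt(documents: list[str], budget: int = PROMPT_BUDGET_TOKENS) -> bool:
    """True if all documents together are estimated to stay under the budget."""
    return sum(estimate_tokens(doc) for doc in documents) < budget


policies = ["Refunds are accepted within 30 days." * 20, "Shipping takes 3-5 days." * 20]
print("embed in prompt" if fits_in_prompt(policies) else "consider RAG")
```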

Fine-tuning Methods

| Method | VRAM Required | Quality | Speed | Use When |
|---|---|---|---|---|
| Full fine-tune | 8x model size | Best | Slow | Budget unlimited, max quality |
| LoRA | ~1.5x model size | 95-99% of full | Fast | Standard production choice |
| QLoRA | ~0.5x model size | 90-97% of full | Moderate | GPU-constrained, experimentation |

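LoRA on Llama 70B: ~140GB VRAM for the fp16 base weights, before adapter and activation overhead (2x A100 80GB at minimum)
QLoRA on Llama 70B: ~40GB VRAM with 4-bit base weights (1x A100 80GB)

For most production use cases, LoRA is the default. Use QLoRA for prototyping or when GPU memory is tight.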
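
The two Llama 70B figures follow mostly from bytes per parameter. The sketch below computes weights-only memory, assuming 2 bytes per parameter for fp16 and 0.5 bytes per parameter for 4-bit quantization. It does not model adapters, optimizer state, or activations, so real usage will be higher. The function names are illustrative.

```python
import math

# Assumed storage cost of the frozen base weights, in bytes per parameter.
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the base weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def gpus_needed(memory_gb: float, gpu_gb: float = 80.0) -> int:
    """Minimum number of GPUs of the given size that can hold memory_gb."""
    return math.ceil(memory_gb / gpu_gb)


lora_70b = weight_memory_gb(70, "fp16")   # 140.0 GB of fp16 weights
qlora_70b = weight_memory_gb(70, "int4")  # 35.0 GB of 4-bit weights

print(lora_70b, gpus_needed(lora_70b))
print(qlora_70b, gpus_needed(qlora_70b))
```

The gap between the 35GB computed here and the ~40GB quoted for QLoRA is the kind of overhead this estimate leaves out.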

The Distillation Pattern

The dominant cost optimization strategy in 2026:

1. Use frontier model (Claude Opus, GPT-4o) to generate
   10K-100K high-quality examples for your specific task

2. Fine-tune open-source model (Llama 8B/70B) on these examples
   using LoRA

3. Serve fine-tuned model with vLLM

4. Result: 10-100x cost reduction for that specific task
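
Step 1 amounts to collecting (prompt, completion) pairs from the frontier model and saving them in a format a trainer can read. A minimal sketch, assuming a JSONL file with `prompt` and `completion` fields (field names vary by training framework). `fake_teacher` is a hypothetical stand-in for a real API call to the frontier model.

```python
import json
from typing import Callable, Iterable


def build_training_records(prompts: Iterable[str], teacher: Callable[[str], str]) -> list[dict]:
    """Ask the teacher model for a completion per prompt and keep non-empty replies."""
    records = []
    for prompt in prompts:
        completion = teacher(prompt).strip()
        if not completion:
            continue  # drop empty teacher replies rather than training on them
        records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def fake_teacher(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to the frontier model.
    return "positive" if "love" in prompt else "negative"


records = build_training_records(["I love it", "I hate it"], fake_teacher)
print(to_jsonl(records))
```

Filtering and deduplicating the generated examples before training is usually worth the effort, since the student inherits whatever mistakes the dataset contains.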

Requirements: Clear eval metrics to validate the fine-tuned model matches the frontier model on your specific task. Without evals, you’re flying blind.
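
For a classification task, the simplest such eval is agreement with the frontier model on a held-out set. The sketch below is one possible gate, not a standard: the 0.95 threshold and the function names are assumptions, and generation tasks need a richer metric than exact match.

```python
def agreement_rate(teacher_labels: list[str], student_labels: list[str]) -> float:
    """Fraction of held-out examples where the student matches the teacher."""
    if len(teacher_labels) != len(student_labels):
        raise ValueError("label lists must have the same length")
    if not teacher_labels:
        raise ValueError("need at least one example")
    matches = sum(t == s for t, s in zip(teacher_labels, student_labels))
    return matches / len(teacher_labels)


def passes_gate(teacher_labels: list[str], student_labels: list[str],
                threshold: float = 0.95) -> bool:
    """True if the student agrees with the teacher often enough to ship."""
    return agreement_rate(teacher_labels, student_labels) >= threshold


teacher = ["pos", "neg", "pos", "neg"]
student = ["pos", "neg", "pos", "pos"]
print(agreement_rate(teacher, student), passes_gate(teacher, student))
```

Where human-labeled data exists, scoring both models against it is a stronger check than teacher agreement alone, because it also catches errors the student copied from the teacher.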

Cost Comparison

| Approach | Upfront Cost | Per-Query Cost | Time to Deploy |
|---|---|---|---|
| Prompting (Sonnet) | $0 | $3-15/M tokens | Minutes |
| RAG + Prompting | $1-5K setup | $3-15/M tokens + infra | 1-2 weeks |
| Fine-tuning (LoRA) | $50-500 (compute) | $0.02-0.5/M tokens (self-hosted) | 1-3 weeks |
| Distillation pipeline | $500-2K (data gen + training) | $0.02-0.05/M tokens | 2-4 weeks |
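
The table implies a break-even volume: the upfront cost divided by the per-token saving. A minimal sketch using figures from the table. It ignores GPU hosting, engineering time, and retraining, all of which push the real break-even point higher, and the function name is illustrative.

```python
def break_even_million_tokens(upfront_usd: float,
                              api_usd_per_m: float,
                              self_hosted_usd_per_m: float) -> float:
    """Millions of tokens after which the upfront cost is paid back."""
    saving_per_m = api_usd_per_m - self_hosted_usd_per_m
    if saving_per_m <= 0:
        raise ValueError("self-hosted cost must be below the API cost")
    return upfront_usd / saving_per_m


# Distillation at the high end ($2K upfront, $0.05/M) vs. the low-end API price ($3/M).
volume = break_even_million_tokens(2000, 3.0, 0.05)
print(f"break-even after about {volume:.0f}M tokens")
```

With these inputs the pipeline pays for itself after roughly 678M tokens, which is why the note ties distillation to proven demand at scale.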

Sources