AI Cost Optimization

The three highest-ROI optimizations: prompt caching (90% input savings), model tiering (60-70% requests on cheap tier), and batch API (50% off). Most teams over-spend on model quality and under-invest in these mechanical optimizations.

Token Pricing Landscape (April 2026)

Per 1M tokens:

ModelInputOutputNotes
Claude Opus 4$15$75Maximum capability
Claude Sonnet 4$3$15Production workhorse
Claude Haiku 3.5$0.80$4Fast, cheap
GPT-4o$2.50$10OpenAI flagship
GPT-4o mini$0.15$0.60Cheapest major API
Gemini 2.5 Pro$1.25-2.50$10-15Tiered by context
Llama 70B (Together)$0.54$0.54Open-source hosted
Llama 8B (self-hosted)~$0.02-0.05~$0.02-0.05Amortized GPU

Optimization Strategies (Ranked by ROI)

1. Prompt Caching (90% Input Savings)

The single biggest optimization. Design prompts with static prefix:

[STATIC prefix: system prompt + tools + examples]  ← Cached at 90% discount
[DYNAMIC suffix: user message]                      ← Full price

Real numbers: 10K-token system prompt, 1M calls/month:

  • Without caching: ~$30
  • With caching: ~$3
  • Rule: Never put dynamic content before static content

2. Model Tiering / Routing (60-70% Cost Reduction)

Use a cheap model for simple requests, expensive model for complex ones:

User request
  → Classifier (Haiku / 4o-mini, ~$0.15/M)
    ├─ Simple (60-70% of requests) → Haiku/4o-mini
    └─ Complex (30-40%) → Sonnet/GPT-4o

Production implementations report 60-70% of requests handled by the cheap tier. Frameworks: Martian, Portkey, LiteLLM, or custom routing logic.

3. Batch API (50% Off)

Both Anthropic and OpenAI offer batch endpoints:

  • 50% discount on all tokens
  • 24-hour SLA (not real-time)
  • Perfect for: eval runs, data processing, content generation pipelines, nightly jobs

4. Context Window Economics

Sending 200K tokens of context: 0.006/call + embedding cost

RAG is ~100x cheaper per query for knowledge retrieval. Use long context only when you need holistic understanding of the full document, not keyword lookup.

5. Token Reduction

TechniqueSavingsQuality Impact
Prompt compression (LLMLingua)2-5xMinimal for well-structured content
Output constraints (max_tokens)20-40%None if constraints match needs
Structured output schemas10-20%Prevents verbose padding
Removing redundant instructions10-30%None (these were wasted tokens)

6. Distillation (10-100x for Specific Tasks)

The ultimate optimization — see Fine-tuning vs Prompting vs RAG:

Frontier model → 10K-100K training examples → Fine-tune Llama 8B → Serve with vLLM

Requires 2-4 weeks engineering investment and clear eval metrics.

Real Production Cost Examples

ApplicationScaleMonthly CostKey Optimization
AI coding assistant10K users, 100 req/user/day$50-150KCaching + Sonnet/Haiku mix
Customer support bot50K conversations/day$5-15K80% Haiku / 20% Sonnet routing
Document processing100K docs/day$2-8KBatch API + caching
Self-hosted Llama 70B5M+ requests/day$8-12K (4x A100)Amortized GPU

The Self-Hosted Crossover

API cost/month > (GPU cost + engineering amortization)

Crossover typically at $15-20K/month for a single model type. Engineering cost: 1-2 FTE-months upfront, 0.25 FTE ongoing.

See LLM Inference and Serving for deployment details.

Cost Monitoring Checklist

  • Track cost per request (not just monthly total)
  • Monitor cache hit rate (should be >80% for stable prompts)
  • Measure routing accuracy (cheap model handling correct % of requests)
  • Alert on cost spikes (denial-of-wallet prevention)
  • Compare model quality vs cost weekly (prices drop frequently)

Sources