Production Prompt Engineering

In 2026, prompt engineering is less about “magic phrases” and more about system prompt architecture, tool schema design, and cache economics. The craft has shifted from writing clever prompts to designing prompt systems.

System Prompt Architecture

Production system prompts follow a structured pattern. Claude Code’s system prompt is a reference — thousands of tokens organized as:

┌─────────────────────────────────────────────┐
│  1. Identity & Role Definition               │  ← Who the model is
│  2. Behavioral Rules (prioritized)           │  ← Safety > accuracy > helpfulness
│  3. Tool/Function Definitions                │  ← Schemas + descriptions
│  4. Context Injection (dynamic)              │  ← User files, env info, CLAUDE.md
│  5. Output Format Instructions               │  ← Structured response requirements
│  6. Examples (few-shot)                      │  ← Complex behavior demonstrations
└─────────────────────────────────────────────┘

Key insight: Tool/function schema design is now a core prompt engineering skill. A well-written description field in a function schema is effectively a mini-prompt. Clear parameter constraints, examples, and edge case handling in the schema dramatically improve tool use accuracy.

Prompt Caching Economics

The single biggest cost optimization for apps with stable system prompts.

Anthropic Prompt Caching

Cached input tokens: 90% cheaper than uncached
Cache write: 25% premium on first call
TTL: 5 minutes (resets on each cache hit)

Design Pattern: Cache-Friendly Prompts

[STATIC: System prompt + tool definitions + few-shot]     ← Cached (90% savings)
[SEMI-STATIC: User profile, project context]              ← Cached shorter TTL
[DYNAMIC: Current user message]                           ← Never cached

Real numbers: 10K-token system prompt, 1M calls/month:

Without caching: ~$30/month
With caching: ~$3/month + first-call premium
Savings: ~90%

Design rule: Put all static content at the beginning of the prompt. Dynamic content at the end. Never interleave — it breaks the cache prefix.

Extended Thinking vs Manual CoT

	Native Extended Thinking	Manual CoT (“Let’s think step by step”)
Available on	Claude (extended thinking), o1/o3/o4-mini	Any model
Quality	Better for math, logic, complex reasoning	Good for smaller models
Cost	Thinking tokens are billed	Output tokens are billed
Control	Budget-controllable (max thinking tokens)	Unpredictable length
When to use	Frontier models with native support	Smaller/cheaper models without native thinking

Practical rule: If using a frontier model with native thinking, don’t add manual CoT — it’s redundant and wastes tokens. If using a smaller model (Haiku, GPT-4o-mini, open-source), manual CoT still helps.

Structured Outputs / JSON Mode

Now standard across providers. Both a reliability pattern AND cost optimization — eliminates retry loops from malformed output.

Anthropic: Tool use with strict schemas
OpenAI: response_format: { type: "json_schema", json_schema: {...} }

When to use: Any time the output needs to be machine-parsed. Even for “free text” responses, consider wrapping in a schema with content and metadata fields.

Meta-Prompting

Using LLMs to generate and optimize prompts. Standard workflow in 2026:

1. Write initial prompt
2. Build eval suite (50-100 test cases)
3. Use LLM to generate prompt variations
4. Score against evals
5. Iterate

Tools: DSPy (Stanford) automates prompt optimization through compilation. Anthropic’s metaprompt generates system prompts from task descriptions.

This connects prompting directly to the eval pipeline — prompt engineering and evaluation are no longer separate activities.

Few-Shot vs Zero-Shot in 2026

Scenario	Approach	Why
Frontier model + clear task	Zero-shot + structured output	Models are good enough; examples waste cache space
Complex formatting requirements	2-3 examples	Show, don’t tell — especially for unusual formats
Domain-specific terminology	3-5 examples	Calibrate the model’s vocabulary
Smaller/cheaper models	More examples (5-10)	Compensates for reduced capability
Consistency across calls	Few-shot with canonical examples	Anchors the output distribution

Cache tip: Put few-shot examples in the static system prompt prefix (cached) rather than in each user message (uncached).

Production Anti-Patterns

Anti-Pattern	Why It’s Bad	Do This Instead
”Be very careful and thorough”	Vague, increases verbosity and cost	Specific constraints: “max 3 sentences”
Repeating instructions in every message	Wastes tokens, breaks cache	Put in system prompt once
Manual CoT on frontier models	Redundant with native thinking	Use extended thinking budget
Hardcoded examples in user messages	Can’t be cached	Move to system prompt
No structured output schema	Parsing failures, retries	Always use JSON mode for machine-parsed output

KahWei's Wiki

Explorer

Production Prompt Engineering

Production Prompt Engineering

System Prompt Architecture

Prompt Caching Economics

Anthropic Prompt Caching

Design Pattern: Cache-Friendly Prompts

Extended Thinking vs Manual CoT

Structured Outputs / JSON Mode

Meta-Prompting

Few-Shot vs Zero-Shot in 2026

Production Anti-Patterns

Sources

Graph View

Table of Contents

Backlinks

KahWei's Wiki

Explorer

Production Prompt Engineering

Production Prompt Engineering

System Prompt Architecture

Prompt Caching Economics

Anthropic Prompt Caching

Design Pattern: Cache-Friendly Prompts

Extended Thinking vs Manual CoT

Structured Outputs / JSON Mode

Meta-Prompting

Few-Shot vs Zero-Shot in 2026

Production Anti-Patterns

Related Pages

Sources

Graph View

Table of Contents

Backlinks