AI Evals and Testing

The hardest question in production AI is “How do you know your product is working?” Eval-driven development (EDD) is becoming the AI equivalent of test-driven development (TDD). The key insight: optimize retrieval quality, not model selection. When generation hallucinates, the root cause is almost always bad retrieval.

The Eval Platform Landscape

| Platform | Type | Strength | Notable |
|---|---|---|---|
| Braintrust | SaaS | End-to-end eval lifecycle (datasets, scoring, experiments, CI gates) | $800M valuation |
| Promptfoo | OSS → acquired | Red-teaming, security testing, CLI-native | Acquired by OpenAI for $86M (March 2026) |
| LangSmith | SaaS | LangChain/LangGraph native, full execution tree rendering | Best for LangChain stacks |
| DeepEval | OSS (Python) | 50+ metrics, Pytest integration, CI/CD native | v3.0: component-level granularity, agent simulation |
| Arize Phoenix | OSS | OpenTelemetry-based, span-level tracing | Best for production observability |
| Maxim AI | SaaS | RAG-specific evaluation | Leading RAG eval platform |
| Evidently | OSS | 25M+ downloads, regression testing, drift detection | Best for monitoring over time |

Types of Evals

LLM-as-Judge

One model scores another model’s output. Fast and scalable, and far cheaper than human review at volume.

  • Cost: ~$0.01-0.10 per evaluation
  • Use smaller models (GPT-4o-mini, Haiku) for routine evals, larger for calibration
  • Requires calibration against human judgments to be trustworthy
  • Both DeepEval and Braintrust support this natively
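
Calibration can be made concrete by measuring how often the judge agrees with human labels on a shared sample. The sketch below computes Cohen's kappa (agreement corrected for chance) in plain Python. The labels are made-up illustration data, and `cohens_kappa` is a hypothetical helper, not part of any platform listed above.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: LLM judge verdicts vs. human annotations on 8 outputs.
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

A low kappa on the calibration sample suggests the judge prompt or rubric needs revision before its scores are used for gating.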

Human Eval

The gold standard, but expensive. Used for:

  • Calibrating automated metrics
  • High-stakes decisions
  • Subjective quality assessment

Braintrust and LangSmith provide annotation UIs for collecting these judgments.

Automated Metrics

BLEU, ROUGE, semantic similarity, faithfulness, hallucination detection. Applied in CI/CD pipelines for regression testing.
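
To show what an overlap metric measures, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch only: it tokenizes on whitespace and skips the stemming and other details of the official ROUGE implementation, and `unigram_f1` is a hypothetical name.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared lowercase whitespace tokens (a rough ROUGE-1 analogue)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("the cat", "the cat sat down")
print(f"unigram F1 = {score:.2f}")
```

Overlap metrics like this are cheap enough to run on every commit, but they reward surface similarity, which is why they are usually paired with semantic or judge-based metrics.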

Regression Testing

Run a fixed test suite on every prompt or model change. Braintrust can enforce the results as a release gate.
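
A regression check can be as simple as comparing the current run's scores against a stored baseline. The following is a minimal sketch with made-up scores; `find_regressions` and the 0.02 tolerance are assumptions for illustration, not a Braintrust API.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics that dropped by more than `tolerance` or went missing.

    Both arguments map metric name -> score in [0, 1].
    """
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None or new_score < old_score - tolerance:
            regressions[metric] = (old_score, new_score)
    return regressions


# Illustrative numbers only.
baseline = {"faithfulness": 0.92, "answer_relevancy": 0.88}
current = {"faithfulness": 0.93, "answer_relevancy": 0.81}
regressions = find_regressions(baseline, current)
for metric, (old, new) in regressions.items():
    print(f"REGRESSION {metric}: {old} -> {new}")
```

The tolerance absorbs run-to-run noise from non-deterministic models; it should be tuned from the observed variance of repeated runs on an unchanged system.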

Eval-Driven Development (EDD)

The AI equivalent of TDD. Real and growing practice in 2026:

1. Define eval datasets
   (questions + expected outputs + context)

2. Write metric definitions
   (faithfulness, relevance, tool accuracy)

3. Run evals on every change
   (prompt edits, model swaps, retrieval config changes)

4. Gate releases on eval scores
   (thresholds must pass before deploy)

5. Monitor production with the same metrics
   (continuous, not just pre-deploy)

Braintrust is the primary platform enabling this end-to-end. DeepEval’s Pytest integration makes EDD feel like TDD for AI.
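
The five steps above can be sketched as a plain test function that a runner such as Pytest would collect. This is a library-free illustration, not DeepEval's API: `generate_answer` is a hypothetical stand-in for the system under test, the dataset is invented, and the substring metric is deliberately simplistic.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str
    context: str


# Step 1: eval dataset (invented examples).
DATASET = [
    EvalCase("What is the refund window?", "30 days",
             "Refunds are accepted within 30 days."),
    EvalCase("Which plan includes SSO?", "Enterprise",
             "SSO is available on the Enterprise plan."),
]


def generate_answer(question: str, context: str) -> str:
    """Hypothetical stand-in for the real pipeline (normally an LLM call)."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is on the Enterprise plan.",
    }
    return canned[question]


# Step 2: metric definition (1.0 if the expected string appears in the answer).
def contains_expected(answer: str, expected: str) -> float:
    return 1.0 if expected.lower() in answer.lower() else 0.0


THRESHOLD = 0.9  # Step 4: release gate, an assumed value.


# Step 3: run the suite on every change.
def run_suite() -> float:
    return mean(
        contains_expected(generate_answer(c.question, c.context), c.expected)
        for c in DATASET
    )


def test_release_gate():
    score = run_suite()
    assert score >= THRESHOLD, f"eval score {score:.2f} below {THRESHOLD}"
```

Running the same `run_suite` logic against sampled production traffic covers step 5, so pre-deploy and post-deploy numbers stay comparable.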

Testing RAG Systems

Evaluate retrieval and generation separately. When generation fails, the root cause is almost always retrieval, so scoring the two stages independently shows which one to fix.

Retrieval Metrics

| Metric | What It Measures |
|---|---|
| Context Precision | % of retrieved chunks that are relevant |
| Context Recall | % of relevant chunks that were retrieved |
| NDCG | Ranking quality: relevance discounted by position and normalized against the ideal ordering (top results matter more) |

Of the three, NDCG tends to correlate most strongly with end-to-end quality, because it rewards placing relevant chunks near the top rather than treating relevance as a binary hit or miss.
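
A short sketch of the three retrieval metrics, using the common log2 position discount for NDCG. The document IDs and graded relevance values are invented, and the function names are illustrative rather than taken from any eval library.

```python
import math


def precision_recall(retrieved, relevant):
    """Context precision and recall over document IDs."""
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def ndcg(retrieved, relevance, k=None):
    """NDCG@k; `relevance` maps doc ID -> graded gain (missing IDs count as 0)."""
    cutoff = k if k is not None else len(retrieved)
    ranked = retrieved[:cutoff]
    # Discounted cumulative gain: later positions contribute less.
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked))
    # Ideal DCG: the same cutoff filled with the highest gains in best order.
    ideal_gains = sorted(relevance.values(), reverse=True)[:cutoff]
    idcg = sum(gain / math.log2(rank + 2)
               for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


relevance = {"d1": 3, "d2": 2, "d3": 1}       # graded relevance labels
good_order = ["d1", "d2", "d3", "d7"]          # best chunk first
poor_order = ["d7", "d3", "d2", "d1"]          # best chunk last

print(precision_recall(good_order, relevance), ndcg(good_order, relevance))
print(precision_recall(poor_order, relevance), ndcg(poor_order, relevance))
```

Both orderings have identical precision and recall, yet their NDCG differs, which is the position sensitivity that binary metrics miss.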

Generation Metrics

| Metric | What It Measures |
|---|---|
| Faithfulness | Every claim grounded in retrieved context? |
| Answer Relevancy | Does the answer address the question? |
| Hallucination Rate | Claims not supported by any retrieved chunk |
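
One common way to compute these scores is to split an answer into claims and have a judge mark each claim as supported or not. The sketch below assumes that claim extraction and verdicts already exist (here they are hard-coded) and uses a simplified definition in which hallucination rate is the complement of faithfulness; real libraries may define both differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimVerdict:
    claim: str
    supported: bool  # True if some retrieved chunk backs the claim


def faithfulness(verdicts) -> float:
    """Share of claims grounded in the retrieved context."""
    if not verdicts:
        return 1.0  # an answer with no claims has nothing unsupported
    return sum(v.supported for v in verdicts) / len(verdicts)


def hallucination_rate(verdicts) -> float:
    """Share of claims with no supporting chunk (simplified definition)."""
    return 1.0 - faithfulness(verdicts)


# Invented verdicts, as a judge model might return them.
verdicts = [
    ClaimVerdict("Refunds are accepted within 30 days.", True),
    ClaimVerdict("Refunds are issued to the original payment method.", True),
    ClaimVerdict("Shipping fees are refunded.", False),
    ClaimVerdict("Gift cards are non-refundable.", True),
]
print(faithfulness(verdicts), hallucination_rate(verdicts))
```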

End-to-End

Answer correctness vs ground truth, latency, token usage.

Anti-Pattern

Optimizing prompts to maximize Ragas scores produces systems that score well on evals but fail on production queries with slight variations. Metrics should track quality, not be the optimization target.

Testing Agents

The hardest unsolved problem. Key approaches:

| Approach | What It Tests |
|---|---|
| Step-level eval | Score each step independently; a wrong tool call in step 2 corrupts everything after it |
| Tool selection accuracy | Did the agent pick the right tool? |
| Planning quality | Was the plan reasonable before execution? |
| Multi-turn simulation | Generate realistic branching conversations (DeepEval v3.0) |
| Trajectory eval | Compare actual path against ideal trajectories |

The key insight: evaluate the process (tool selection, planning, reasoning), not just the final output. A correct answer reached via wrong reasoning is a ticking time bomb.
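
A minimal sketch of process-level scoring: compare the agent's tool-call sequence against a reference trajectory and report where the two first diverge. The tool names are invented, and strict position-by-position matching is an assumption; real trajectory evals often allow reordering or multiple valid paths.

```python
def tool_selection_accuracy(actual, expected) -> float:
    """Fraction of positions where the tools match; extra or missing steps count as misses."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 1.0
    matches = sum(a == e for a, e in zip(actual, expected))
    return matches / longest


def first_divergence(actual, expected):
    """Index of the first step that differs, or None if the trajectories match."""
    for index, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return index
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


# Hypothetical tool names for a refund workflow.
expected = ["search_orders", "get_order", "issue_refund"]
actual = ["search_orders", "get_customer", "issue_refund"]

print(tool_selection_accuracy(actual, expected))
print(first_divergence(actual, expected))
```

Here the final tool call is correct even though step 2 was wrong, which is exactly the case that output-only evaluation would miss.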

Cost and Latency

| Dimension | Typical Range |
|---|---|
| LLM-as-judge per eval | $0.01-0.10 |
| 500-case eval suite | $5-50 per run |
| Full suite latency | 2-10 minutes (parallelized) |
| Production monitoring | Continuous sampling, ~1-5% of requests |

Cost optimization: Use smaller models (Haiku, GPT-4o-mini) for routine evals, larger models for calibration and edge cases.
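
The suite cost in the table follows directly from the per-eval price, and production sampling can be made deterministic by hashing a request ID. Both helpers below are illustrative sketches: the function names, the 2% default rate, and the `req-N` ID format are assumptions.

```python
import hashlib


def suite_cost(num_cases: int, cost_per_eval: float, metrics_per_case: int = 1) -> float:
    """Estimated judge cost in dollars for one full suite run."""
    return num_cases * metrics_per_case * cost_per_eval


def should_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically pick roughly `rate` of requests for online evaluation.

    Hashing the ID keeps the decision stable across retries and services.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


low = suite_cost(500, 0.01)   # cheapest judge, one metric per case
high = suite_cost(500, 0.10)  # most expensive judge, one metric per case
print(f"500-case suite: ${low:.2f} to ${high:.2f} per run")

sampled = sum(should_sample(f"req-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 requests")
```

Note that scoring several metrics per case multiplies the cost, so a 500-case suite with five judge-based metrics sits at five times the table's range.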

Emerging Standards

  • OpenTelemetry as the standard for LLM tracing and observability
  • Traceability: Link every eval score to exact prompt version, model, and dataset
  • CI/CD integration: Evals as first-class pipeline stages, not afterthoughts
  • Governance hooks: Audit trails, score history, approval workflows for enterprise
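
Traceability mostly comes down to storing each score next to the identifiers that produced it. A minimal sketch of such a record follows; the field names, the 12-character fingerprint, and the model and dataset labels are all hypothetical, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def prompt_fingerprint(prompt_text: str) -> str:
    """Short content hash so any prompt edit yields a new version ID."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EvalRecord:
    metric: str
    score: float
    prompt_version: str
    model: str
    dataset_version: str


record = EvalRecord(
    metric="faithfulness",
    score=0.91,                                   # illustrative value
    prompt_version=prompt_fingerprint("Answer using only the context."),
    model="example-model-v1",                     # hypothetical model name
    dataset_version="support-faq-2026-01",        # hypothetical dataset tag
)

# One JSON line per score is enough for an audit trail or score history.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```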

Practical Starting Guide

If building from scratch:

  1. Start with DeepEval (free, Pytest integration) for development-time evals
  2. Create 50-100 representative test cases covering your key use cases
  3. Define 3-5 core metrics (faithfulness, relevance, and domain-specific ones)
  4. Add to CI pipeline — fail the build if scores drop below thresholds
  5. Add Arize Phoenix for production monitoring

If using managed tools:

  • Braintrust for the full lifecycle (datasets → evals → experiments → CI → monitoring)
  • LangSmith if your stack is LangChain/LangGraph

Sources