Data Engineering for AI

The quality of your AI product is bounded by the quality of your data. Fine-tuning, RAG, and evals all depend on data pipelines. This page covers the engineering — not the AI, but the plumbing that makes AI work.

RAG Data Pipeline (Production)

Sources (docs, APIs, DBs)
  → Ingest (webhook/CDC/scheduled)
  → Clean & Parse (extract text, strip formatting)
  → Chunk (strategy depends on content type)
  → Embed (batch, with caching)
  → Vector Store (upsert, with metadata)
  → Freshness Monitor (staleness alerts)

Incremental Ingestion (Not One-Time Bulk)

Production RAG requires change detection:

| Pattern | How It Works | Best For |
| --- | --- | --- |
| Content hash comparison | Hash content on ingest, skip if unchanged | Document repos, knowledge bases |
| CDC (Change Data Capture) | Database triggers or transaction-log readers push row-level changes | Structured data sources |
| Webhook-driven | Source system notifies on change | SaaS integrations (Notion, Confluence) |
| Scheduled re-scan | Cron job checks for updates | Sources without change notifications |

Always track a last-updated timestamp per source and surface it on a freshness dashboard.
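As an illustration of the content-hash pattern, here is a minimal sketch that skips unchanged documents and records a last-updated timestamp for each changed one. The `docs` and `seen` dictionaries are hypothetical stand-ins for a real source listing and a persistent state store.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(text: str) -> str:
    """Stable digest of a document's text (SHA-256, hex-encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(docs: dict[str, str], seen: dict[str, dict]) -> list[str]:
    """Return IDs of new or changed docs and update `seen` in place.

    `docs` maps doc ID -> current text; `seen` maps doc ID ->
    {"hash", "last_updated"}. Both shapes are assumptions for this sketch.
    """
    changed = []
    now = datetime.now(timezone.utc)
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen.get(doc_id, {}).get("hash") == digest:
            continue  # unchanged: skip re-chunking and re-embedding
        seen[doc_id] = {"hash": digest, "last_updated": now}
        changed.append(doc_id)
    return changed


state: dict[str, dict] = {}
first = detect_changes({"a": "hello", "b": "world"}, state)   # both docs are new
second = detect_changes({"a": "hello", "b": "world!"}, state)  # only "b" changed
```

Only the IDs returned by `detect_changes` need to flow into the chunk and embed stages; the stored timestamps are what a freshness dashboard would read.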

Chunking Strategies

A NAACL 2025 study (Vectara) across 25 chunking configs and 48 embedding models found that chunking strategy matters as much as embedding model choice.

| Strategy | Chunk Size or Split Rule | Best For | Complexity |
| --- | --- | --- | --- |
| Fixed-size | 512-1024 tokens, 10-20% overlap | Reliable baseline for most content | Low |
| Recursive character | Split on "\n\n", then "\n", then ". " | Code, mixed-format documents | Low |
| Markdown header | Split on #, ##, ### | Documentation, wikis | Low |
| Semantic | Split on topic boundaries (embedding similarity) | Heterogeneous corpora | High |

Practical advice: Start with fixed-size (512 tokens, 50-token overlap). Only move to semantic chunking if eval metrics show retrieval quality issues with fixed-size.
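A minimal sketch of the fixed-size baseline with overlap. It assumes the text has already been tokenized into a list; the placeholder tokens below stand in for real tokenizer output, and `chunk_fixed` is a hypothetical helper name.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Placeholder tokens; in practice these come from the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_fixed(tokens)
```

With 1,200 tokens and the defaults this yields three chunks, and the last 50 tokens of each chunk repeat as the first 50 of the next, so a sentence cut at a boundary still appears whole in one of them.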

Embedding Models (April 2026)

| Model | Dimensions | Max Context | Best For |
| --- | --- | --- | --- |
| Voyage AI voyage-3-large | 1024 | 32K tokens | Highest MTEB score, long documents |
| OpenAI text-embedding-3-small | 1536 | 8K tokens | Cost-effective, good enough for most |
| OpenAI text-embedding-3-large | 3072 | 8K tokens | Higher precision when small isn’t enough |

Always evaluate on your domain before choosing. MTEB leaderboard rankings don’t transfer perfectly to every use case.
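One way to run that domain evaluation is recall@k over a small set of labeled queries. The sketch below assumes each query has exactly one known relevant document and that embeddings are already computed; the toy vectors and the `model_a` / `model_b` names are invented for illustration, not real model output.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, doc_vecs, relevant, k=3):
    """Fraction of queries whose labeled relevant doc appears in the top-k results."""
    hits = 0
    for qid, qvec in query_vecs.items():
        ranked = sorted(doc_vecs, key=lambda d: cosine(qvec, doc_vecs[d]), reverse=True)
        hits += relevant[qid] in ranked[:k]
    return hits / len(query_vecs)


# Toy embeddings (assumptions): same documents and labels, two candidate models.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
labels = {"q1": "d1", "q2": "d2"}
model_a = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
model_b = {"q1": [0.9, 0.1], "q2": [0.9, 0.1]}  # places q2 near the wrong document

score_a = recall_at_k(model_a, docs, labels, k=1)
score_b = recall_at_k(model_b, docs, labels, k=1)
```

Running the same labeled queries through each candidate model gives a like-for-like number on your own content, which is the comparison a public leaderboard cannot provide.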

Embedding Pipeline Pattern

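# Batch embedding with caching
import hashlib

def embed_and_upsert(chunks, embed_batch, embedding_cache, vector_store, model_version):
    """Embed only uncached chunks in one batch call, then upsert every chunk."""
    def key(chunk):
        # hashlib is stable across processes; the built-in hash() is salted per run
        return hashlib.sha256(f"{model_version}:{chunk.content}".encode()).hexdigest()

    misses = [c for c in chunks if key(c) not in embedding_cache]
    if misses:
        # embed_batch (assumed): takes a list of strings, returns one vector per string
        vectors = embed_batch([c.content for c in misses])
        for c, v in zip(misses, vectors):
            embedding_cache[key(c)] = v
    for c in chunks:
        vector_store.upsert(c.id, embedding_cache[key(c)], c.metadata)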

Data Quality Monitoring

| Metric | What to Monitor | Alert When |
| --- | --- | --- |
| Embedding drift | Cosine similarity distribution shift over time | Distribution diverges >10% from baseline |
| Retrieval relevance | Average relevance score of retrieved chunks | Drops below threshold |
| Chunk hit rate | % of stored chunks that actually get retrieved | <5% of chunks ever retrieved (over-indexing) |
| Source freshness | Last-updated timestamp per source | Any source >7 days stale |
| Ingestion errors | Failed parses, encoding issues, timeouts | >1% failure rate |
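Three of these checks (source freshness, chunk hit rate, ingestion errors) reduce to simple arithmetic over counters and timestamps. The sketch below is one possible shape; the function names and input structures are assumptions, and the thresholds are the ones from the table.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_updated, now, max_age_days=7):
    """Sources whose last update is older than the freshness budget."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, ts in last_updated.items() if ts < cutoff)


def chunk_hit_rate(stored_ids, retrieved_ids):
    """Share of stored chunks that were retrieved at least once in the window."""
    return len(stored_ids & retrieved_ids) / len(stored_ids) if stored_ids else 0.0


def collect_alerts(last_updated, stored_ids, retrieved_ids, failed, total, now):
    """Apply the table's thresholds and return human-readable alert strings."""
    alerts = [f"stale source: {s}" for s in stale_sources(last_updated, now)]
    if chunk_hit_rate(stored_ids, retrieved_ids) < 0.05:
        alerts.append("chunk hit rate below 5%")
    if total and failed / total > 0.01:
        alerts.append("ingestion failure rate above 1%")
    return alerts


now = datetime(2026, 4, 15, tzinfo=timezone.utc)
alerts = collect_alerts(
    last_updated={"wiki": now - timedelta(days=1), "crm": now - timedelta(days=9)},
    stored_ids={f"c{i}" for i in range(100)},
    retrieved_ids={"c1", "c2", "c3"},
    failed=2,
    total=100,
    now=now,
)
```

Embedding drift and retrieval relevance need a stored baseline distribution and relevance labels respectively, so they are left out of this sketch.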

Synthetic Data Generation

The dominant pattern for fine-tuning cost optimization:

1. Define task and format (classification, QA, generation)
2. Write 10-20 high-quality examples manually (golden set)
3. Use frontier model (Opus, GPT-4o) to generate 10K-100K variations
4. Filter: remove low-quality, deduplicate, validate format
5. Fine-tune smaller model on synthetic dataset
6. Evaluate against golden set + held-out real data

Pitfalls:

  • Model collapse: synthetic data amplifies the frontier model’s biases
  • Distribution mismatch: synthetic queries rarely match what real users ask
  • Mitigation for both: always mix 10-20% real data into the synthetic training set
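Step 4 (filtering) and the real-data mitigation can be sketched as follows. This assumes the generator emits one JSON object per line with `question` and `answer` fields; that schema and both helper names are hypothetical, and the quality filter here only covers format validation and near-exact deduplication, not semantic quality.

```python
import json
import random


def filter_synthetic(raw_lines, required_keys=("question", "answer")):
    """Drop malformed JSON, missing or empty fields, and duplicates (case/whitespace-insensitive)."""
    seen, kept = set(), []
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # generator produced invalid JSON
        if not isinstance(rec, dict):
            continue
        if any(not isinstance(rec.get(k), str) or not rec[k].strip() for k in required_keys):
            continue  # format validation failed
        fingerprint = tuple(" ".join(rec[k].lower().split()) for k in required_keys)
        if fingerprint in seen:
            continue  # duplicate after normalization
        seen.add(fingerprint)
        kept.append(rec)
    return kept


def mix_with_real(synthetic, real, real_fraction=0.15, seed=0):
    """Return a shuffled training set in which about `real_fraction` of examples are real."""
    n_real = round(len(synthetic) * real_fraction / (1 - real_fraction))
    n_real = min(n_real, len(real))  # cannot sample more real examples than exist
    rng = random.Random(seed)
    mixed = synthetic + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed


raw = [
    '{"question": "What is CDC?", "answer": "Change data capture."}',
    '{"question": "what is  cdc?", "answer": "Change data capture."}',
    '{"question": "Missing answer"}',
    "not json at all",
    '{"question": "What is RAG?", "answer": "Retrieval-augmented generation."}',
]
clean = filter_synthetic(raw)

synthetic = [{"source": "synthetic", "id": i} for i in range(85)]
real = [{"source": "real", "id": i} for i in range(30)]
train = mix_with_real(synthetic, real)
```

The fixed seed makes the mix reproducible, which matters when comparing fine-tuning runs against the same golden set.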

Data Labeling

| Tool | Type | Best For |
| --- | --- | --- |
| Label Studio | Open source, self-hosted | Full control, custom workflows |
| Argilla | Open source, HuggingFace integrated | ML-focused labeling, active learning |
| Scale AI | Managed service | Volume labeling, enterprise |

Active learning pattern: Use model confidence to prioritize what to label. Label the examples the model is most uncertain about — maximum information gain per label.
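A minimal sketch of uncertainty sampling, using prediction entropy as the uncertainty measure (one common choice among several; margin or least-confidence scoring would slot in the same way). The predicted probabilities below are invented for illustration.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions: dict[str, list[float]], budget: int) -> list[str]:
    """Pick the `budget` unlabeled examples the model is least certain about."""
    ranked = sorted(predictions, key=lambda ex: entropy(predictions[ex]), reverse=True)
    return ranked[:budget]


# Hypothetical class probabilities from the current model on unlabeled examples.
preds = {
    "ex1": [0.98, 0.02],
    "ex2": [0.55, 0.45],
    "ex3": [0.70, 0.30],
    "ex4": [0.50, 0.50],
}
to_label = select_for_labeling(preds, budget=2)
```

Here the two near-coin-flip predictions (`ex4`, then `ex2`) are sent to annotators first, while the confident `ex1` is left unlabeled.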

PII Handling

Strip PII before embedding, not after retrieval:

Ingestion pipeline:
  → Entity recognition (spaCy, Microsoft Presidio)
  → Redact or tokenize PII
  → Embed redacted content
  → Store PII mapping separately (if needed for reconstruction)

Retrieval:
  → Access control at chunk level
  → Tag chunks with permission scopes
  → Filter at retrieval time, not after generation
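The two halves of this flow can be sketched in a few lines. The regexes below are deliberately simple stand-ins for a real entity recognizer such as spaCy or Presidio, and the `Chunk` class and scope names are hypothetical; a production system would normally push the scope filter into the vector store query rather than filter in memory.

```python
import re
from dataclasses import dataclass, field

# Simplified patterns (assumption): a real recognizer covers far more entity types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with placeholder tokens; return redacted text and a token -> value map."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        def _sub(match, label=label):
            token = f"<{label}_{len(mapping) + 1}>"
            mapping[token] = match.group(0)  # keep this map in a separate, access-controlled store
            return token
        text = pattern.sub(_sub, text)
    return text, mapping


@dataclass
class Chunk:
    text: str
    scopes: set[str] = field(default_factory=set)


def allowed_chunks(candidates: list[Chunk], user_scopes: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one permission scope with the user."""
    return [c for c in candidates if c.scopes & user_scopes]


redacted, pii_map = redact("Contact jane@example.com or 555-123-4567 for access")
store = [Chunk(redacted, {"hr"}), Chunk("Public onboarding guide", {"hr", "all-staff"})]
visible = allowed_chunks(store, user_scopes={"all-staff"})
```

Only `redacted` is embedded and stored; `pii_map` exists solely so that authorized callers can reconstruct the original text when that is required.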

Minimum Viable Datasets

| Task | Minimum for Few-Shot | Minimum for Fine-Tuning |
| --- | --- | --- |
| Classification | 50-100 examples per class | 500+ per class |
| QA / RAG | 10-20 high-quality documents | 1K+ Q&A pairs |
| Generation | 5-10 examples in prompt | 5K-10K examples |
| Extraction | 20-30 annotated examples | 500+ annotated documents |

Starting advice: Even 10-20 high-quality documents can power effective RAG if chunking and retrieval are solid. Data quality >>> data quantity.

Sources