Data Engineering for AI

The quality of your AI product is bounded by the quality of your data. Fine-tuning, RAG, and evals all depend on data pipelines. This page covers the engineering — not the AI, but the plumbing that makes AI work.

RAG Data Pipeline (Production)

Sources (docs, APIs, DBs)
  → Ingest (webhook/CDC/scheduled)
  → Clean & Parse (extract text, strip formatting)
  → Chunk (strategy depends on content type)
  → Embed (batch, with caching)
  → Vector Store (upsert, with metadata)
  → Freshness Monitor (staleness alerts)

Incremental Ingestion (Not One-Time Bulk)

Production RAG requires change detection:

Pattern	How It Works	Best For
Content hash comparison	Hash content on ingest, skip if unchanged	Document repos, knowledge bases
CDC (Change Data Capture)	Database triggers push changes	Structured data sources
Webhook-driven	Source system notifies on change	SaaS integrations (Notion, Confluence)
Scheduled re-scan	Cron job checks for updates	Sources without change notifications

Always track last-updated timestamp per source on a freshness dashboard.

Chunking Strategies

A NAACL 2025 study (Vectara) across 25 chunking configs and 48 embedding models found: chunking strategy matters as much as embedding model choice.

Strategy	Chunk Size	Best For	Complexity
Fixed-size	512-1024 tokens, 10-20% overlap	Reliable baseline for most content	Low
Recursive character	Split on `\n\n` → `\n` → `.` →	Code, mixed-format documents	Low
Markdown header	Split on `#`, `##`, `###`	Documentation, wikis	Low
Semantic	Split on topic boundaries (embedding similarity)	Heterogeneous corpora	High

Practical advice: Start with fixed-size (512 tokens, 50-token overlap). Only move to semantic chunking if eval metrics show retrieval quality issues with fixed-size.

Embedding Models (April 2026)

Model	Dimensions	Max Context	Best For
Voyage AI voyage-3-large	1024	32K tokens	Highest MTEB score, long documents
OpenAI text-embedding-3-small	1536	8K tokens	Cost-effective, good enough for most
OpenAI text-embedding-3-large	3072	8K tokens	Higher precision when small isn’t enough

Always evaluate on your domain before choosing. MTEB leaderboard rankings don’t transfer perfectly to every use case.

Embedding Pipeline Pattern

# Batch embedding with caching
for chunk in chunks:
    cache_key = hash(chunk.content + model_version)
    if cache_key in embedding_cache:
        embedding = embedding_cache[cache_key]
    else:
        embedding = embed(chunk.content)  # batch these
        embedding_cache[cache_key] = embedding
    vector_store.upsert(chunk.id, embedding, chunk.metadata)

Data Quality Monitoring

Metric	What to Monitor	Alert When
Embedding drift	Cosine similarity distribution shift over time	Distribution diverges >10% from baseline
Retrieval relevance	Average relevance score of retrieved chunks	Drops below threshold
Chunk hit rate	% of stored chunks that actually get retrieved	<5% of chunks ever retrieved (over-indexing)
Source freshness	Last-updated timestamp per source	Any source >7 days stale
Ingestion errors	Failed parses, encoding issues, timeout	>1% failure rate

Synthetic Data Generation

The dominant pattern for fine-tuning cost optimization:

1. Define task and format (classification, QA, generation)
2. Write 10-20 high-quality examples manually (golden set)
3. Use frontier model (Opus, GPT-4o) to generate 10K-100K variations
4. Filter: remove low-quality, deduplicate, validate format
5. Fine-tune smaller model on synthetic dataset
6. Evaluate against golden set + held-out real data

Pitfalls:

Model collapse: synthetic data amplifies frontier model’s biases
Distribution mismatch: synthetic != real user queries
Mitigation: Always mix 10-20% real data into synthetic training set

Data Labeling

Tool	Type	Best For
Label Studio	Open source, self-hosted	Full control, custom workflows
Argilla	Open source, HuggingFace integrated	ML-focused labeling, active learning
Scale AI	Managed service	Volume labeling, enterprise

Active learning pattern: Use model confidence to prioritize what to label. Label the examples the model is most uncertain about — maximum information gain per label.

PII Handling

Strip PII before embedding, not after retrieval:

Ingestion pipeline:
  → Entity recognition (spaCy, Microsoft Presidio)
  → Redact or tokenize PII
  → Embed redacted content
  → Store PII mapping separately (if needed for reconstruction)

Retrieval:
  → Access control at chunk level
  → Tag chunks with permission scopes
  → Filter at retrieval time, not after generation

Minimum Viable Datasets

Task	Minimum for Few-Shot	Minimum for Fine-Tuning
Classification	50-100 examples per class	500+ per class
QA / RAG	10-20 high-quality documents	1K+ Q&A pairs
Generation	5-10 examples in prompt	5K-10K examples
Extraction	20-30 annotated examples	500+ annotated documents

Starting advice: Even 10-20 high-quality documents can power effective RAG if chunking and retrieval are solid. Data quality >>> data quantity.

KahWei's Wiki

Explorer

Data Engineering for AI

Data Engineering for AI

RAG Data Pipeline (Production)

Incremental Ingestion (Not One-Time Bulk)

Chunking Strategies

Embedding Models (April 2026)

Embedding Pipeline Pattern

Data Quality Monitoring

Synthetic Data Generation

Data Labeling

PII Handling

Minimum Viable Datasets

Sources

Graph View

Table of Contents

Backlinks

KahWei's Wiki

Explorer

Data Engineering for AI

Data Engineering for AI

RAG Data Pipeline (Production)

Incremental Ingestion (Not One-Time Bulk)

Chunking Strategies

Embedding Models (April 2026)

Embedding Pipeline Pattern

Data Quality Monitoring

Synthetic Data Generation

Data Labeling

PII Handling

Minimum Viable Datasets

Related Pages

Sources

Graph View

Table of Contents

Backlinks