AI Product Architecture

This page covers the complete system around the AI component: auth, billing, API, frontend, queues, and monitoring. That surrounding system is what separates “I built an AI demo” from “I shipped an AI product.”

Production Stack

┌──────────────────────────────────────────┐
│  Client: React + AI SDK useChat          │
│  (streaming, feedback, error handling)   │
└──────────────┬───────────────────────────┘
               │
┌──────────────┴───────────────────────────┐
│  API Gateway / Rate Limiter              │
│  (cost-aware, per-tenant budgets)        │
└──────────────┬───────────────────────────┘
               │
┌──────────────┴───────────────────────────┐
│  App Server (Next.js / FastAPI)          │
│  ┌──────┬──────┬──────┬───────┐          │
│  │ LLM  │ RAG  │ Job  │Auth + │          │
│  │Router│ Svc  │Queue │Billing│          │
│  └──┬───┴──┬───┴──┬───┴───┬───┘          │
└─────┼──────┼──────┼───────┼──────────────┘
      │      │      │       │
   Model   Vector  Redis/  Stripe
   APIs    DB      BullMQ  Billing

Auth + Billing for AI

Usage-Based Pricing

Stripe, which acquired Metronome, now supports native token metering:

Send granular usage data (tokens, API calls, agent tasks)
Apply markup percentages automatically
Generate invoices with line-item breakdown
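
To make the metering-plus-markup idea concrete, here is a minimal sketch that turns token usage events into per-tenant billed amounts. The `UsageEvent` shape, the `example-model` name, and the prices are illustrative assumptions, not Stripe's schema or real provider pricing; in a real setup the usage would be reported to the billing provider rather than summed locally.

```typescript
// Hypothetical usage record; field names are illustrative, not a Stripe schema.
interface UsageEvent {
  tenantId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Assumed provider prices in USD per million tokens (placeholder numbers).
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "example-model": { input: 3, output: 15 },
};

// Provider cost of a single usage event, in USD.
function providerCostUsd(e: UsageEvent): number {
  const price = PRICE_PER_MTOK[e.model];
  if (!price) throw new Error(`no price configured for ${e.model}`);
  return (e.inputTokens * price.input + e.outputTokens * price.output) / 1_000_000;
}

// Sum billed amounts per tenant after applying a markup percentage.
function invoiceLines(events: UsageEvent[], markupPct: number): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of events) {
    const billed = providerCostUsd(e) * (1 + markupPct / 100);
    totals.set(e.tenantId, (totals.get(e.tenantId) ?? 0) + billed);
  }
  return totals;
}
```

Production billing code would normally keep money in integer minor units (cents or smaller) instead of floating-point dollars; floats are used here only to keep the sketch short.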

Credit System Pattern

User signs up → receives 100 free credits
Each AI request → deducts estimated credits
Credits run out → prompt upgrade or purchase
Auto top-up option for paying users
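
The credit flow above can be sketched as a small in-memory ledger. This is an illustration under assumptions: `CreditAccount`, `charge`, and `autoTopUpAmount` are hypothetical names, and the top-up simply adds credits where a real system would first take a payment.

```typescript
// Hypothetical account shape; autoTopUpAmount is only set for paying users.
interface CreditAccount {
  balance: number;
  autoTopUpAmount?: number;
}

type ChargeResult =
  | { ok: true; balance: number; toppedUp: boolean }
  | { ok: false; reason: "insufficient_credits" };

const FREE_SIGNUP_CREDITS = 100; // matches the signup grant in the pattern above

function newAccount(): CreditAccount {
  return { balance: FREE_SIGNUP_CREDITS };
}

// Deduct the estimated credits for one AI request.
function charge(account: CreditAccount, estimatedCredits: number): ChargeResult {
  let toppedUp = false;
  if (account.balance < estimatedCredits) {
    const topUp = account.autoTopUpAmount ?? 0;
    // Out of credits and no (sufficient) auto top-up: caller should prompt an upgrade.
    if (account.balance + topUp < estimatedCredits) {
      return { ok: false, reason: "insufficient_credits" };
    }
    account.balance += topUp; // stands in for a real payment + credit grant
    toppedUp = true;
  }
  account.balance -= estimatedCredits;
  return { ok: true, balance: account.balance, toppedUp };
}
```

Because the deduction uses an estimate, a real implementation would likely reconcile against actual token usage once the request finishes.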

Implementation Options

Tool	Approach	Best For
Stripe Metering	Native usage events → invoice	SaaS with predictable pricing
Orb	Usage-based billing platform	Complex pricing tiers
Custom	Track in DB, bill via Stripe	Maximum flexibility

API Design for AI

Streaming Endpoints

Use SSE (not WebSockets) for streaming responses:

POST /api/chat
Content-Type: application/json
→ Response: Content-Type: text/event-stream

data: {"type":"text","content":"Hello"}
data: {"type":"text","content":" world"}
data: {"type":"tool_use","name":"search","input":{...}}
data: {"type":"done"}

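SSE works over standard HTTP and needs simpler infrastructure than WebSockets. The browser's EventSource API also reconnects automatically, but it only issues GET requests, so a POST endpoint like the one above is typically read with fetch and a stream reader, and the client handles reconnection itself.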
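
As an illustration of the wire format, the sketch below encodes events as SSE frames and parses them back from a buffer that may end mid-frame, which is what a streaming client sees. The `ChatEvent` type mirrors the example events above; `encodeSse` and `parseSse` are hypothetical helper names, and the parser is simplified to one `data:` line per frame.

```typescript
// Event shapes taken from the example stream above.
type ChatEvent =
  | { type: "text"; content: string }
  | { type: "tool_use"; name: string; input: unknown }
  | { type: "done" };

// One SSE frame: a "data:" line followed by a blank line.
function encodeSse(event: ChatEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Parse every complete frame in the buffer; return the unfinished tail
// so the caller can prepend it to the next network chunk.
function parseSse(buffer: string): { events: ChatEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? ""; // last piece is incomplete (or empty)
  const events: ChatEvent[] = [];
  for (const frame of frames) {
    for (const line of frame.split("\n")) {
      if (line.startsWith("data: ")) {
        events.push(JSON.parse(line.slice("data: ".length)) as ChatEvent);
      }
    }
  }
  return { events, rest };
}
```

Keeping the unparsed tail matters because network chunks do not align with frame boundaries; a JSON payload can be split across two reads.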

Long-Running Tasks (>30s)

POST /api/process → 202 Accepted, { "job_id": "abc123" }

GET  /api/jobs/abc123 → { "status": "processing", "progress": 45 }
  or
Webhook → POST https://your-app.com/webhooks/job-complete
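
A framework-free sketch of the accept-then-poll contract follows. The in-memory `jobs` map and the function names (`submitJob`, `getJob`, `reportProgress`) are assumptions for illustration; a real service would persist jobs and hand the work to a queue.

```typescript
type JobStatus = "queued" | "processing" | "complete" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  progress: number; // 0-100
}

interface HttpResult {
  status: number;
  body: unknown;
}

// In-memory stand-in for a job table.
const jobs = new Map<string, Job>();
let nextJobNumber = 1;

// POST /api/process: register the job and return immediately.
function submitJob(): HttpResult {
  const id = `job_${nextJobNumber++}`;
  jobs.set(id, { id, status: "queued", progress: 0 });
  return { status: 202, body: { job_id: id } };
}

// GET /api/jobs/:id: report current status for polling clients.
function getJob(id: string): HttpResult {
  const job = jobs.get(id);
  if (!job) return { status: 404, body: { error: "job not found" } };
  return { status: 200, body: { status: job.status, progress: job.progress } };
}

// Called by the background worker as it makes progress.
function reportProgress(id: string, status: JobStatus, progress: number): void {
  const job = jobs.get(id);
  if (!job) throw new Error(`unknown job ${id}`);
  job.status = status;
  job.progress = progress;
}
```

Polling and webhooks are not exclusive: the worker can call `reportProgress` for pollers and also fire the completion webhook when the job reaches a terminal status.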

Rate Limiting

Rate limit by estimated cost, not just request count:

Traditional: max 100 requests/minute
AI-aware:    max $1.00/minute per user

A single agentic conversation can cost $0.20 to $0.50, so request-count limits alone don’t prevent cost spikes.
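
One way to enforce a dollar budget per minute is a sliding window over estimated spend. The sketch below is a single-process illustration; `CostRateLimiter` and its method names are hypothetical, and a multi-server deployment would keep the window in shared storage such as Redis.

```typescript
interface SpendEntry {
  at: number; // timestamp in ms
  usd: number; // estimated cost of the request
}

class CostRateLimiter {
  private readonly maxUsdPerWindow: number;
  private readonly windowMs: number;
  private readonly spend = new Map<string, SpendEntry[]>();

  constructor(maxUsdPerWindow: number, windowMs: number) {
    this.maxUsdPerWindow = maxUsdPerWindow;
    this.windowMs = windowMs;
  }

  // Returns true and records the spend if the request fits the budget.
  tryConsume(userId: string, estimatedUsd: number, now: number): boolean {
    const cutoff = now - this.windowMs;
    // Drop entries that have aged out of the window.
    const recent = (this.spend.get(userId) ?? []).filter((e) => e.at > cutoff);
    const total = recent.reduce((sum, e) => sum + e.usd, 0);
    const allowed = total + estimatedUsd <= this.maxUsdPerWindow;
    if (allowed) recent.push({ at: now, usd: estimatedUsd });
    this.spend.set(userId, recent);
    return allowed;
  }
}
```

A caller would estimate cost before the LLM call (for example from prompt length and the model's price) and pass `Date.now()` as `now`; taking the clock as a parameter keeps the limiter easy to test.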

Queue / Async Patterns

Use BullMQ (backed by Redis) or SQS for job queues.

Accept request → return job ID → process async → notify on complete

Track progress with phases:

"queued" → "processing" → "generating" → "complete"
           (with progress %)

Essential for: batch processing, document processing, multi-step agent tasks, anything >30s.
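
The phase progression can be enforced with a small forward-only tracker, so a late or out-of-order worker update never moves a job backwards. This is a sketch; `JobProgress` and `advance` are hypothetical names and not part of BullMQ or SQS.

```typescript
// Phases in the order listed above.
const PHASES = ["queued", "processing", "generating", "complete"] as const;
type Phase = (typeof PHASES)[number];

interface JobProgress {
  phase: Phase;
  percent: number; // overall progress, 0-100
}

// Apply a worker update: phases may only move forward and the
// reported percentage never decreases.
function advance(current: JobProgress, phase: Phase, percent: number): JobProgress {
  if (PHASES.indexOf(phase) < PHASES.indexOf(current.phase)) {
    throw new Error(`cannot move from ${current.phase} back to ${phase}`);
  }
  const clamped = phase === "complete" ? 100 : Math.min(100, Math.max(0, percent));
  return { phase, percent: Math.max(current.percent, clamped) };
}
```

Clients polling the job then always see monotonic progress, which avoids progress bars that jump backwards.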

Multi-Tenant Isolation

Isolate at three levels:

Level	What to Isolate	Pattern
Data	Each tenant’s documents, embeddings	Row-level security or schema-per-tenant in vector DB
Cost	Per-tenant budget and usage	Budget caps + usage tracking per tenant ID
Config	Prompt templates, model, temperature	Per-tenant config table, loaded at request time

Key: Include tenant ID in every LLM call’s metadata for cost attribution and debugging.
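
The config and cost levels can be combined at request time, as in the sketch below: load the tenant's config, check its budget, and attach the tenant ID as metadata. All names here (`TenantConfig`, `buildRequest`, the `{{input}}` placeholder, `example-model`) are assumptions for illustration, and the `metadata` field stands in for whatever tagging mechanism the chosen provider or observability tool offers.

```typescript
interface TenantConfig {
  promptTemplate: string; // contains a {{input}} placeholder
  model: string;
  temperature: number;
  monthlyBudgetUsd: number;
}

interface LlmRequest {
  model: string;
  temperature: number;
  prompt: string;
  metadata: { tenantId: string };
}

// In-memory stand-ins for a per-tenant config table and a usage table.
const tenantConfigs = new Map<string, TenantConfig>([
  ["acme", {
    promptTemplate: "You are a support bot for Acme.\nUser: {{input}}",
    model: "example-model",
    temperature: 0.2,
    monthlyBudgetUsd: 50,
  }],
]);
const tenantSpendUsd = new Map<string, number>();

function recordSpend(tenantId: string, usd: number): void {
  tenantSpendUsd.set(tenantId, (tenantSpendUsd.get(tenantId) ?? 0) + usd);
}

// Build a provider-agnostic request, enforcing the tenant's budget cap.
function buildRequest(tenantId: string, userInput: string): LlmRequest {
  const cfg = tenantConfigs.get(tenantId);
  if (!cfg) throw new Error(`unknown tenant ${tenantId}`);
  if ((tenantSpendUsd.get(tenantId) ?? 0) >= cfg.monthlyBudgetUsd) {
    throw new Error(`budget exceeded for tenant ${tenantId}`);
  }
  return {
    model: cfg.model,
    temperature: cfg.temperature,
    // Function replacer avoids "$&"-style patterns in user input being interpreted.
    prompt: cfg.promptTemplate.replace("{{input}}", () => userInput),
    metadata: { tenantId },
  };
}
```

Data isolation is deliberately not shown; it belongs in the database layer (row-level security or per-tenant schemas) rather than in application code like this.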

Frontend Architecture

React Patterns for AI

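// Vercel AI SDK pattern (hook shape as of AI SDK 4.x; later major versions may differ)
import { useChat } from '@ai-sdk/react'

// Inside a React component:
const { messages, input, handleSubmit, isLoading, error } = useChat({
  api: '/api/chat',
  onError: (err) => showInlineError(err),  // app-defined handler; render inline, not as a toast
})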

State Management for Streaming

State	UI
`idle`	Input focused, ready
`submitted`	Skeleton shimmer loading
`streaming`	Progressive text rendering + stop button
`error`	Inline error + retry button
`complete`	Full response + feedback buttons
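
The states in the table can be modeled as a reducer so that each UI state is reachable only through valid transitions. This is a hand-rolled sketch, not the AI SDK's internal state handling; the action names (`submit`, `chunk`, `done`, `stop`, `fail`, `retry`) are assumptions.

```typescript
type ChatState =
  | { status: "idle" }
  | { status: "submitted" }
  | { status: "streaming"; text: string }
  | { status: "error"; message: string }
  | { status: "complete"; text: string };

type ChatAction =
  | { type: "submit" }
  | { type: "chunk"; content: string }
  | { type: "done" }
  | { type: "stop" }
  | { type: "fail"; message: string }
  | { type: "retry" };

// Invalid transitions return the current state unchanged.
function chatReducer(state: ChatState, action: ChatAction): ChatState {
  switch (action.type) {
    case "submit":
      return state.status === "idle" || state.status === "complete"
        ? { status: "submitted" }
        : state;
    case "retry":
      return state.status === "error" ? { status: "submitted" } : state;
    case "chunk":
      if (state.status === "submitted") return { status: "streaming", text: action.content };
      if (state.status === "streaming") {
        return { status: "streaming", text: state.text + action.content };
      }
      return state;
    case "done":
    case "stop":
      // The stop button keeps whatever text has streamed so far.
      return state.status === "streaming" ? { status: "complete", text: state.text } : state;
    case "fail":
      return state.status === "submitted" || state.status === "streaming"
        ? { status: "error", message: action.message }
        : state;
  }
}
```

In a React component this would plug into `useReducer(chatReducer, { status: "idle" })`, with the render branch chosen by `state.status`.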

See AI UX Patterns for detailed UI patterns.

Scaling Trajectory

MVP (0-1K users)

Single server, direct LLM API calls
Stripe basic billing
Langfuse free tier for monitoring
SQLite or Postgres for everything

Growth (1K-10K users)

Add job queue (BullMQ/Redis)
Caching layer (Redis for responses, prompt cache for LLM)
CDN for static assets
Helicone or Langfuse for cost monitoring
Per-tenant rate limits

Scale (10K-100K+ users)

Multi-region deployment
Model routing with provider fallback chains
Dedicated vector DB cluster
Enterprise observability (Arize/Datadog)
Cost-aware autoscaling
Multi-tenant data isolation audit

Real Architecture References

Product	Key Architecture Decision
Vercel v0	AI SDK + streaming RSC (React Server Components) + generative UI
Cursor	Custom streaming protocol, aggressive client-side caching, speculative execution
Intercom Fin	RAG over help center content, human handoff detection, resolution tracking
Jasper	Multi-model routing, template-based generation, brand voice fine-tuning

System Design Interview Checklist

When designing an AI product in an interview:
