AI Product Architecture

This note goes beyond the AI component to the complete system: auth, billing, API, frontend, queues, and monitoring. That surrounding system is what separates “I built an AI demo” from “I shipped an AI product.”

Production Stack

┌──────────────────────────────────────────┐
│  Client: React + AI SDK useChat          │
│  (streaming, feedback, error handling)    │
└──────────────┬───────────────────────────┘
               │
┌──────────────┴───────────────────────────┐
│  API Gateway / Rate Limiter              │
│  (cost-aware, per-tenant budgets)        │
└──────────────┬───────────────────────────┘
               │
┌──────────────┴───────────────────────────┐
│  App Server (Next.js / FastAPI)          │
│  ┌──────┬──────┬──────┬──────┐          │
│  │ LLM  │ RAG  │ Job  │Auth +│          │
│  │Router│ Svc  │Queue │Billing│         │
│  └──┬───┴──┬───┴──┬───┴──┬───┘          │
└─────┼──────┼──────┼──────┼───────────────┘
      │      │      │      │
   Model   Vector  Redis/  Stripe
   APIs    DB      BullMQ  Billing

Auth + Billing for AI

Usage-Based Pricing

Stripe, which acquired Metronome, now supports native token metering:

  • Send granular usage data (tokens, API calls, agent tasks)
  • Apply markup percentages automatically
  • Generate invoices with line-item breakdown
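To make the markup step concrete, here is a minimal sketch of turning token usage into invoice line items. It does not call Stripe; the `UsageEvent` shape, the model names, and the per-token prices are made-up assumptions for illustration only.

```typescript
// Hypothetical usage event; field names are assumptions, not a Stripe schema.
interface UsageEvent {
  tenantId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Illustrative provider cost in USD per 1M tokens (invented numbers).
const PRICE_PER_MILLION: Record<string, { input: number; output: number }> = {
  "small-model": { input: 0.5, output: 1.5 },
  "large-model": { input: 3, output: 15 },
};

interface LineItem {
  model: string;
  providerCostUsd: number;
  billedUsd: number;
}

// Aggregate one tenant's events per model, then apply a markup percentage.
function buildLineItems(events: UsageEvent[], markupPercent: number): LineItem[] {
  const costByModel = new Map<string, number>();
  for (const e of events) {
    const price = PRICE_PER_MILLION[e.model];
    if (!price) throw new Error(`No price configured for model ${e.model}`);
    const cost =
      (e.inputTokens / 1e6) * price.input + (e.outputTokens / 1e6) * price.output;
    costByModel.set(e.model, (costByModel.get(e.model) ?? 0) + cost);
  }
  return Array.from(costByModel.entries()).map(([model, providerCostUsd]) => ({
    model,
    providerCostUsd,
    billedUsd: providerCostUsd * (1 + markupPercent / 100),
  }));
}
```

A production system would likely store amounts as integer micro-dollars rather than floats to avoid rounding drift on invoices.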

Credit System Pattern

User signs up → receives 100 free credits
Each AI request → deducts estimated credits
Credits run out → prompt upgrade or purchase
Auto top-up option for paying users
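The flow above can be sketched as a small in-memory ledger. This is an illustration under assumptions: the class and method names are hypothetical, balances would live in a database in practice, and a real auto top-up would charge the user's payment method before adding credits.

```typescript
type ChargeResult =
  | { ok: true; remaining: number }
  | { ok: false; reason: "insufficient_credits" };

interface AutoTopUp {
  threshold: number; // top up when the balance drops below this
  amount: number;    // credits added per top-up
}

class CreditLedger {
  private balances = new Map<string, number>();
  private autoTopUps = new Map<string, AutoTopUp>();
  private signupCredits: number;

  constructor(signupCredits: number = 100) {
    this.signupCredits = signupCredits;
  }

  // New users receive the free signup credits exactly once.
  signUp(userId: string): void {
    if (!this.balances.has(userId)) this.balances.set(userId, this.signupCredits);
  }

  balance(userId: string): number {
    return this.balances.get(userId) ?? 0;
  }

  enableAutoTopUp(userId: string, config: AutoTopUp): void {
    this.autoTopUps.set(userId, config);
  }

  // Deduct the estimated credits, or signal that the UI should prompt an upgrade.
  charge(userId: string, estimatedCredits: number): ChargeResult {
    const current = this.balance(userId);
    if (estimatedCredits > current) {
      return { ok: false, reason: "insufficient_credits" };
    }
    let remaining = current - estimatedCredits;
    const topUp = this.autoTopUps.get(userId);
    if (topUp && remaining < topUp.threshold) {
      remaining += topUp.amount; // assumption: payment already succeeded
    }
    this.balances.set(userId, remaining);
    return { ok: true, remaining };
  }
}
```

Because the deduction uses an estimate, a real system would usually reconcile against actual token usage once the response finishes.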

Implementation Options

| Tool | Approach | Best For |
| --- | --- | --- |
| Stripe Metering | Native usage events → invoice | SaaS with predictable pricing |
| Orb | Usage-based billing platform | Complex pricing tiers |
| Custom | Track in DB, bill via Stripe | Maximum flexibility |

API Design for AI

Streaming Endpoints

Use SSE (not WebSockets) for streaming responses:

POST /api/chat
Content-Type: application/json
→ Response: Content-Type: text/event-stream

data: {"type":"text","content":"Hello"}

data: {"type":"text","content":" world"}

data: {"type":"tool_use","name":"search","input":{...}}

data: {"type":"done"}

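SSE works over standard HTTP and needs simpler infrastructure than WebSockets. The browser’s EventSource client also reconnects automatically, but it only issues GET requests, so a POST stream read with fetch needs its own retry logic.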
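As a rough sketch of the wire format, the helpers below encode events on the server and reassemble them on the client. On the wire, each SSE event ends with a blank line. The `StreamEvent` union mirrors the example payloads above; the `SseParser` class is a hypothetical simplification that only handles `data:` lines and `\n` line endings, not the full SSE specification.

```typescript
type StreamEvent =
  | { type: "text"; content: string }
  | { type: "tool_use"; name: string; input: unknown }
  | { type: "done" };

// Server side: one JSON payload per event, terminated by a blank line.
function encodeSseEvent(event: StreamEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Client side: network chunks can split an event anywhere, so buffer
// until a full blank-line-terminated frame has arrived.
class SseParser {
  private buffer = "";

  push(chunk: string): StreamEvent[] {
    this.buffer += chunk;
    const frames = this.buffer.split("\n\n");
    this.buffer = frames.pop() ?? ""; // last piece may be incomplete
    const events: StreamEvent[] = [];
    for (const frame of frames) {
      const data = frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n");
      if (data) events.push(JSON.parse(data) as StreamEvent);
    }
    return events;
  }
}
```

In use, the client would feed each decoded chunk from the response body into `push` and append any `text` events to the message being rendered.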

Long-Running Tasks (>30s)

POST /api/process → 202 Accepted, { "job_id": "abc123" }

GET  /api/jobs/abc123 → { "status": "processing", "progress": 45 }
  or
Webhook → POST https://your-app.com/webhooks/job-complete
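On the client, the polling variant might look like the sketch below. The `getStatus` callback stands in for `GET /api/jobs/:id`; the `failed` status, the option names, and the default interval are assumptions added for illustration.

```typescript
interface JobStatus {
  status: "queued" | "processing" | "generating" | "complete" | "failed";
  progress: number;
}

interface PollOptions {
  intervalMs?: number;
  maxAttempts?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

// Poll until the job reaches a terminal status or the attempt budget runs out.
async function pollJob(
  jobId: string,
  getStatus: (jobId: string) => Promise<JobStatus>,
  options: PollOptions = {},
): Promise<JobStatus> {
  const intervalMs = options.intervalMs ?? 1000;
  const maxAttempts = options.maxAttempts ?? 60;
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus(jobId);
    if (status.status === "complete" || status.status === "failed") return status;
    await sleep(intervalMs);
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} polls`);
}
```

Webhooks avoid the repeated requests entirely, so polling is mostly useful for browser clients that cannot receive a callback.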

Rate Limiting

Rate limit by estimated cost, not just request count:

Traditional: max 100 requests/minute
AI-aware:    max $1.00/minute per user

A single agentic conversation can cost $0.50, so request-count limits alone don’t prevent cost spikes.
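One way to enforce a dollar budget is a sliding window over estimated spend. The sketch below is an in-memory illustration with hypothetical names; a multi-server deployment would keep this state in a shared store such as Redis, and would likely count in integer micro-dollars instead of floats.

```typescript
interface SpendEntry {
  at: number;  // timestamp in ms
  usd: number; // estimated cost of the request
}

class CostRateLimiter {
  private spend = new Map<string, SpendEntry[]>();
  private maxUsdPerWindow: number;
  private windowMs: number;

  constructor(maxUsdPerWindow: number, windowMs: number = 60000) {
    this.maxUsdPerWindow = maxUsdPerWindow;
    this.windowMs = windowMs;
  }

  // Returns true and records the spend if the request fits the budget.
  tryConsume(userId: string, estimatedUsd: number, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Drop entries that have aged out of the window.
    const recent = (this.spend.get(userId) ?? []).filter((e) => e.at > cutoff);
    const spent = recent.reduce((sum, e) => sum + e.usd, 0);
    const allowed = spent + estimatedUsd <= this.maxUsdPerWindow;
    if (allowed) recent.push({ at: now, usd: estimatedUsd });
    this.spend.set(userId, recent);
    return allowed;
  }
}
```

With a $1.00 per minute budget, two $0.50 agent runs are accepted and a third request in the same minute is rejected, regardless of how few requests were made.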

Queue / Async Patterns

Use BullMQ (backed by Redis) or SQS for job queues.

Accept request → return job ID → process async → notify on complete

Track progress with phases:

"queued" → "processing" → "generating" → "complete"
           (with progress %)

Essential for: batch processing, document processing, multi-step agent tasks, anything >30s.
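The phase tracking can be sketched independently of the queue library. The `JobStore` below is a hypothetical in-memory stand-in for whatever BullMQ or SQS workers would update; it only illustrates forward-only phase transitions with a progress percentage.

```typescript
type JobPhase = "queued" | "processing" | "generating" | "complete";

const PHASE_ORDER: JobPhase[] = ["queued", "processing", "generating", "complete"];

interface Job {
  id: string;
  phase: JobPhase;
  progress: number; // 0-100
}

class JobStore {
  private jobs = new Map<string, Job>();
  private nextId = 1;

  // What the API handler would call before returning 202 Accepted.
  enqueue(): Job {
    const job: Job = { id: `job_${this.nextId++}`, phase: "queued", progress: 0 };
    this.jobs.set(job.id, job);
    return { ...job };
  }

  // Workers report progress; phases may only move forward.
  update(id: string, phase: JobPhase, progress: number): Job {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`Unknown job ${id}`);
    if (PHASE_ORDER.indexOf(phase) < PHASE_ORDER.indexOf(job.phase)) {
      throw new Error(`Cannot move job ${id} from ${job.phase} back to ${phase}`);
    }
    job.phase = phase;
    job.progress = phase === "complete" ? 100 : Math.max(0, Math.min(100, progress));
    return { ...job };
  }

  // What the status endpoint would read.
  get(id: string): Job | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }
}
```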

Multi-Tenant Isolation

Isolate at three levels:

| Level | What to Isolate | Pattern |
| --- | --- | --- |
| Data | Each tenant’s documents, embeddings | Row-level security or schema-per-tenant in vector DB |
| Cost | Per-tenant budget and usage | Budget caps + usage tracking per tenant ID |
| Config | Prompt templates, model, temperature | Per-tenant config table, loaded at request time |

Key: Include tenant ID in every LLM call’s metadata for cost attribution and debugging.
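A minimal sketch of the cost and config levels, assuming a hypothetical `TenantConfig` record and a provider-agnostic `LlmRequest` shape (neither is a real SDK type): the request is built from the tenant's config, rejected if the budget is spent, and always tagged with the tenant ID.

```typescript
interface TenantConfig {
  promptTemplate: string; // expected to contain the {{input}} placeholder
  model: string;
  temperature: number;
  monthlyBudgetUsd: number;
}

interface LlmRequest {
  model: string;
  temperature: number;
  prompt: string;
  metadata: { tenantId: string };
}

function buildTenantRequest(
  tenantId: string,
  userInput: string,
  configs: Map<string, TenantConfig>,
  spentThisMonthUsd: number,
): LlmRequest {
  const config = configs.get(tenantId);
  if (!config) throw new Error(`No config for tenant ${tenantId}`);
  // Cost isolation: refuse the call once the tenant's cap is reached.
  if (spentThisMonthUsd >= config.monthlyBudgetUsd) {
    throw new Error(`Tenant ${tenantId} has exhausted its budget`);
  }
  return {
    model: config.model,
    temperature: config.temperature,
    // A replacer function keeps "$" sequences in user input literal.
    prompt: config.promptTemplate.replace("{{input}}", () => userInput),
    // Tenant ID travels with every call for cost attribution and debugging.
    metadata: { tenantId },
  };
}
```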

Frontend Architecture

React Patterns for AI

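// Vercel AI SDK pattern (AI SDK 4.x-style hook fields; newer major versions differ)
import { useChat } from '@ai-sdk/react'

// Call inside a React component
const { messages, input, handleSubmit, isLoading, error } = useChat({
  api: '/api/chat',
  // showInlineError is an app-defined helper: render the error in the thread, not a toast
  onError: (err) => showInlineError(err),
})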

State Management for Streaming

| State | UI |
| --- | --- |
| idle | Input focused, ready |
| submitted | Skeleton shimmer loading |
| streaming | Progressive text rendering + stop button |
| error | Inline error + retry button |
| complete | Full response + feedback buttons |
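These states can be modeled as an explicit transition table so the UI never lands in an impossible combination. The state names come from the table above; the event names are assumptions chosen for illustration.

```typescript
type ChatState = "idle" | "submitted" | "streaming" | "error" | "complete";
type ChatEvent = "submit" | "first_token" | "finish" | "stop" | "fail" | "retry" | "reset";

// Allowed transitions; any event not listed for a state is ignored.
const TRANSITIONS: Record<ChatState, Partial<Record<ChatEvent, ChatState>>> = {
  idle: { submit: "submitted" },
  submitted: { first_token: "streaming", fail: "error" },
  streaming: { finish: "complete", stop: "complete", fail: "error" },
  error: { retry: "submitted", reset: "idle" },
  complete: { submit: "submitted", reset: "idle" },
};

function nextState(state: ChatState, event: ChatEvent): ChatState {
  return TRANSITIONS[state][event] ?? state;
}
```

A function of this shape can be passed directly to React's `useReducer`, with each state mapped to the UI listed in the table.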

See AI UX Patterns for detailed UI patterns.

Scaling Trajectory

MVP (0-1K users)

  • Single server, direct LLM API calls
  • Stripe basic billing
  • Langfuse free tier for monitoring
  • SQLite or Postgres for everything

Growth (1K-10K users)

  • Add job queue (BullMQ/Redis)
  • Caching layer (Redis for responses, prompt cache for LLM)
  • CDN for static assets
  • Helicone or Langfuse for cost monitoring
  • Per-tenant rate limits

Scale (10K-100K+ users)

  • Multi-region deployment
  • Model routing with provider fallback chains
  • Dedicated vector DB cluster
  • Enterprise observability (Arize/Datadog)
  • Cost-aware autoscaling
  • Multi-tenant data isolation audit
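As an illustration of a provider fallback chain, the sketch below tries each provider in order and returns the first success. The `Provider` interface is hypothetical; a real router would wrap actual model SDK calls and would likely add timeouts and retry limits per provider.

```typescript
interface Provider {
  name: string;
  complete: (prompt: string) => Promise<string>;
}

interface RoutedCompletion {
  provider: string; // which provider actually answered
  text: string;
}

// Try providers in priority order; collect errors for debugging if all fail.
async function completeWithFallback(
  prompt: string,
  chain: Provider[],
): Promise<RoutedCompletion> {
  const errors: string[] = [];
  for (const provider of chain) {
    try {
      const text = await provider.complete(prompt);
      return { provider: provider.name, text };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

Returning the provider name alongside the text keeps cost attribution accurate when a request is served by a fallback.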

Real Architecture References

| Product | Key Architecture Decision |
| --- | --- |
| Vercel v0 | AI SDK + streaming RSC (React Server Components) + generative UI |
| Cursor | Custom streaming protocol, aggressive client-side caching, speculative execution |
| Intercom Fin | RAG over help center content, human handoff detection, resolution tracking |
| Jasper | Multi-model routing, template-based generation, brand voice fine-tuning |

System Design Interview Checklist

When designing an AI product in an interview:

  • How do users interact with AI? (Chat, inline, background, agent)
  • What model(s) and why? (Cost/quality trade-off, routing)
  • Where does knowledge come from? (RAG, fine-tuning, prompt)
  • How do you handle failure? (Fallback, retry, human escalation)
  • How do you measure success? (AI Product Metrics)
  • How do you control costs? (AI Cost Optimization)
  • How do you monitor quality post-launch? (LLM Observability and Monitoring)
  • How do you handle multi-tenancy? (Data, cost, config isolation)
  • How do you bill for AI usage? (Credits, usage-based, flat rate)
  • How do you handle security? (LLM Security and Guardrails)
