Multi-Persona RAG Platform

An expert's voice, grounded in their own work.

We turned one expert's lifetime of published work, from video and podcasts to articles and decks, into a chat avatar that sounds like them, thinks in their frameworks, and stays in character. Ten experts went live on the same platform, and every release gets graded before it ships.

4 months · Full-stack AI product · RAG · Evaluation · Launched Mar 2026
The Brief

Ten experts. One platform. Every answer grounded in their own work.

The client wanted to bring ten experts to life as chat avatars. Each one had to sound like the real person, answer only from their own body of work, and hold up as a real product at scale. We built one platform for all ten, with quality checks in place before anything went live.

Why most persona chatbots fall short

Most of them sound generic, drift off topic, and nobody can prove they're actually good. Bigger models don't fix that. What does: a persona built from the expert's own content, retrieval that leans on their most authoritative sources, citations verified before the answer streams, and a quality score that decides whether each release is ready to go live.

The scale

The first expert alone brought a lifetime of published work: years of video, podcasts, articles, decks, spreadsheets, and images. We processed all of it cleanly, organized by topic and source type, and made it searchable behind a chat experience that streams answers in seconds. Each persona runs on its own isolated knowledge base, so no persona ever sees another's content.

Private client · 4 months · Shipped Mar 2026 · Production AI · 10 personas live
  • 12,886 pages of expert content indexed: 1,886 files across 7+ formats, processed cleanly at a 98% rate
  • 70+ hours of video and audio transcribed: every talk, every podcast, every interview, speaker-labeled
  • 90% quality score per release: graded on a 71-question rubric with 12 metrics per answer
  • 10 experts live on one shared platform: each answering only from their own body of work
How we shipped it

Built the platform once. Shipped ten personas on top of it.

What we did

  • A production RAG platform with multi-modal ingestion, retrieval, and streaming chat that shows live citations
  • A persona engine that builds each expert's voice from their own content, not from hand-crafted prompts
  • An evaluation system that grades every release on a 71-question rubric before it goes live
  • Shared infrastructure that scaled from one persona to ten without re-engineering
  • A persona builder and operations dashboard so the client's team can run the platform day to day
  • Dev, staging, and production environments on a single shared cache, so no paid extraction ever runs twice
  • Containerized deployment with reliable ingestion at scale and full observability on latency and cost

Our process

Phase 1: Core platform and first persona
2 weeks

We solved the hard problems once. The shared infrastructure went up against the first expert's complete body of work, and we took every piece end to end: ingestion, retrieval, streaming chat, and the evaluation harness that gates every release.

Phase 2: Nine personas in parallel
1 week

Then we stood up nine more experts on the same core. Each one got its own conditioning and quality pass, and the platform absorbed the variance so every addition cost days, not weeks.

Phase 3: Product integration and launch
1 week

We connected the platform to the client's product surfaces, ran the full evaluation regression across all ten personas, and shipped to production.

Services covered

Retrieval-augmented generation · Multi-modal ingestion pipelines · Vector search infrastructure · LLM evaluation and reporting · API and frontend integration · Containerized production deploy
Under the hood

A stack built for multi-modal ingestion, strict grounding, and measurable quality.

Frontend
  • Next.js, React, and TypeScript
  • Persona builder workspace
  • Jobs dashboard with usage and cost analytics
  • Streaming chat UI with live inline citations
  • Public share pages for distribution
API
  • FastAPI backend on Python
  • Reliable ingestion workers with retry and recovery
  • Persona-scoped retrieval and streaming chat endpoints
  • Evaluation runner with one-click PDF reports
  • Secure auth and session management
Data
  • MongoDB for content, conversations, personas, and analytics
  • Qdrant for vector search, with one isolated collection per persona
  • AWS S3 for source media, transcripts, and generated reports
  • Shared cache layer, so expensive extractions run once and get reused everywhere
AI
  • OpenAI GPT models for semantic understanding and chat
  • OpenAI embeddings for high-dimensional vector search
  • AssemblyAI for speaker-labeled audio and video transcription
  • Azure Document Intelligence for high-fidelity PDF OCR
  • Rerank models for relevance gating on retrieval
  • Local OCR fallback for image-embedded text extraction
Ingestion sources
  • YouTube videos, playlists, and entire channels
  • Loom videos, web articles, and full website sitemaps
  • PDF, PowerPoint, Word, and Excel extractors
  • Audio files and images with built-in OCR
  • Automatic deduplication, so the same content never gets processed twice
Retrieval pipeline
  • Queries expanded and topic-shift aware
  • Hybrid search using vector similarity and lexical matching
  • Rerank models gate relevance before scoring
  • Authoritative sources weighted higher than secondary ones
  • Low-confidence fallback says 'not enough evidence' instead of guessing
  • Citation placement verified against sources before the answer streams

Deployment pipeline

Build
Docker Compose (dev and prod stacks) • Isolated environments
Dev, staging, and production run consistent containers with their own data stores and a shared extraction cache.
Test
71-question evaluation regression • Versioned PDF quality reports
Every persona release is benchmarked before it ships. A persona that regresses doesn't go live.
Deploy
Supervised worker processes • Auto-recovery and scaling • Observability on latency and cost
Production workers recover automatically from stalls, and the shared cache means staging never repeats an extraction that production has already paid for.

Stack summary

How each persona is built
  • Corpus-synthesized profile
    Voice, frameworks, signature phrases, and guardrails all come from the expert's own content, not from hand-crafted prompts.
  • Source selection
    The platform picks the training examples that best represent how the expert actually thinks and talks.
  • Intent-aware routing
    Incoming questions are classified so the right evidence and tone show up in the answer.
  • Versioned and auto-upgrading
    Every deployment runs the current canonical profile on startup, so the live persona never drifts from the latest training.
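Intent-aware routing can be sketched as a lookup from question type to evidence pool and tone. In the real platform an LLM does the classification; here a keyword heuristic stands in so the routing logic itself is visible. All category and field names are hypothetical.

```python
# Toy sketch of intent-aware routing. A keyword heuristic stands in for
# the LLM classifier; the routes and their contents are illustrative.
ROUTES = {
    "how_to":     {"tone": "coaching", "sources": ["frameworks", "talks"]},
    "definition": {"tone": "direct",   "sources": ["books", "essays"]},
    "opinion":    {"tone": "personal", "sources": ["interviews", "podcasts"]},
}

def classify(question: str) -> str:
    q = question.lower()
    if q.startswith(("how do", "how can", "how should")):
        return "how_to"
    if q.startswith(("what is", "what does", "define")):
        return "definition"
    return "opinion"

def route(question: str) -> dict:
    # The chosen route decides which evidence pools are searched
    # and what tone the persona answers in.
    return ROUTES[classify(question)]
```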
How it stays grounded
  • Authoritative-source weighting
    The expert's primary content (books, talks, flagship essays) outweighs secondary and third-party sources during retrieval.
  • Attribution repair
    Every citation is verified against the retrieved sources before the answer reaches the user. Misattribution never makes it to the screen.
  • Low-confidence fallback
    If the evidence isn't strong enough, the persona says so. It never fills the gap with a guess.
  • Persona-isolated knowledge
    Each expert has their own vector collection. No persona ever sees another's content.
How we prove quality every release
  • 71-question evaluation rubric
    A standardized benchmark covers real-world scenarios, edge cases, tone, and more, and it runs against every persona release.
  • 12 LLM-graded metrics per answer
    Persona fidelity, grounding integrity, source accuracy, tone, specificity, and more, all scored on a 0 to 1 scale.
  • Parallel regression in minutes
    A full rubric sweep finishes in single-digit minutes, so quality checks never become a release bottleneck.
  • One-click stakeholder reports
    Every run produces a client-ready PDF with aggregate scores, per-question breakdowns, and full transcripts.
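The release gate above reduces to a simple aggregation: average the per-metric grades for each answer, average across the rubric, and block any release below the bar. The metric names and the 0.9 threshold below mirror the numbers in this case study, but the code itself is an illustrative sketch.

```python
# Sketch of the quality release gate. Metric names are examples; each
# answer's metrics are LLM-graded on a 0..1 scale.
QUALITY_BAR = 0.90

def answer_score(metrics: dict[str, float]) -> float:
    # One answer: mean of its graded metrics (persona fidelity,
    # grounding integrity, source accuracy, tone, ...).
    return sum(metrics.values()) / len(metrics)

def release_gate(graded_answers: list[dict[str, float]]) -> tuple[float, bool]:
    # Whole release: mean across all rubric answers; a score under
    # the bar blocks the release.
    overall = sum(answer_score(m) for m in graded_answers) / len(graded_answers)
    return overall, overall >= QUALITY_BAR
```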

Key integrations

OpenAI · Qdrant Cloud · MongoDB Atlas · AWS S3 · AssemblyAI · Azure Document Intelligence · Cohere Rerank · Docker
Outcome

Ten personas live. A platform the team can operate. Quality that ships on schedule.

A new product line for the client. Ten expert personas, all live on one platform, each answering only from their own body of work.

A repeatable process. Personas two through ten shipped in days each, because the core platform absorbed the variance.

Release confidence. No persona goes live without clearing the quality bar, and regressions block the release.

Cost-efficient operation. Shared infrastructure keeps dev, staging, and production on one cache, and every AI call gets attributed to a cost line the team can see.

RAG · Hybrid Retrieval · Vector Search · LLM Evaluation · Multi-Persona · Next.js · FastAPI · Qdrant · MongoDB · OpenAI
Feature highlights
  • Multi-modal ingestion: video, podcasts, articles, PDFs, decks, spreadsheets, audio, and images
  • Personas that sound like the expert, not a generic assistant
  • Inline citations on every answer, linked to the exact source
  • Attribution verified before the answer streams, so misattribution never reaches the user
  • The persona says 'not enough evidence' instead of guessing
  • Conversation memory with rolling context and cross-session continuity at the persona level
  • Auto-generated follow-up questions keep conversations flowing
  • Operations dashboard with latency, cost, and job health visible to the client's team

Innovations

Personas synthesized from content

Voice, frameworks, signature phrases, and guardrails all come from what the expert has actually said and written, not from hand-crafted prompts. Every deployment refreshes the persona profile automatically, so the live experience never drifts from the latest training.

Citations verified before the answer streams

The platform generates the answer in full, verifies every citation against the actual source, and only then streams it to the user. Misattribution and hallucinated sources never reach the screen.

Every release graded against 71 questions

A standardized benchmark tests each persona across real-world scenarios, edge cases, tone, and more. Results land in a client-ready PDF. A persona that regresses doesn't ship.

Shared cache across environments

Transcripts, OCR output, and extractions are cached by content hash and shared across dev, staging, and production. No expensive processing ever runs twice, so ingestion costs drop over time.

Why it matters
  • Quality is a release gate, not a promise. The client knows exactly where each persona stands before it goes live.
  • Adding a persona became a days-long exercise instead of a months-long one, because the platform scales with the business.
  • Every dollar of AI spend is attributable to a specific operation, so the client can manage cost without hunting through logs.
How content flows into the platform
  • Fetch the source, whether it's video, PDF, article, deck, or audio
  • Extract the content using transcription, OCR, or structured parsing, whichever fits
  • Classify the source as authoritative, secondary, or contextual
  • Annotate with topics, frameworks, claims, and signature language
  • Chunk and embed by breaking content into thought-sized units and indexing for search
  • Store and isolate, so each persona gets its own knowledge base
  • Validate by sanity-checking every new source with a sample retrieval before it goes live
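The seven-step flow above can be sketched as a pipeline. Every stage here is a trivial stand-in for the real worker (transcription, OCR, topic tagging, embedding), so the hand-offs between steps, and the per-persona isolation, are what's on display; all names are hypothetical.

```python
# Illustrative pipeline for the ingestion flow; each stage is a toy
# stand-in for the real worker. Embedding is omitted for brevity.
from dataclasses import dataclass, field

@dataclass
class Source:
    persona_id: str
    kind: str            # "video", "pdf", "article", "deck", "audio", ...
    raw: bytes
    authority: str = ""  # "authoritative" | "secondary" | "contextual"
    topics: list[str] = field(default_factory=list)
    chunks: list[str] = field(default_factory=list)

# Per-persona knowledge bases: isolation is just a dict key here;
# in production, one vector collection per persona.
KNOWLEDGE: dict[str, list[str]] = {}

def extract(src: Source) -> str:
    return src.raw.decode()          # stand-in for transcription/OCR/parsing

def classify(src: Source) -> str:
    # Primary content (books, talks, essays) outranks everything else.
    return "authoritative" if src.kind in ("book", "talk", "essay") else "secondary"

def chunk(text: str, size: int = 60) -> list[str]:
    # Thought-sized units; the real splitter is semantic, this is fixed-width.
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(src: Source) -> Source:
    text = extract(src)                              # 2. extract
    src.authority = classify(src)                    # 3. classify
    src.topics = [w for w in text.split() if w.istitle()]  # 4. annotate (toy)
    src.chunks = chunk(text)                         # 5. chunk
    KNOWLEDGE.setdefault(src.persona_id, []).extend(src.chunks)  # 6. store, isolated
    assert src.chunks[0] in KNOWLEDGE[src.persona_id]            # 7. validate
    return src
```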
What every release is tested against
  • Core operating system: how the expert diagnoses, prioritizes, and plans
  • Edge cases and persona resilience: vague questions, pushback, identity challenges
  • Framework depth and attribution: deep knowledge with correct sourcing
  • Natural coaching flow: messy opener to concrete action plan
  • Real-world scenarios: the kind of questions users actually ask
  • Topic-shift and conversation agility: staying useful across thread switches