Multi-Persona RAG Platform

An expert's voice, grounded in their own work.

We turned one expert's lifetime of published work, from video and podcasts to articles and decks, into a chat avatar that sounds like them, thinks in their frameworks, and stays in character. Ten experts went live on the same platform, and every release gets graded before it ships.

4 months · Full-stack AI product · RAG · Evaluation · Launched Mar 2026
The Brief

Ten experts. One platform. Every answer grounded in their own work.

The client wanted to bring ten experts to life as chat avatars. Each one had to sound like the real person, answer only from their own body of work, and hold up as a real product at scale. We built one platform for all ten, with quality checks in place before anything went live.

Why most persona chatbots fall short

Most of them sound generic, drift off topic, and nobody can prove they're actually good. Bigger models don't fix that. What does: a persona built from the expert's own content, retrieval that leans on their most authoritative sources, citations verified before the answer streams, and a quality score that decides whether each release is ready to go live.

The scale

The first expert alone brought a lifetime of published work: years of video, podcasts, articles, decks, spreadsheets, and images. We processed all of it cleanly, organized by topic and source type, and made it searchable behind a chat experience that streams answers in seconds. Each persona runs on its own isolated knowledge base, so no persona ever sees another's content.

Private client · 4 months · Shipped Mar 2026 · Production AI · 10 personas live
  • 12,886 pages of expert content indexed: 1,886 files across 7+ formats, processed cleanly at a 98% rate
  • 70+ hours of video and audio transcribed: every talk, every podcast, every interview, speaker-labeled
  • 90% quality score per release: graded on a 71-question rubric with 12 metrics per answer
  • 10 experts live on one shared platform: each answering only from their own body of work
How we shipped it

Built the platform once. Shipped ten personas on top of it.

What we did

  • A production RAG platform with multi-modal ingestion, retrieval, and streaming chat that shows live citations
  • A persona engine that builds each expert's voice from their own content, not from hand-crafted prompts
  • An evaluation system that grades every release on a 71-question rubric before it goes live
  • Shared infrastructure that scaled from one persona to ten without re-engineering
  • A persona builder and operations dashboard so the client's team can run the platform day to day
  • Dev, staging, and production environments on a single shared cache, so no paid extraction ever runs twice
  • Containerized deployment with reliable ingestion at scale and full observability on latency and cost

Our process

Phase 1: Core platform and first persona
2 weeks

We solved the hard problems once. The shared infrastructure went up against the first expert's complete body of work, and we took every piece end to end: ingestion, retrieval, streaming chat, and the evaluation harness that gates every release.

Phase 2: Nine personas in parallel
1 week

Then we stood up nine more experts on the same core. Each one got its own conditioning and quality pass, and the platform absorbed the variance so every addition cost days, not weeks.

Phase 3: Product integration and launch
1 week

We connected the platform to the client's product surfaces, ran the full evaluation regression across all ten personas, and shipped to production.

Services covered

Retrieval-augmented generation · Multi-modal ingestion pipelines · Vector search infrastructure · LLM evaluation and reporting · API and frontend integration · Containerized production deploy
Under the hood

A stack built for multi-modal ingestion, strict grounding, and measurable quality.

Frontend
  • Next.js, React, and TypeScript
  • Persona builder workspace
  • Jobs dashboard with usage and cost analytics
  • Streaming chat UI with live inline citations
  • Public share pages for distribution
API
  • FastAPI backend on Python
  • Reliable ingestion workers with retry and recovery
  • Persona-scoped retrieval and streaming chat endpoints
  • Evaluation runner with one-click PDF reports
  • Secure auth and session management
Data
  • MongoDB for content, conversations, personas, and analytics
  • Qdrant for vector search, with one isolated collection per persona
  • AWS S3 for source media, transcripts, and generated reports
  • Shared cache layer, so expensive extractions run once and get reused everywhere
AI
  • OpenAI GPT models for semantic understanding and chat
  • OpenAI embeddings for high-dimensional vector search
  • AssemblyAI for speaker-labeled audio and video transcription
  • Azure Document Intelligence for high-fidelity PDF OCR
  • Rerank models for relevance gating on retrieval
  • Local OCR fallback for image-embedded text extraction
Ingestion sources
  • YouTube videos, playlists, and entire channels
  • Loom videos, web articles, and full website sitemaps
  • PDF, PowerPoint, Word, and Excel extractors
  • Audio files and images with built-in OCR
  • Automatic deduplication, so the same content never gets processed twice
Retrieval pipeline
  • Queries expanded and topic-shift aware
  • Hybrid search using vector similarity and lexical matching
  • Rerank models gate relevance before scoring
  • Authoritative sources weighted higher than secondary ones
  • Low-confidence fallback says 'not enough evidence' instead of guessing
  • Citation placement verified against sources before the answer streams

Deployment pipeline

Build
Docker Compose (dev and prod stacks) • Isolated environments
Dev, staging, and production run consistent containers with their own data stores and a shared extraction cache.
Test
71-question evaluation regression • Versioned PDF quality reports
Every persona release is benchmarked before it ships. A persona that regresses doesn't go live.
Deploy
Supervised worker processes • Auto-recovery and scaling • Observability on latency and cost
Production workers recover automatically from stalls, and the shared cache means staging never repeats an extraction that production has already paid for.

Stack summary

How each persona is built
  • Corpus-synthesized profile
    Voice, frameworks, signature phrases, and guardrails all come from the expert's own content, not from hand-crafted prompts.
  • Source selection
    The platform picks the training examples that best represent how the expert actually thinks and talks.
  • Intent-aware routing
    Incoming questions are classified so the right evidence and tone show up in the answer.
  • Versioned and auto-upgrading
    Every deployment runs the current canonical profile on startup, so the live persona never drifts from the latest training.
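Intent-aware routing can be sketched as a lookup from question type to evidence pool and tone. In the real platform an LLM does the classification; here a keyword heuristic stands in so the routing logic itself is visible. All category and field names are hypothetical.

```python
# Toy sketch of intent-aware routing. A keyword heuristic stands in for
# the LLM classifier; the routes and their contents are illustrative.
ROUTES = {
    "how_to":     {"tone": "coaching", "sources": ["frameworks", "talks"]},
    "definition": {"tone": "direct",   "sources": ["books", "essays"]},
    "opinion":    {"tone": "personal", "sources": ["interviews", "podcasts"]},
}

def classify(question: str) -> str:
    q = question.lower()
    if q.startswith(("how do", "how can", "how should")):
        return "how_to"
    if q.startswith(("what is", "what does", "define")):
        return "definition"
    return "opinion"

def route(question: str) -> dict:
    # The chosen route decides which evidence pools are searched
    # and what tone the persona answers in.
    return ROUTES[classify(question)]
```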
How it stays grounded
  • Authoritative-source weighting
    The expert's primary content (books, talks, flagship essays) outweighs secondary and third-party sources during retrieval.
  • Attribution repair
    Every citation is verified against the retrieved sources before the answer reaches the user. Misattribution never makes it to the screen.
  • Low-confidence fallback
    If the evidence isn't strong enough, the persona says so. It never fills the gap with a guess.
  • Persona-isolated knowledge
    Each expert has their own vector collection. No persona ever sees another's content.
How we prove quality every release
  • 71-question evaluation rubric
    A standardized benchmark covers real-world scenarios, edge cases, tone, and more, and it runs against every persona release.
  • 12 LLM-graded metrics per answer
    Persona fidelity, grounding integrity, source accuracy, tone, specificity, and more, all scored on a 0 to 1 scale.
  • Parallel regression in minutes
    A full rubric sweep finishes in single-digit minutes, so quality checks never become a release bottleneck.
  • One-click stakeholder reports
    Every run produces a client-ready PDF with aggregate scores, per-question breakdowns, and full transcripts.
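The release gate above reduces to a simple aggregation: average the per-metric grades for each answer, average across the rubric, and block any release below the bar. The metric names and the 0.9 threshold below mirror the numbers in this case study, but the code itself is an illustrative sketch.

```python
# Sketch of the quality release gate. Metric names are examples; each
# answer's metrics are LLM-graded on a 0..1 scale.
QUALITY_BAR = 0.90

def answer_score(metrics: dict[str, float]) -> float:
    # One answer: mean of its graded metrics (persona fidelity,
    # grounding integrity, source accuracy, tone, ...).
    return sum(metrics.values()) / len(metrics)

def release_gate(graded_answers: list[dict[str, float]]) -> tuple[float, bool]:
    # Whole release: mean across all rubric answers; a score under
    # the bar blocks the release.
    overall = sum(answer_score(m) for m in graded_answers) / len(graded_answers)
    return overall, overall >= QUALITY_BAR
```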

Key integrations

OpenAI · Qdrant Cloud · MongoDB Atlas · AWS S3 · AssemblyAI · Azure Document Intelligence · Cohere Rerank · Docker
Outcome

Ten personas live. A platform the team can operate. Quality that ships on schedule.

A new product line for the client. Ten expert personas, all live on one platform, each answering only from their own body of work.

A repeatable process. Personas two through ten shipped in days each, because the core platform absorbed the variance.

Release confidence. No persona goes live without clearing the quality bar, and regressions block the release.

Cost-efficient operation. Shared infrastructure keeps dev, staging, and production on one cache, and every AI call gets attributed to a cost line the team can see.

RAG · Hybrid Retrieval · Vector Search · LLM Evaluation · Multi-Persona · Next.js · FastAPI · Qdrant · MongoDB · OpenAI
Feature highlights
  • Multi-modal ingestion: video, podcasts, articles, PDFs, decks, spreadsheets, audio, and images
  • Personas that sound like the expert, not a generic assistant
  • Inline citations on every answer, linked to the exact source
  • Attribution verified before the answer streams, so misattribution never reaches the user
  • The persona says 'not enough evidence' instead of guessing
  • Conversation memory with rolling context and cross-session continuity at the persona level
  • Auto-generated follow-up questions keep conversations flowing
  • Operations dashboard with latency, cost, and job health visible to the client's team

Innovations

Personas synthesized from content

Voice, frameworks, signature phrases, and guardrails all come from what the expert has actually said and written, not from hand-crafted prompts. Every deployment refreshes the persona profile automatically, so the live experience never drifts from the latest training.

Citations verified before the answer streams

The platform generates the answer in full, verifies every citation against the actual source, and only then streams it to the user. Misattribution and hallucinated sources never reach the screen.

Every release graded against 71 questions

A standardized benchmark tests each persona across real-world scenarios, edge cases, tone, and more. Results land in a client-ready PDF. A persona that regresses doesn't ship.

Shared cache across environments

Transcripts, OCR output, and extractions are cached by content hash and shared across dev, staging, and production. No expensive processing ever runs twice, so ingestion costs drop over time.

Why it matters
  • Quality is a release gate, not a promise. The client knows exactly where each persona stands before it goes live.
  • Adding a persona became a days-long exercise instead of a months-long one, because the platform scales with the business.
  • Every dollar of AI spend is attributable to a specific operation, so the client can manage cost without hunting through logs.
How content flows into the platform
  • Fetch the source, whether it's video, PDF, article, deck, or audio
  • Extract the content using transcription, OCR, or structured parsing, whichever fits
  • Classify the source as authoritative, secondary, or contextual
  • Annotate with topics, frameworks, claims, and signature language
  • Chunk and embed by breaking content into thought-sized units and indexing for search
  • Store and isolate, so each persona gets its own knowledge base
  • Validate by sanity-checking every new source with a sample retrieval before it goes live
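The seven-step flow above can be sketched as a pipeline. Every stage here is a trivial stand-in for the real worker (transcription, OCR, topic tagging, embedding), so the hand-offs between steps, and the per-persona isolation, are what's on display; all names are hypothetical.

```python
# Illustrative pipeline for the ingestion flow; each stage is a toy
# stand-in for the real worker. Embedding is omitted for brevity.
from dataclasses import dataclass, field

@dataclass
class Source:
    persona_id: str
    kind: str            # "video", "pdf", "article", "deck", "audio", ...
    raw: bytes
    authority: str = ""  # "authoritative" | "secondary" | "contextual"
    topics: list[str] = field(default_factory=list)
    chunks: list[str] = field(default_factory=list)

# Per-persona knowledge bases: isolation is just a dict key here;
# in production, one vector collection per persona.
KNOWLEDGE: dict[str, list[str]] = {}

def extract(src: Source) -> str:
    return src.raw.decode()          # stand-in for transcription/OCR/parsing

def classify(src: Source) -> str:
    # Primary content (books, talks, essays) outranks everything else.
    return "authoritative" if src.kind in ("book", "talk", "essay") else "secondary"

def chunk(text: str, size: int = 60) -> list[str]:
    # Thought-sized units; the real splitter is semantic, this is fixed-width.
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(src: Source) -> Source:
    text = extract(src)                              # 2. extract
    src.authority = classify(src)                    # 3. classify
    src.topics = [w for w in text.split() if w.istitle()]  # 4. annotate (toy)
    src.chunks = chunk(text)                         # 5. chunk
    KNOWLEDGE.setdefault(src.persona_id, []).extend(src.chunks)  # 6. store, isolated
    assert src.chunks[0] in KNOWLEDGE[src.persona_id]            # 7. validate
    return src
```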
What every release is tested against
  • Core operating system: how the expert diagnoses, prioritizes, and plans
  • Edge cases and persona resilience: vague questions, pushback, identity challenges
  • Framework depth and attribution: deep knowledge with correct sourcing
  • Natural coaching flow: messy opener to concrete action plan
  • Real-world scenarios: the kind of questions users actually ask
  • Topic-shift and conversation agility: staying useful across thread switches