Inside a Production-Grade RAG Architecture: Retrieval, Reranking, Guardrails, and Evals
Production-grade RAG is not one model call with context pasted above a question. It is an architecture that decides what evidence can be trusted, which answer should be generated, when to refuse, and how to improve over time.
A strong architecture separates the system into stages: retrieval, reranking, context assembly, generation, guardrails, and evaluation. Each stage should produce data that can be logged, tested, and improved independently.
     query
       │
       ▼
┌─────────────┐   recall    ┌──────────────┐  precision  ┌──────────────┐
│  RETRIEVAL  │────────────▶│  RERANKING   │────────────▶│   CONTEXT    │
│  lexical +  │   ~30-60    │ cross-encoder│   top 5-8   │   ASSEMBLY   │
│ vector + ACL│ candidates  │ + threshold  │  evidence   │ cite + budget│
└─────────────┘             └──────────────┘             └──────┬───────┘
                                                                │
       ┌───────────────────────────────────────────────┐        ▼
       │ EVALS (offline + per-answer trace)            │  ┌──────────┐
       │ recall · faithfulness · refusal accuracy      │◀─│GUARDRAILS│
       └───────────────────────────────────────────────┘  │ gate +   │
                                                          │ answer/  │
                                                          │ refuse   │
                                                          └──────────┘
This article is the implementation blueprint for that diagram: what each stage decides, where the latency budget goes, and how much architecture is worth building at each stage of maturity.
Retrieval finds candidates
Retrieval should optimize for recall. The first stage is allowed to return more candidates than the model can use because reranking will narrow them down.
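type RetrievalCandidate = {
  chunkId: string;
  text: string;
  source: string;
  score: number;
};

async function retrieveCandidates(query: string, userGroups: string[]) {
  // The two searches are independent, so run them concurrently.
  // userGroups lets each search filter out chunks the user may not read.
  const [lexical, semantic] = await Promise.all([
    keywordSearch(query, userGroups, 30),
    vectorSearch(query, userGroups, 30),
  ]);
  // A chunk found by both searches should appear only once.
  return dedupeByChunkId([...lexical, ...semantic]);
}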
Hybrid retrieval is valuable because enterprise language is messy. Users search with acronyms, product names, policy names, ticket IDs, and informal wording. Lexical and semantic search catch different failure modes.
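The `dedupeByChunkId` helper used in `retrieveCandidates` is not defined in this article. One way to fill it in, offered as an assumption rather than as the intended implementation, is reciprocal rank fusion (RRF). Lexical and vector scores live on different scales, so RRF merges by rank position instead of raw score. The name `fuseByRank` and the constant `k = 60` are illustrative choices.

```typescript
type RetrievalCandidate = {
  chunkId: string;
  text: string;
  source: string;
  score: number;
};

// Hypothetical helper: merges ranked lists with reciprocal rank fusion.
// Each input list must already be sorted best-first. k dampens the weight
// of top ranks; 60 is a commonly used default, not a tuned value.
function fuseByRank(lists: RetrievalCandidate[][], k = 60): RetrievalCandidate[] {
  const fused = new Map<string, RetrievalCandidate>();
  for (const list of lists) {
    list.forEach((candidate, index) => {
      const contribution = 1 / (k + index + 1); // ranks are 1-based
      const existing = fused.get(candidate.chunkId);
      if (existing) {
        // Found by more than one search: contributions add up.
        existing.score += contribution;
      } else {
        // Copy so the caller's objects are not mutated; the raw score is
        // replaced by the fused score.
        fused.set(candidate.chunkId, { ...candidate, score: contribution });
      }
    });
  }
  return Array.from(fused.values()).sort((a, b) => b.score - a.score);
}
```

A chunk that both searches return rises to the top even if neither search ranked it first, which is the behavior hybrid retrieval is after.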
Reranking chooses evidence
Reranking should optimize for precision. The reranker receives the candidate list and returns the chunks most likely to answer the question. This step is where the system should become selective.
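async function selectEvidence(query: string, candidates: RetrievalCandidate[]) {
  // rerank() is assumed to return items sorted best-first, each with a relevanceScore.
  const ranked = await rerank(query, candidates);
  return ranked
    .filter((item) => item.relevanceScore >= 0.62) // drop weak evidence
    .slice(0, 8); // cap how many chunks can reach the prompt
}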
The threshold matters. If it is too low, the model receives weak evidence and improvises. If it is too high, the system refuses too often. The right value comes from eval data, not intuition.
The specific numbers in this article — 0.62, 0.72, 0.58 — are illustrative, not portable. Relevance scores depend on the reranker model, the embedding model, and how your index normalizes them, so a “good” threshold in one system is meaningless in another. Treat them as variables to calibrate against your own eval set, not constants to copy.
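Calibrating against an eval set can be as simple as sweeping candidate thresholds over reranker scores that reviewers have labeled relevant or not. The sketch below picks the threshold with the best F1. The `LabeledScore` type, the function name, and the choice of F1 as the objective are assumptions for illustration; a team that fears wrong answers more than refusals might weight precision more heavily.

```typescript
type LabeledScore = { relevanceScore: number; isRelevant: boolean };

// Hypothetical calibration helper: tries each candidate threshold and
// returns the one with the highest F1 on the labeled examples.
// Ties keep the earliest candidate in the list.
function pickThreshold(examples: LabeledScore[], candidates: number[]) {
  let best = { threshold: candidates[0], f1: -1 };
  for (const threshold of candidates) {
    let truePositives = 0;
    let falsePositives = 0;
    let falseNegatives = 0;
    for (const example of examples) {
      const kept = example.relevanceScore >= threshold;
      if (kept && example.isRelevant) truePositives++;
      else if (kept && !example.isRelevant) falsePositives++;
      else if (!kept && example.isRelevant) falseNegatives++;
    }
    // F1 = 2TP / (2TP + FP + FN); defined as 0 when nothing is relevant or kept.
    const denominator = 2 * truePositives + falsePositives + falseNegatives;
    const f1 = denominator === 0 ? 0 : (2 * truePositives) / denominator;
    if (f1 > best.f1) best = { threshold, f1 };
  }
  return best;
}
```

Rerun the sweep whenever the reranker or embedding model changes, since the score distribution changes with it.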
Reranking is also where much of the latency budget goes. A cross-encoder reranker is usually the most accurate option, but it scores every query-chunk pair in its own forward pass, so its cost grows with the number of candidates retrieval hands over. A practical pattern: retrieve wide, rerank only the top candidates, and reserve the cross-encoder for cases where precision matters more than the extra latency. If the end-to-end budget is tight, a lighter reranker followed directly by generation may beat a heavier one that pushes total response time past what users tolerate.
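To make the trade-off concrete, the sketch below works backward from an end-to-end latency target to the number of candidates the reranker can afford to score. It assumes rerank latency is roughly a fixed overhead plus a per-candidate cost, which batching on real hardware will only approximate. The `LatencyBudget` fields and both function names are hypothetical, and every number should come from your own measurements.

```typescript
// All fields are placeholders to be filled from measured timings.
type LatencyBudget = {
  totalMs: number; // end-to-end target the product can tolerate
  retrievalMs: number; // typical first-stage retrieval time
  generationMs: number; // typical model generation time
  rerankFixedMs: number; // per-call overhead of the reranker
  rerankPerCandidateMs: number; // marginal cost of scoring one candidate
};

// Assumes rerank latency is roughly rerankFixedMs + rerankPerCandidateMs * n.
function maxRerankCandidates(budget: LatencyBudget): number {
  const remainingMs =
    budget.totalMs - budget.retrievalMs - budget.generationMs - budget.rerankFixedMs;
  if (remainingMs <= 0 || budget.rerankPerCandidateMs <= 0) return 0;
  return Math.floor(remainingMs / budget.rerankPerCandidateMs);
}

// Keeps only as many first-stage candidates as the budget can afford.
// Assumes `score` is comparable across candidates (for example a fused rank
// score). An empty result means there is no room to rerank this request.
function shortlistForRerank<T extends { score: number }>(
  candidates: T[],
  budget: LatencyBudget,
): T[] {
  const limit = maxRerankCandidates(budget);
  return [...candidates].sort((a, b) => b.score - a.score).slice(0, limit);
}
```

With a 2000 ms target, 150 ms retrieval, 1500 ms generation, 50 ms of rerank overhead, and 10 ms per candidate, the shortlist is capped at 30 candidates.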
Context assembly controls the answer space
The prompt should receive compact, cited evidence. Each chunk should include source metadata so the final response can cite its claims.
Context assembly also needs a budget. If the system stuffs every retrieved chunk into the prompt, the model can over-focus on irrelevant text. A smaller set of high-quality evidence is usually better than a large set of noisy evidence.
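A minimal sketch of budgeted, cited assembly follows. It numbers each chunk so the model can cite it as `[1]`, `[2]`, and so on, and it stops adding chunks once a token budget is spent. The four-characters-per-token estimate is a rough assumption standing in for a real tokenizer, and `assembleContext` and `Evidence` are illustrative names.

```typescript
type Evidence = {
  chunkId: string;
  text: string;
  source: string;
  relevanceScore: number;
};

// Rough heuristic only; swap in the tokenizer that matches your model.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function assembleContext(evidence: Evidence[], maxTokens: number) {
  const included: Evidence[] = [];
  const blocks: string[] = [];
  let tokensUsed = 0;
  // Strongest evidence first, so the budget cuts the weakest chunks.
  const ordered = [...evidence].sort((a, b) => b.relevanceScore - a.relevanceScore);
  for (const item of ordered) {
    // The bracketed number is the citation label the answer should reuse.
    const block = `[${included.length + 1}] (source: ${item.source})\n${item.text}`;
    const cost = estimateTokens(block);
    // Stop at the first chunk that does not fit, keeping labels contiguous.
    if (tokensUsed + cost > maxTokens) break;
    blocks.push(block);
    included.push(item);
    tokensUsed += cost;
  }
  return { context: blocks.join("\n\n"), included, tokensUsed };
}
```

Returning `included` alongside the prompt text lets later stages map a citation label back to its chunk and log exactly which evidence the model saw.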
Guardrails decide whether to answer
Guardrails should run before and after generation.
Before generation, check whether the system has enough evidence. After generation, check whether each major claim is supported by citations.
function shouldAnswer(evidence: { relevanceScore: number }[]) {
  // Require at least two supporting chunks before answering.
  if (evidence.length < 2) return false;
  const bestScore = Math.max(...evidence.map((item) => item.relevanceScore));
  const averageScore =
    evidence.reduce((sum, item) => sum + item.relevanceScore, 0) / evidence.length;
  // One strong chunk plus a reasonable average; both thresholds are illustrative.
  return bestScore >= 0.72 && averageScore >= 0.58;
}
The refusal path is a product feature. A reliable system should be able to say that it does not have enough evidence.
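The post-generation check can start as a structural test: every sentence in the answer carries at least one citation marker, and every marker points at a chunk that was actually in the prompt. The sketch below assumes the model was told to cite with bracketed numbers such as `[2]` placed before the sentence's closing punctuation. That convention, the naive sentence splitter, and the function name are all assumptions. The check confirms that citations exist and resolve, not that the cited text supports the claim; that needs an entailment model or a judge model on top.

```typescript
type CitationCheck = {
  ok: boolean;
  uncitedSentences: string[];
  danglingCitations: number[];
};

function checkCitations(answer: string, evidenceCount: number): CitationCheck {
  // Naive splitter: a sentence is a run of text up to ., !, or ?.
  // Decimals and abbreviations will confuse it.
  const sentences = (answer.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);

  const uncitedSentences: string[] = [];
  const dangling = new Set<number>();
  for (const sentence of sentences) {
    const markers = sentence.match(/\[(\d+)\]/g) ?? [];
    if (markers.length === 0) uncitedSentences.push(sentence);
    for (const marker of markers) {
      const label = Number(marker.slice(1, -1));
      // Labels are 1-based and must refer to a chunk that was in the prompt.
      if (label < 1 || label > evidenceCount) dangling.add(label);
    }
  }
  const danglingCitations = Array.from(dangling).sort((a, b) => a - b);
  return {
    ok: uncitedSentences.length === 0 && danglingCitations.length === 0,
    uncitedSentences,
    danglingCitations,
  };
}
```

A failed check can trigger a regeneration, a refusal, or simply a flag in the trace, depending on how much risk the product can carry.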
Evals close the loop
Evaluation should test the full pipeline, not only the model. A useful eval case records the question, the expected behavior (answer or refuse), the sources a correct answer must cite, and the sources that must never appear, for example documents the asking user is not permitted to see.
type RagEvalCase = {
  question: string;
  expectedBehavior: "answer" | "refuse";
  requiredSources?: string[];
  forbiddenSources?: string[];
};
Run evals when documents change, prompts change, retrieval settings change, or models change. RAG quality is a moving target because every layer can shift.
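A sketch of how such cases might be scored against the whole pipeline follows. `PipelineResult`, `scoreEvalCase`, and `refusalAccuracy` are hypothetical names: the pipeline is assumed to report whether it answered or refused and which sources it cited. A case passes when the answer-or-refuse decision matches, every required source is cited, and no forbidden source is cited.

```typescript
type RagEvalCase = {
  question: string;
  expectedBehavior: "answer" | "refuse";
  requiredSources?: string[];
  forbiddenSources?: string[];
};

// Assumed output shape of one pipeline run; adapt to your own trace format.
type PipelineResult = {
  behavior: "answer" | "refuse";
  citedSources: string[];
};

function scoreEvalCase(evalCase: RagEvalCase, result: PipelineResult) {
  const failures: string[] = [];
  if (result.behavior !== evalCase.expectedBehavior) {
    failures.push(`expected ${evalCase.expectedBehavior}, got ${result.behavior}`);
  }
  const cited = new Set(result.citedSources);
  for (const source of evalCase.requiredSources ?? []) {
    if (!cited.has(source)) failures.push(`missing required source: ${source}`);
  }
  for (const source of evalCase.forbiddenSources ?? []) {
    if (cited.has(source)) failures.push(`cited forbidden source: ${source}`);
  }
  return { passed: failures.length === 0, failures };
}

// Share of cases where the answer-or-refuse decision matched expectations.
function refusalAccuracy(runs: { evalCase: RagEvalCase; result: PipelineResult }[]) {
  if (runs.length === 0) return 0;
  const correct = runs.filter(
    (run) => run.result.behavior === run.evalCase.expectedBehavior,
  ).length;
  return correct / runs.length;
}
```

Keeping the failure reasons as strings makes a CI report readable: a regression shows which source went missing or which refusal flipped, not only that a score dropped.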
Decide how much architecture to build now
Not every system needs every stage on day one. The mistake is building enterprise machinery for an MVP, or shipping an MVP architecture into an enterprise risk profile. Match the stage to where the product actually is:
| Decision | Cheap MVP | Production | Enterprise |
|---|---|---|---|
| Retrieval | Vector only | Hybrid (lexical + vector) | Hybrid + metadata filters + reranker |
| Evals | Manual test set | CI eval suite | Continuous evals + drift monitoring |
| Guardrails | Prompt refusal | Evidence gate | Permission-aware refusal |
| Observability | Logs | Trace per answer | Audit-ready trace + retention |
The right column is not “better” in the abstract — it is more expensive to build and operate. The decision is which risks justify which spend. A startup demoing to design partners can live in the left column. A system answering policy or compliance questions for paying customers cannot.
The production mindset is simple: make each stage visible. If retrieval, reranking, guardrails, and evals are observable, the team can improve the system deliberately instead of guessing why answers changed.