Retrieval-Augmented Generation (RAG): An Agent's Reference
A dense, no-fluff reference on RAG — what it is, the moving parts, the failure modes, and the decisions that actually matter when grounding a model in external knowledge.
TL;DR
RAG grounds a language model’s output in text fetched at inference time instead of relying only on the model’s parametric memory. A retriever selects relevant passages from an external corpus; those passages are concatenated into the prompt; the model generates an answer conditioned on them. RAG trades extra latency and system complexity for fresher, attributable, and more controllable knowledge. Most production failures are retrieval failures, not generation failures.
What RAG is (and is not)
- Is: a pattern for injecting non-parametric knowledge into a prompt at query time. The corpus can change without retraining the model.
- Is not: fine-tuning. Fine-tuning edits the model’s weights; RAG edits the model’s input. They are complementary, not substitutes.
- Is not: a guarantee of correctness. If the retriever surfaces wrong or irrelevant context, the model will often answer confidently from it.
The pipeline
1. Ingest: collect source documents.
2. Chunk: split documents into passages small enough to retrieve precisely but large enough to be self-contained.
3. Embed: map each chunk to a vector with an embedding model.
4. Index: store vectors in a vector index (e.g. HNSW) for approximate nearest-neighbor search.
5. Retrieve: embed the query, fetch the top-k nearest chunks (optionally hybrid with keyword/BM25).
6. Rerank (optional): reorder candidates with a cross-encoder for higher precision.
7. Augment: assemble retrieved chunks + the query into a prompt.
8. Generate: the model answers conditioned on the assembled context.
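Counting the steps above from 1 to 8: steps 2–6 are the retriever (2–4 run at index time, 5–6 run per query). Step 7 builds the prompt and step 8 is the generator. The retriever is where most quality lives.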
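The pipeline above can be sketched end to end in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, the "index" is a plain list scanned exhaustively rather than an HNSW graph, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Embed + index: a list of (chunk, vector) pairs, searched by brute force."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Embed the query with the same function and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def augment(query, chunks):
    """Assemble numbered context passages and the question into one prompt."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "HNSW is a graph-based index for approximate nearest-neighbor search.",
    "BM25 is a keyword ranking function based on term frequency.",
    "Fine-tuning edits the weights of a model.",
]
index = build_index(chunks)
top = retrieve(index, "What kind of index is HNSW?", k=2)
prompt = augment("What kind of index is HNSW?", top)
```

The resulting `prompt` string is what would be sent to the generator; the generation call itself is omitted because it depends on the model being used.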
Decisions that actually matter
Chunking
Chunk size is the single most consequential knob. Too small and chunks lose the context needed to be meaningful; too large and a single chunk dilutes the embedding and wastes context budget. Sensible defaults: a few hundred tokens per chunk with modest overlap (so a fact split across a boundary still survives in one chunk). Prefer splitting on semantic boundaries (headings, paragraphs) over fixed character counts.
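As a rough illustration of size and overlap, here is a minimal sliding-window chunker. It assumes the text has already been tokenized into a list (whitespace tokens are a stand-in for real model tokens), and the function name and default values are hypothetical. It is the fixed-size fallback; a semantic splitter would first cut on headings or paragraphs and apply a window like this only to oversized sections.

```python
def chunk_tokens(tokens, size=200, overlap=40):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = "a fact split across a boundary still survives in one chunk".split()
windows = chunk_tokens(tokens, size=5, overlap=2)
```

With `overlap=2`, the last two tokens of each window reappear at the start of the next one, so a short fact that straddles a boundary remains intact in at least one chunk.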
Embeddings
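Dense retrieval quality is upper-bounded by the embedding model: if the model maps a query and its answer far apart, no index setting will bring them back together. Match the embedding model to the domain and language. The same model must embed both documents (at index time) and queries (at run time). Vectors from different models live in different spaces, so similarity scores between them are meaningless; never mix models across the two.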
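One way to enforce the same-model rule is to store the embedding model's identifier with the index and reject any vector that claims a different one. The sketch below is an assumption about how such a guard might look; `VectorIndex`, `ModelMismatchError`, and the model name `embedder-v1` are all hypothetical, and the search is a brute-force dot product.

```python
from dataclasses import dataclass, field


class ModelMismatchError(ValueError):
    """Raised when a vector comes from a different embedding model than the index."""


@dataclass
class VectorIndex:
    model_id: str  # hypothetical identifier of the embedding model used at index time
    dim: int       # expected vector dimensionality
    rows: list = field(default_factory=list)  # (chunk_id, vector) pairs

    def _check(self, vector, model_id):
        if model_id != self.model_id:
            raise ModelMismatchError(f"index built with {self.model_id!r}, got {model_id!r}")
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, chunk_id, vector, model_id):
        self._check(vector, model_id)
        self.rows.append((chunk_id, vector))

    def search(self, query_vector, model_id, k=3):
        self._check(query_vector, model_id)

        def dot(row):
            return sum(q * v for q, v in zip(query_vector, row[1]))

        return [chunk_id for chunk_id, _ in sorted(self.rows, key=dot, reverse=True)[:k]]


index = VectorIndex(model_id="embedder-v1", dim=2)
index.add("chunk-a", [1.0, 0.0], model_id="embedder-v1")
index.add("chunk-b", [0.0, 1.0], model_id="embedder-v1")
nearest = index.search([0.9, 0.1], model_id="embedder-v1", k=1)
```

A dimension check alone is not enough, since two different models can share a dimensionality; the explicit identifier is what catches the mismatch.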
Dense vs. hybrid retrieval
Dense (vector) retrieval captures semantic similarity but can miss exact terms — names, codes, IDs, rare jargon. Keyword retrieval (BM25) nails exact matches but misses paraphrase. Hybrid retrieval (combine both, e.g. with reciprocal rank fusion) is a strong default and usually beats either alone.
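Reciprocal rank fusion needs only the rank positions from each retriever, not their raw scores, which is why it combines dense and BM25 results without score normalization. A minimal sketch follows; the function name is hypothetical, and the smoothing constant of 60 is a commonly used default rather than something this reference prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, smoothing=60):
    """Fuse ranked lists of chunk ids: score(d) = sum of 1 / (smoothing + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (smoothing + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense_ranking = ["a", "b", "c"]  # assumed output of a vector search
bm25_ranking = ["c", "a", "d"]   # assumed output of a keyword search
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
# fused == ["a", "c", "b", "d"]
```

Chunks that appear near the top of both lists ("a" and "c") outrank chunks that only one retriever found, which is the behavior that makes hybrid retrieval robust to either retriever's blind spots.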
Reranking
Top-k from a vector index is recall-oriented and noisy. A cross-encoder reranker scores each (query, chunk) pair jointly and reorders them, sharply improving precision in the few slots that reach the prompt. Retrieve broadly (large k), rerank, then keep only the top few.
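The retrieve-broadly, rerank, keep-few pattern can be sketched with the scorer passed in as a function. In a real system `score_pair` would call a cross-encoder; the `overlap_score` below is only a toy stand-in so the example runs without a model, and both names are hypothetical.

```python
def rerank(query, candidates, score_pair, keep=3):
    """Score each (query, chunk) pair jointly and keep the `keep` best chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    # Stable sort: chunks with equal scores keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


def overlap_score(query, chunk):
    """Toy scorer (NOT a cross-encoder): fraction of query terms found in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


candidates = [  # assumed output of a broad first-stage retrieval
    "billing invoices are sent by email",
    "to reset your password open the email link",
    "password rules require twelve characters",
]
best = rerank("reset password email", candidates, overlap_score, keep=2)
```

Because the scorer sees the query and the chunk together, it can be far more expensive per pair than vector search, which is why it runs only on the candidate set and not on the whole corpus.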
Context assembly and ordering
Context budget is finite and not uniformly used. Models attend most strongly to the beginning and end of the context and can neglect the middle (“lost in the middle”). Put the highest-relevance passages at the edges, not buried in the center. Fewer, higher-precision chunks generally beat stuffing the window.
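One simple way to place the strongest passages at the edges is to alternate between the front and the back of the context while walking the ranked list. This is a sketch of that heuristic under the assumption that the input is already sorted best-first; the function name is hypothetical.

```python
def order_for_edges(ranked):
    """Reorder a best-first list so top items sit at the start and end of the context."""
    front, back = [], []
    for position, item in enumerate(ranked):
        # Even positions (1st, 3rd, ...) fill the front; odd positions fill the back.
        (front if position % 2 == 0 else back).append(item)
    return front + back[::-1]


ranked = ["best", "second", "third", "fourth", "fifth"]
ordered = order_for_edges(ranked)
# ordered == ["best", "third", "fifth", "fourth", "second"]
```

The weakest passage lands in the middle, where neglect costs the least. Pruning the list before reordering still matters more than the ordering itself.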
Common failure modes
- Retrieval miss — the answer isn’t in the retrieved chunks. The model hallucinates or hedges. Fix the retriever, not the prompt.
- Chunk fragmentation — the relevant fact is split across chunk boundaries. Add overlap or chunk on semantic boundaries.
- Distractors — irrelevant-but-similar chunks crowd out the right one. Add reranking and pass fewer chunks to the prompt.
- Stale index — the corpus changed but the index didn’t. Re-embed on update; track provenance.
- Context overflow — too many chunks; the model ignores the middle. Retrieve broadly, then prune hard.
- No grounding signal — the model can’t tell retrieved fact from prior. Cite sources inline; instruct it to answer only from context and say “not found” otherwise.
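For the stale-index failure above, one possible approach is to store a content hash per document at index time and compare it against the current corpus to decide what to re-embed. The sketch below assumes the corpus fits in a dict of `doc_id -> text`; the function names and the stored-hash layout are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(corpus, indexed_hashes):
    """Compare the live corpus with hashes recorded at index time.

    corpus:         {doc_id: current text}
    indexed_hashes: {doc_id: hash stored when the document was last embedded}
    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = [doc_id for doc_id, text in corpus.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in corpus]
    return sorted(to_embed), sorted(to_delete)


corpus = {"a": "unchanged text", "b": "edited text", "c": "brand new doc"}
indexed_hashes = {
    "a": content_hash("unchanged text"),
    "b": content_hash("original text"),
    "old": content_hash("removed doc"),
}
to_embed, to_delete = plan_reindex(corpus, indexed_hashes)
```

Keeping the hash alongside each chunk also serves as provenance: a retrieved chunk can be traced back to the exact document version it was embedded from.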
How to evaluate
Evaluate the retriever and the generator separately, then end-to-end.
- Retriever: recall@k and precision@k against a labeled set of (query → relevant chunk) pairs. If recall@k is low, no prompt engineering will save you.
- Generator: faithfulness (is the answer supported by the retrieved context?) and answer relevance (does it address the query?).
- End-to-end: task success on a held-out question set, ideally with human or model-graded judgments. Always inspect failures by hand — aggregate metrics hide systematic retrieval gaps.
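The retriever metrics above are simple to compute once a labeled set exists. A minimal sketch for a single query, assuming `retrieved` is the ranked list of chunk ids returned by the retriever and `relevant` is the labeled set of correct chunk ids (both names, and the ids, are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k slots that hold a relevant chunk."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


retrieved = ["c1", "c7", "c3", "c9"]
relevant = {"c3", "c4"}
recall = recall_at_k(retrieved, relevant, k=3)        # 1 of 2 relevant found: 0.5
precision = precision_at_k(retrieved, relevant, k=3)  # 1 of 3 slots relevant
```

Averaging these over every query in the labeled set gives the headline number, but the per-query values are what point to systematic gaps, such as the chunk `c4` never being retrieved here.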
When to use RAG vs. alternatives
- Use RAG when the knowledge base is too large to fit in context, changes often, or must be attributable to sources.
- Use long-context prompting (paste it all in) when the relevant material is small and stable — simpler, no retriever to maintain.
- Use fine-tuning to change behavior, format, or style — not to inject facts. Combine with RAG when you need both.
What an agent should remember
- RAG = retrieve external passages, then generate conditioned on them. It edits the input, not the weights.
- Quality is dominated by retrieval. Debug the retriever first.
- Hybrid retrieval + reranking is a strong, boring default.
- Chunking and context ordering are high-leverage and easy to get wrong.
- Put the best evidence at the edges of the context; prune aggressively.
- Evaluate retriever and generator separately before trusting end-to-end numbers.