The Idea
AI models make things up. They generate confident-sounding answers about facts they don't actually know, dates they've never seen, and documents they can't access. This is the fundamental hallucination problem, and Retrieval-Augmented Generation (RAG) is the most widely deployed solution.
The concept is straightforward: before the AI answers your question, search a knowledge base for relevant passages and include them in the prompt. Now the model generates an answer grounded in actual source material rather than relying solely on what it memorized during training. The difference is like asking someone to answer from memory versus handing them the reference book first.
Building Blocks
This composition builds on:
- Give It the Source
- Recall First
- Index First

RAG combines document indexing (prepare knowledge for search), retrieval (find what's relevant), and context augmentation (give the AI real sources to cite) into a production-ready pipeline.
The Pipeline
RAG isn't one step — it's a pipeline where each stage matters.
Chunk
Split documents into smaller pieces. Too large and they dilute the context. Too small and they lose meaning.
Embed
Convert each chunk into a numerical vector — a mathematical fingerprint that captures its meaning.
Retrieve
When a question arrives, embed it too and find the chunks with the closest vectors. Fast, but can be imprecise.
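As a toy illustration of this step (a bag-of-words count vector stands in for a real learned embedding model, and names like `embed` and `retrieve` are illustrative), nearest-vector retrieval might look like:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector.
    Real systems use a learned embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks whose vectors are closest to the question's."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Note how "begin" fails to match "begins" here: word-level vectors are brittle in exactly the way learned embeddings are not, which is why production systems use the latter.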
Rerank
Score each retrieved chunk against the actual question. This precision step commonly improves retrieval quality substantially, with reported gains in the 30–40% range.
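Production rerankers are cross-encoder models that score the (question, chunk) pair jointly; as a minimal stand-in sketch, a lexical-overlap scorer shows the shape of the step (the `rerank` function and its scoring rule are illustrative, not a real reranker):

```python
def rerank(question, chunks, top_n=3):
    """Re-score retrieved chunks against the question and keep the best.
    Stand-in scorer: fraction of question terms present in the chunk.
    Production systems use a cross-encoder model for this step."""
    q_terms = set(question.lower().split())

    def score(chunk):
        c_terms = set(chunk.lower().split())
        return len(q_terms & c_terms) / len(q_terms)

    return sorted(chunks, key=score, reverse=True)[:top_n]
```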
Generate
Feed the top chunks into the prompt alongside the question. The AI answers grounded in real sources.
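The augmentation step is just prompt construction. A minimal sketch (the prompt wording and `build_prompt` name are illustrative):

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt: numbered source passages first,
    then the question, with an instruction to cite the sources."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the sources below. Cite them by number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources is what lets the model produce checkable citations like "[2]" instead of unverifiable claims.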
See It in Action
Consider an employee asking about their company's parental leave policy. Retrieval surfaces the most relevant policy chunks:
Chunk 2 (HR Policy 4.3.1): "Parental leave applies to birth, adoption, and foster placement. Leave begins on the date of the qualifying event..."
Chunk 3 (HR Policy 4.5): "Part-time employees receive prorated leave based on average weekly hours over the previous 6 months..."
Without RAG, the AI would guess a generic answer. With RAG, it cites the exact policy.
Chunking Strategies
How you split documents is one of the biggest decisions in a RAG system. Get it wrong and even perfect retrieval won't help.
Fixed-Size
Split every 512 tokens with 50-token overlap. Easy to implement but may break mid-sentence or mid-thought.
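A fixed-size chunker is a few lines; this sketch approximates tokens with whitespace-separated words (real systems count model tokens):

```python
def chunk_text(text, chunk_size=512, overlap=50):
    """Split text into fixed-size chunks with `overlap` tokens shared
    between consecutive chunks, so a thought cut at one boundary
    survives intact in the next chunk. Tokens are approximated here
    by whitespace-separated words."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap is the mitigation for the mid-thought breakage mentioned above: whatever a boundary severs appears whole in the neighboring chunk.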
Semantic
Split by paragraph or section boundaries. Preserves meaning and context, but chunk sizes vary.
Sentence-Based
Group 5 sentences per chunk. Natural boundaries with consistent sizes. A solid middle ground.
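A sentence-based chunker can be sketched as follows; the regex sentence splitter is a naive assumption (production systems often use an NLP tokenizer for robustness):

```python
import re

def sentence_chunks(text, per_chunk=5):
    """Group sentences into chunks of `per_chunk` sentences each.
    Sentences are split on whitespace that follows ., !, or ? --
    a naive rule that a real NLP tokenizer would replace."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]
```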
Parent-Child
Small chunks for precise retrieval, but return the larger parent chunk for context. Best of both worlds.
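The strategies above can be sketched in miniature for parent-child: index small children but remember which parent each came from, then match on the child and return the parent (function names, the word-window child splitter, and the overlap-based matcher are all illustrative assumptions):

```python
def build_parent_child_index(parents, child_size=3):
    """Index small child chunks (here: `child_size`-word windows),
    each tagged with the id of the parent it came from."""
    index = []  # list of (child_chunk, parent_id)
    for pid, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_size):
            index.append((" ".join(words[i:i + child_size]), pid))
    return index

def retrieve_parent(query, index, parents):
    """Match the query against the small children for precision,
    but return the full parent chunk for context."""
    q_terms = set(query.lower().split())
    _, pid = max(index,
                 key=lambda e: len(q_terms & set(e[0].lower().split())))
    return parents[pid]
```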
RAG Variants
The basic "retrieve then generate" pattern has evolved into several specialized variants.
Naive RAG
Retrieve top-K chunks, stuff into prompt, generate. Simple and effective for many use cases.
Self-RAG
The model decides if it needs retrieval. Skips the search for questions it already knows. Evaluates its own answers for groundedness.
Agentic RAG
An agent decides what, when, and how to retrieve in a loop. Can reformulate queries, try different sources, and validate results.
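The agentic control loop can be sketched as below; `search`, `generate`, `is_grounded`, and `reformulate` are hypothetical hooks the caller supplies (in practice these wrap a retriever, an LLM call, a groundedness check, and a query-rewriting step):

```python
def agentic_rag(question, search, generate, is_grounded, reformulate,
                max_rounds=3):
    """Agentic RAG control loop (sketch). The agent retrieves, drafts
    an answer, checks whether it is grounded in the retrieved chunks,
    and reformulates the query for another attempt if not."""
    query = question
    for _ in range(max_rounds):
        chunks = search(query)
        answer = generate(question, chunks)
        if is_grounded(answer, chunks):
            return answer
        query = reformulate(question, chunks)  # try a different angle
    return answer  # best effort after max_rounds
```

The loop bound matters: without `max_rounds`, a query the corpus cannot answer would retry forever.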
Common RAG Mistakes
- Chunks too large — the relevant sentence gets buried in paragraphs of irrelevant text, diluting the signal
- Chunks too small — you retrieve the right sentence but lose the surrounding context needed to understand it
- No reranking — vector search is fast but imprecise; without reranking, irrelevant chunks crowd out useful ones
- Wrong embedding model — a general-purpose embedding may not capture domain-specific terminology well
Why This Works
RAG works because it plays to the strengths of both search and generation. Search engines are excellent at finding relevant documents but terrible at synthesizing answers. Language models are excellent at synthesis but unreliable at recalling specific facts. RAG combines them: let search handle the facts, let the model handle the language.
There's a deeper reason too: grounding reduces hallucination because the model has less need to "fill in" from memory. When relevant source text is right there in the prompt, the path of least resistance is to paraphrase and cite rather than fabricate.
The Composition
Don't ask the AI to answer from memory. Search your documents first, retrieve the most relevant passages, and put them in the prompt. The model generates answers grounded in real sources — not guesses.
When to Use This
- Enterprise knowledge bases — internal docs, policies, product catalogs
- Customer support where answers must be accurate and citable
- Questions about information beyond the model's training cutoff
- Domain-specific applications (legal, medical, financial) where precision matters
- Any situation where "I don't know" is better than a confident wrong answer
When to Skip This
- Creative tasks — grounding can constrain imagination; brainstorming doesn't need citations
- Common knowledge — the overhead of retrieval isn't worth it for questions the model already knows well
- Tiny document sets — if everything fits in the context window, just include it directly
- Poorly curated sources — RAG grounded in bad documents produces confidently wrong answers with citations
How It Relates
RAG is the production evolution of the single-prompt technique Give It the Source. Where that technique manually pastes context into a prompt, RAG automates the process: finding, ranking, and inserting the right context dynamically.
It connects to ReAct when agents decide what to retrieve (Agentic RAG), to Plan-and-Execute when complex queries require multi-step retrieval strategies, and to Self-Ask when the system generates sub-questions to retrieve different aspects of an answer.