The Idea

AI models make things up. They generate confident-sounding answers about facts they don't actually know, dates they've never seen, and documents they can't access. This is the fundamental hallucination problem, and Retrieval-Augmented Generation (RAG) is the most widely deployed solution.

The concept is straightforward: before the AI answers your question, search a knowledge base for relevant passages and include them in the prompt. Now the model generates an answer grounded in actual source material rather than relying solely on what it memorized during training. The difference is like asking someone to answer from memory versus handing them the reference book first.

Building Blocks

This composition builds on:

- Give It the Source
- Recall First
- Index First

RAG combines document indexing (prepare knowledge for search), retrieval (find what's relevant), and context augmentation (give the AI real sources to cite) into a production-ready pipeline.

The Pipeline

RAG isn't one step — it's a pipeline where each stage matters.

1. Chunk: Split documents into smaller pieces. Too large and they dilute the context; too small and they lose meaning.

2. Embed: Convert each chunk into a numerical vector — a mathematical fingerprint that captures its meaning.

3. Retrieve: When a question arrives, embed it too and find the chunks with the closest vectors. Fast, but can be imprecise.

4. Rerank: Score each retrieved chunk against the actual question. This precision step improves results by 30–40%.

5. Generate: Feed the top chunks into the prompt alongside the question. The AI answers grounded in real sources.
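The five stages above can be sketched end to end in a few functions. This is a minimal illustration, not a production implementation: the bag-of-words `embed` is a toy stand-in for a real embedding model, and all function names are invented for this sketch.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": a word-count vector. A real
    # pipeline would call an embedding model and get a dense vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def chunk(doc, size=40):
    # Stage 1: fixed-size chunking by word count.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, chunks, k=3):
    # Stages 2-3: embed the question, return the k closest chunks.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, top_chunks):
    # Stage 5: the model sees real sources alongside the question.
    sources = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(top_chunks))
    return f"Answer using only these sources:\n\n{sources}\n\nQuestion: {question}"
```

Reranking (stage 4) would slot in between `retrieve` and `build_prompt`, rescoring the candidates with a more precise model.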

See It in Action

An employee asks a question about their company's parental leave policy.

User question: "How many weeks of parental leave do I get if I've been here for two years?"

Search the company handbook. Top 3 matches from the knowledge base:

Chunk 1 (HR Policy 4.3): "Employees with 1+ years of tenure receive 12 weeks paid parental leave. Employees with 3+ years receive 16 weeks..."

Chunk 2 (HR Policy 4.3.1): "Parental leave applies to birth, adoption, and foster placement. Leave begins on the date of the qualifying event..."

Chunk 3 (HR Policy 4.5): "Part-time employees receive prorated leave based on average weekly hours over the previous 6 months..."

Grounded answer (AI response): "With two years of tenure, you qualify for 12 weeks of paid parental leave (per HR Policy 4.3). The 16-week tier requires 3+ years. This applies to birth, adoption, or foster placement, starting from the date of the qualifying event."

Without RAG, the AI would guess a generic answer. With RAG, it cites the exact policy.
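The grounding step in this example can be sketched as a prompt template. The wording and the label-based citation format are illustrative choices, not a standard:

```python
def grounded_prompt(question, sources):
    # `sources` maps a citation label (e.g. "HR Policy 4.3") to chunk
    # text, so the model can cite real labels instead of inventing them.
    listing = "\n".join(f"- ({label}) {text}" for label, text in sources.items())
    return (
        "Answer using ONLY the sources below. Cite the source label "
        "for each claim. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{listing}\n\n"
        f"Question: {question}"
    )
```

The explicit "say so" instruction matters: it gives the model a sanctioned way out when retrieval misses, instead of pushing it back toward guessing.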

Chunking Strategies

How you split documents is one of the biggest decisions in a RAG system. Get it wrong and even perfect retrieval won't help.

Fixed-size (simple): Split every 512 tokens with 50-token overlap. Easy to implement but may break mid-sentence or mid-thought.

Semantic (better): Split by paragraph or section boundaries. Preserves meaning and context, but chunk sizes vary.

Sentence-based (balanced): Group 5 sentences per chunk. Natural boundaries with consistent sizes. A solid middle ground.

Parent-child (advanced): Small chunks for precise retrieval, but return the larger parent chunk for context. Best of both worlds.
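The first and third strategies are simple enough to sketch directly. The sizes below are the ones mentioned above, not universal defaults, and the naive regex sentence splitter is a stand-in for a real tokenizer:

```python
import re

def fixed_size_chunks(tokens, size=512, overlap=50):
    # Slide a window of `size` tokens, stepping forward by
    # `size - overlap` so consecutive chunks share context.
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]

def sentence_chunks(text, per_chunk=5):
    # Group whole sentences: natural boundaries, roughly even sizes.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]
```

Parent-child chunking builds on the same machinery: index the small chunks for retrieval, but keep a pointer from each one back to its enclosing section so the generator sees the full context.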

RAG Variants

The basic "retrieve then generate" pattern has evolved into several specialized variants.

Naive RAG (basic): Retrieve top-K chunks, stuff them into the prompt, generate. Simple and effective for many use cases.

Self-RAG (self-aware): The model decides if it needs retrieval. Skips the search for questions it already knows. Evaluates its own answers for groundedness.

Agentic RAG (autonomous): An agent decides what, when, and how to retrieve in a loop. Can reformulate queries, try different sources, and validate results.
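The Self-RAG control flow (decide, retrieve, critique) can be sketched as below. Here `llm` and `retrieve` are placeholder callables standing in for any chat model and any retriever, and the yes/no prompts are a simplification of how real systems elicit these decisions:

```python
def self_rag_answer(question, llm, retrieve):
    # `llm(prompt) -> str` and `retrieve(q) -> list[str]` are
    # placeholders for a real model call and retriever.
    need = llm(f"Does answering this require looking up documents? yes/no: {question}")
    if not need.strip().lower().startswith("yes"):
        return llm(question)  # model is confident; skip the search

    sources = retrieve(question)
    draft = llm(f"Sources: {sources}\nQuestion: {question}")

    # Self-critique: is the draft actually supported by the sources?
    grounded = llm(f"Is this answer supported by the sources? yes/no\n"
                   f"Sources: {sources}\nAnswer: {draft}")
    if grounded.strip().lower().startswith("yes"):
        return draft
    return llm(f"Rewrite the answer using only these sources: {sources}\n"
               f"Question: {question}")
```

Agentic RAG wraps the same pieces in a loop: instead of one retrieve-critique pass, the agent keeps reformulating the query or switching sources until the critique passes or a budget runs out.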

Common RAG Mistakes

- Chunks too large — the relevant sentence gets buried in paragraphs of irrelevant text, diluting the signal.
- Chunks too small — you retrieve the right sentence but lose the surrounding context needed to understand it.
- No reranking — vector search is fast but imprecise; without reranking, irrelevant chunks crowd out useful ones.
- Wrong embedding model — a general-purpose embedding may not capture domain-specific terminology well.
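The reranking fix is worth making concrete: a reranker rescores each candidate against the question directly, rather than comparing pre-computed vectors. Production systems use a cross-encoder model for this; the lexical-overlap scorer below is only a stand-in to show where reranking sits in the flow:

```python
def rerank(question, candidates, top_n=3):
    # Jaccard overlap between question and chunk terms: a crude
    # stand-in for a cross-encoder relevance score.
    q_terms = set(question.lower().split())

    def score(chunk):
        c_terms = set(chunk.lower().split())
        union = q_terms | c_terms
        return len(q_terms & c_terms) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:top_n]
```

The usual pattern is to over-retrieve (say, 20 candidates by vector search) and then rerank down to the handful that actually enter the prompt.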

Why This Works

RAG works because it plays to the strengths of both search and generation. Search engines are excellent at finding relevant documents but terrible at synthesizing answers. Language models are excellent at synthesis but unreliable at recalling specific facts. RAG combines them: let search handle the facts, let the model handle the language.

There's a deeper reason too: grounding reduces hallucination because the model has less need to "fill in" from memory. When relevant source text is right there in the prompt, the path of least resistance is to paraphrase and cite rather than fabricate.

The Composition

Don't ask the AI to answer from memory. Search your documents first, retrieve the most relevant passages, and put them in the prompt. The model generates answers grounded in real sources — not guesses.

When to Use This

When to Skip This

How It Relates

RAG is the production evolution of the single-prompt technique Give It the Source. Where that technique manually pastes context into a prompt, RAG automates the process: finding, ranking, and inserting the right context dynamically.

It connects to ReAct when agents decide what to retrieve (Agentic RAG), to Plan-and-Execute when complex queries require multi-step retrieval strategies, and to Self-Ask when the system generates sub-questions to retrieve different aspects of an answer.