RAG (Retrieval-Augmented Generation)
Giving a language model access to external knowledge at the moment it answers a question, instead of relying solely on what it memorised during training.
What is it?
A large language model knows only what was in its training data. That data has a cutoff date, may contain errors, and cannot include private or proprietary information that was never published. When you ask an LLM a question that requires knowledge beyond its training — yesterday’s news, your company’s internal policies, the contents of a specific document — it has two options: admit it does not know, or guess. The guessing is what produces hallucinations.1
Retrieval-Augmented Generation (RAG) solves this by adding a retrieval step before generation. Instead of answering from memory alone, the system first searches an external knowledge source (a database, a document collection, a knowledge graph, a search engine), retrieves the most relevant information, and then feeds that information to the LLM alongside the original question. The model generates its answer grounded in the retrieved evidence rather than relying on what it memorised.2
The concept was introduced by Lewis et al. at Facebook AI Research in 2020. Their key insight was the distinction between parametric knowledge (what the model has encoded in its neural network weights during training) and non-parametric knowledge (what can be looked up at query time from an external source). RAG combines both: the model uses its parametric knowledge for language understanding and reasoning, and non-parametric knowledge for factual grounding.3
The parent concept, llm-pipelines, frames RAG as the most common tool-use pattern — a pipeline stage that extends the model beyond text-in, text-out by connecting it to external data sources. RAG is not a single tool call; it is a multi-step pipeline in itself: retrieve, augment, generate.
In plain terms
RAG is like an open-book exam. Without RAG, the LLM takes a closed-book exam — it can only use what it memorised. With RAG, the LLM gets to look things up in a reference book before answering. The answer is still in the model’s own words, but the facts come from the reference material.
At a glance
The three stages of RAG (click to expand)
```mermaid
graph LR
    Q[User Question] --> R[Retrieve]
    R -->|search| KB[Knowledge Base]
    KB -->|relevant docs| A[Augment]
    A -->|question + context| G[Generate]
    G --> ANS[Grounded Answer]
    R -.->|find relevant information| R
    A -.->|combine question with evidence| A
    G -.->|answer using retrieved context| G
```

Key: The user’s question triggers a retrieval step that searches the knowledge base. The most relevant documents are combined with the original question (augmentation). The LLM generates its answer using both the question and the retrieved evidence (generation). The answer is grounded in external knowledge, not just the model’s training data.
How does it work?
RAG operates as a three-stage pipeline. Each stage has a distinct job, and the output of each becomes the input of the next.
1. Retrieve — find the relevant information
The retrieval stage takes the user’s question and searches a knowledge base for the most relevant documents, passages, or data points. This is the “R” in RAG, and it is where the system connects to external knowledge.2
The knowledge base can be anything searchable: a vector database of document embeddings, a traditional search index, a knowledge graph, a SQL database accessed via an API, or even a live web search engine. What matters is that the retrieval mechanism returns content that is relevant to the question.4
The most common approach uses vector similarity search. Documents in the knowledge base are converted into numerical representations (embeddings) that capture their meaning. The user’s question is also converted into an embedding. The system then finds the documents whose embeddings are closest to the question’s embedding — the ones most semantically similar.4
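The core of vector similarity search can be sketched in a few lines. This is a minimal illustration with hand-written 3-dimensional vectors; a real system would use an embedding model producing hundreds of dimensions and a vector database for the search, and the document IDs here are invented for the example:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(question_vec, doc_vecs, top_k=3):
    # Rank documents by similarity to the question embedding,
    # then keep the top_k closest.
    scored = sorted(doc_vecs.items(),
                    key=lambda item: cosine_similarity(question_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy "embeddings" — a real embedding model assigns these automatically.
docs = {
    "hr-policy": [0.9, 0.1, 0.0],
    "benefits-faq": [0.7, 0.3, 0.1],
    "it-security": [0.0, 0.2, 0.9],
}
question = [0.8, 0.2, 0.0]
print(retrieve(question, docs, top_k=2))  # → ['hr-policy', 'benefits-faq']
```

The question vector lands closest to the two HR-related documents, so they are returned first; the semantically unrelated security document is ranked last.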
Think of it like...
A librarian who hears your question, goes into the stacks, and comes back with the three most relevant books opened to the right pages. The librarian does not answer the question — they find the material that contains the answer.
Example: retrieval in action (click to expand)
Consider a company with 10,000 internal policy documents. A user asks: “What is our parental leave policy?”
| Step | What happens |
|---|---|
| Embed the question | Convert “What is our parental leave policy?” into a vector |
| Search the knowledge base | Find the 3-5 documents whose vectors are most similar |
| Return results | HR Policy v4.2, section 12 (parental leave); Employee Handbook, chapter 8; Benefits FAQ, question 47 |

The retrieval step does not understand the content — it finds the documents most likely to contain the answer based on semantic similarity.
2. Augment — combine the question with the evidence
The augmentation stage takes the retrieved documents and combines them with the original question into a single prompt that the LLM will process. This is where the “A” in RAG happens: the model’s context is augmented with external knowledge.2
A typical augmented prompt looks like this:
Based on the following documents, answer the user's question.
[Document 1: HR Policy v4.2, Section 12...]
[Document 2: Employee Handbook, Chapter 8...]
[Document 3: Benefits FAQ, Question 47...]
Question: What is our parental leave policy?
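Assembling such a prompt is essentially string formatting over ranked documents. A minimal sketch, with placeholder document texts standing in for the real retrieved passages:

```python
def build_augmented_prompt(question, documents):
    # documents: list of (source_label, text) pairs, already ranked by relevance.
    context = "\n\n".join(f"[{label}: {text}]" for label, text in documents)
    return (
        "Based on the following documents, answer the user's question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

prompt = build_augmented_prompt(
    "What is our parental leave policy?",
    [("HR Policy v4.2, Section 12", "Parental leave is 16 weeks paid..."),
     ("Benefits FAQ, Question 47", "Leave starts from the date of birth...")],
)
print(prompt)
```

Labelling each document with its source is what later lets the model cite where a claim came from.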
The augmentation step is where context-cascading meets RAG. The retrieved documents become part of the task context — a dynamically assembled layer that changes with every question. The quality of this step depends on how well the retrieved documents are selected, ranked, and formatted.5
Think of it like...
The librarian places the relevant books on your desk, open to the right pages, and says “Here is what I found — now answer your question using these.” The books are the augmentation; your reading and answering is the generation.
Key distinction
Augmentation is not just concatenation. Good augmentation involves ranking documents by relevance, truncating to fit the model’s context window, and formatting the context so the model can easily distinguish source material from the question. Poor augmentation dumps irrelevant text into the prompt and degrades the answer.
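One concrete part of that work is fitting ranked documents into a limited context window. A minimal sketch of budget-aware selection, where word count stands in for a real tokenizer:

```python
def fit_to_budget(ranked_docs, token_budget, count_tokens=lambda t: len(t.split())):
    # Keep the highest-ranked documents that fit within the context budget.
    # Dropping whole documents is the simplest policy; a real system might
    # instead truncate the last document or summarise the overflow.
    selected, used = [], 0
    for doc in ranked_docs:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            break
        selected.append(doc)
        used += cost
    return selected

docs = ["a b c", "d e", "f g h i"]  # ranked best-first
print(fit_to_budget(docs, 5))  # → ['a b c', 'd e']
```

Because the list is ranked best-first, whatever gets cut is always the least relevant material.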
3. Generate — answer using the retrieved context
The generation stage is where the LLM produces its response. The model receives the augmented prompt — the question plus the retrieved evidence — and generates an answer grounded in that evidence. This is the “G” in RAG.2
The model is not just regurgitating the retrieved text. It is synthesising, summarising, and reasoning over the evidence to produce a coherent answer in natural language. It uses its parametric knowledge (language understanding, reasoning ability, writing skill) combined with the non-parametric knowledge (the retrieved documents) to produce an answer that is both fluent and factually grounded.3
When RAG works well, the model can cite its sources: “According to HR Policy v4.2, parental leave is 16 weeks…” This traceability is one of RAG’s most valuable properties — the user can verify the answer against the original source.1
Example: generation with and without RAG (click to expand)
Without RAG (closed-book): “Parental leave policies vary by company. Typically, companies offer 12-16 weeks…” (generic, possibly wrong for this specific company)
With RAG (open-book): “According to HR Policy v4.2, section 12, our parental leave policy provides 16 weeks of paid leave for primary caregivers and 4 weeks for secondary caregivers, effective from the date of birth or adoption.” (specific, traceable, grounded in the actual document)
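The three stages compose into one pipeline. A runnable end-to-end sketch, with two deliberate simplifications: a crude keyword-overlap retriever stands in for embedding search, and a stub callable stands in for the LLM so the example runs without an API key:

```python
def rag_answer(question, documents, generate, top_k=1):
    # Retrieve: rank documents by word overlap with the question
    # (a stand-in for vector similarity search).
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    # Augment: combine the top documents with the question.
    context = "\n".join(ranked[:top_k])
    prompt = f"Based on these documents:\n{context}\n\nQuestion: {question}"
    # Generate: `generate` is any callable wrapping an LLM.
    return generate(prompt)

echo = lambda prompt: prompt  # stub "LLM" that returns its prompt verbatim
docs = ["Parental leave policy: 16 weeks paid.",
        "Office dress code: casual."]
answer = rag_answer("What is the parental leave policy?", docs, echo)
```

Swapping the overlap ranking for embeddings and the stub for a real model client turns this skeleton into a working RAG system; the retrieve-augment-generate shape stays the same.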
Parametric vs non-parametric knowledge
This distinction is central to understanding why RAG exists and what it solves.3
| | Parametric knowledge | Non-parametric knowledge |
|---|---|---|
| Where it lives | Encoded in the model’s weights | Stored in an external source |
| When it was learned | During training (has a cutoff date) | Retrieved at query time (always current) |
| Can it be updated? | Only by retraining (expensive) | Yes, by updating the knowledge base (cheap) |
| Scope | Whatever was in the training data | Whatever you put in the knowledge base |
| Example | “Python is a programming language” | “Our Q3 revenue was CHF 2.4 million” |
RAG’s power comes from combining both. The model uses parametric knowledge to understand language, reason, and generate fluent text. It uses non-parametric knowledge to ground its answers in specific, current, verifiable facts.3
Key distinction
Retraining a model to update its knowledge is like reprinting an entire textbook to fix one fact. RAG is like giving the textbook reader access to a search engine — the textbook stays the same, but the reader can always look up the latest information.
Why RAG reduces hallucination
LLMs hallucinate when they generate plausible-sounding text that is not grounded in fact. This happens most often when the model is asked about topics outside its training data or when it lacks confidence and fills the gap with statistically likely but factually wrong text.1
RAG reduces hallucination by giving the model evidence to anchor its response. When the prompt contains the actual text of a policy document, the model is far less likely to invent a policy. The retrieved context acts as a constraint — it narrows the space of plausible answers from “anything that sounds right” to “anything supported by this evidence.”5
RAG does not eliminate hallucination entirely. The model can still misinterpret the retrieved documents, ignore relevant passages, or generate answers that go beyond what the evidence supports. But empirical results consistently show that RAG systems produce significantly fewer factual errors than models operating without retrieval.1
Why do we use it?
Key reasons
1. Access to current information. Training data has a cutoff date. RAG lets the model answer questions about events, documents, and data that did not exist when it was trained, by retrieving current information at query time.1
2. Reduced hallucination. By grounding answers in retrieved evidence, RAG significantly reduces the frequency of the model generating plausible but false information. The evidence acts as a factual constraint.5
3. Verifiable answers. RAG enables citation and source attribution. The user can trace the answer back to the original document, which builds trust and enables fact-checking.2
4. Domain-specific knowledge without retraining. Instead of fine-tuning or retraining a model on proprietary data (expensive and time-consuming), you can give it access to that data through retrieval. Update the knowledge base and the model immediately has access to the new information.4
When do we use it?
- When the LLM needs to answer questions about private or proprietary data (company policies, internal documentation, customer records)
- When factual accuracy matters and hallucination is unacceptable (legal, medical, financial contexts)
- When the required knowledge changes frequently and retraining is impractical
- When users need to verify answers against original sources
- When building documentation chatbots, enterprise search, or knowledge assistants
- When augmenting a general-purpose model with domain-specific expertise
Rule of thumb
If the answer to a question exists in a specific document or database and the LLM does not have that document in its training data, RAG is the right pattern. If the question requires general reasoning or creative generation rather than factual recall, RAG adds complexity without much benefit.
How can I think about it?
The open-book exam
RAG is an open-book exam for an LLM.
- In a closed-book exam (standard LLM), the student relies entirely on memory. If they studied the right material, they answer well. If not, they guess — and the guesses sound confident but may be wrong.
- In an open-book exam (RAG), the student can look up information in reference materials during the test. They still need to understand the question, find the right pages, and synthesise an answer — but the facts come from the book, not from memory.
- The retrieval step is flipping to the right chapter. The augmentation step is reading the relevant passages. The generation step is writing the answer in your own words based on what you read.
- The quality of the exam depends on both the student’s skill (the model) and the quality of the reference books (the knowledge base). A brilliant student with a bad textbook still struggles.
The research assistant
RAG is like having a research assistant who works alongside you.
- You ask a question: “What were the key findings of the 2025 climate report?”
- Your assistant goes to the library (retrieval), finds the report, and brings back the relevant sections
- They place the excerpts on your desk with the question at the top (augmentation)
- You read the excerpts and write a summary in your own words (generation)
- If someone challenges your summary, you can point to the original report as your source (citation and verifiability)
- Without the assistant, you would have to answer from memory — and if you never read the report, you would be guessing
The assistant does not write the summary. They ensure you have the right materials to write an accurate one. That is exactly what the retrieval step does for the LLM.
NotebookLM as a RAG system (click to expand)
Google’s NotebookLM is a consumer-facing RAG implementation. You upload documents (PDFs, notes, web pages) as your knowledge base. When you ask NotebookLM a question, it retrieves relevant passages from your uploaded documents and generates an answer grounded in that specific material — not from its general training data. It even provides citations showing which document and passage each claim comes from. This is RAG in action: your documents are the non-parametric knowledge, and the model’s language ability is the parametric knowledge.
Yiuno example: graph.json as a retrieval source (click to expand)
This knowledge system’s `graph.json` file is a structured knowledge base that could power a RAG system. It stores every concept, its relationships, prerequisites, and metadata in JSON format. An LLM-powered learning assistant could:

- Retrieve — search `graph.json` for concepts related to a user’s question
- Augment — load the relevant concept cards as context
- Generate — answer the question using the actual card content, not generic training data
The graph structure enables targeted retrieval: instead of searching every card, the system can traverse relationships (parent, children, prerequisites) to find the most relevant context.
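Relationship-based retrieval can be sketched in a few lines. The miniature graph below is an invented stand-in — the real `graph.json` schema may differ — but it shows the idea of following edges instead of scanning every card:

```python
import json

# A miniature stand-in for graph.json; the actual schema is an assumption here.
graph = json.loads("""
{
  "rag": {"parent": "llm-pipelines",
          "prerequisites": ["knowledge-graphs", "json"],
          "summary": "Retrieval-Augmented Generation"},
  "llm-pipelines": {"parent": null, "prerequisites": [],
                    "summary": "Chaining LLM stages"},
  "knowledge-graphs": {"parent": null, "prerequisites": [],
                       "summary": "Structured knowledge"},
  "json": {"parent": null, "prerequisites": [],
           "summary": "Data interchange format"}
}
""")

def related_concepts(concept_id, graph):
    # Targeted retrieval: follow the concept's edges (parent, prerequisites)
    # rather than searching the whole collection.
    node = graph[concept_id]
    neighbours = []
    if node["parent"]:
        neighbours.append(node["parent"])
    neighbours.extend(node["prerequisites"])
    return neighbours

print(related_concepts("rag", graph))
# → ['llm-pipelines', 'knowledge-graphs', 'json']
```

The cards returned by this traversal would then feed the augmentation step, exactly as retrieved documents do in a document-based RAG system.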
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| knowledge-graphs | The structured data sources that RAG systems can retrieve from | complete |
| json | The data format commonly used to store and exchange retrieved knowledge | complete |
| apis | The interfaces through which RAG systems query external data sources | complete |
| databases | The storage systems that serve as knowledge bases for retrieval | stub |
Some cards don't exist yet
A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain what RAG does and why it exists. What problem does it solve that a standard LLM cannot?
- Name the three stages of a RAG pipeline and describe what each one does.
- Distinguish between parametric knowledge and non-parametric knowledge. Why does this distinction matter for understanding when RAG is useful?
- Interpret this scenario: a RAG-powered documentation chatbot retrieves the correct policy document but generates an answer that contradicts the document. Where in the pipeline did the error occur, and what might have caused it?
- Connect RAG to the concept of knowledge graphs. How could a knowledge graph improve the retrieval step compared to a simple keyword search?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    LP[LLM Pipelines] --> PR[Prompt Routing]
    LP --> CC[Context Cascading]
    LP --> RAG[RAG]
    KG[Knowledge Graphs] -.->|prerequisite| RAG
    JSON[JSON] -.->|prerequisite| RAG
    style RAG fill:#4a9ede,color:#fff
```

Related concepts:
- apis — RAG systems use APIs to query external knowledge sources (search engines, databases, document stores)
- databases — the storage layer that holds the knowledge base a RAG system retrieves from
- machine-readable-formats — retrieved data must be in a format the system can parse and inject into the prompt
- context-cascading — the augmentation step in RAG is a dynamic form of context loading, assembling task-specific context at query time
Sources
Further reading
Resources
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) — The original RAG paper from Facebook AI Research that introduced the concept and the parametric/non-parametric distinction
- What is RAG? (AWS) — Clear, authoritative explanation from AWS with architecture diagrams and use cases
- RAG Explained: Complete Guide 2026 (AI Agents Plus) — Comprehensive 2026 guide covering architecture, implementation patterns, and common pitfalls
- RAG Architecture Explained for Engineers (AI Engineer Lab) — Production-focused guide covering vector databases, embedding strategies, and retrieval optimisation
- Retrieval-Augmented Generation for Beginners (ADevGuide) — Beginner-friendly walkthrough with step-by-step examples
Footnotes
1. AWS. (2026). What is RAG? Retrieval-Augmented Generation AI Explained. Amazon Web Services.
2. AI Agents Plus Editorial. (2026). RAG Explained: Complete Retrieval Augmented Generation Guide 2026. AI Agents Plus.
3. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W., Rocktaschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
4. AI Engineer Lab. (2026). RAG Architecture Explained for Engineers. AI Engineer Lab.
5. PE Collective. (2026). RAG Architecture: How to Build Retrieval-Augmented Generation Systems. PE Collective.