RAG (Retrieval-Augmented Generation)
Giving a language model access to external knowledge at the moment it answers a question, instead of relying solely on what it memorised during training.
What is it?
A large language model knows only what was in its training data. That data has a cutoff date, may contain errors, and cannot include private or proprietary information that was never published. When you ask an LLM a question that requires knowledge beyond its training — yesterday’s news, your company’s internal policies, the contents of a specific document — it has two options: admit it does not know, or guess. The guessing is what produces hallucinations.1
Retrieval-Augmented Generation (RAG) solves this by adding a retrieval step before generation. Instead of answering from memory alone, the system first searches an external knowledge source (a database, a document collection, a knowledge graph, a search engine), retrieves the most relevant information, and then feeds that information to the LLM alongside the original question. The model generates its answer grounded in the retrieved evidence rather than relying on what it memorised.2
The concept was introduced by Lewis et al. at Facebook AI Research in 2020. Their key insight was the distinction between parametric knowledge (what the model has encoded in its neural network weights during training) and non-parametric knowledge (what can be looked up at query time from an external source). RAG combines both: the model uses its parametric knowledge for language understanding and reasoning, and non-parametric knowledge for factual grounding.3
The parent concept, llm-pipelines, frames RAG as the most common tool-use pattern — a pipeline stage that extends the model beyond text-in, text-out by connecting it to external data sources. RAG is not a single tool call; it is a multi-step pipeline in itself: retrieve, augment, generate.
In plain terms
RAG is like an open-book exam. Without RAG, the LLM takes a closed-book exam — it can only use what it memorised. With RAG, the LLM gets to look things up in a reference book before answering. The answer is still in the model’s own words, but the facts come from the reference material.
At a glance
The three stages of RAG (click to expand)
```mermaid
graph LR
    Q[User Question] --> R[Retrieve]
    R -->|search| KB[Knowledge Base]
    KB -->|relevant docs| A[Augment]
    A -->|question + context| G[Generate]
    G --> ANS[Grounded Answer]
    R -.->|find relevant information| R
    A -.->|combine question with evidence| A
    G -.->|answer using retrieved context| G
```

Key: The user’s question triggers a retrieval step that searches the knowledge base. The most relevant documents are combined with the original question (augmentation). The LLM generates its answer using both the question and the retrieved evidence (generation). The answer is grounded in external knowledge, not just the model’s training data.
How does it work?
RAG operates as a three-stage pipeline. Each stage has a distinct job, and the output of each becomes the input of the next.
1. Retrieve — find the relevant information
The retrieval stage takes the user’s question and searches a knowledge base for the most relevant documents, passages, or data points. This is the “R” in RAG, and it is where the system connects to external knowledge.2
The knowledge base can be anything searchable: a vector database of document embeddings, a traditional search index, a knowledge graph, a SQL database accessed via an API, or even a live web search engine. What matters is that the retrieval mechanism returns content that is relevant to the question.4
The most common approach uses vector similarity search. Documents in the knowledge base are converted into numerical representations (embeddings) that capture their meaning. The user’s question is also converted into an embedding. The system then finds the documents whose embeddings are closest to the question’s embedding — the ones most semantically similar.4
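The core of vector similarity search can be sketched in a few lines. This is a minimal illustration with hand-written 3-dimensional vectors; a real system would use an embedding model producing hundreds of dimensions and a vector database for the search, and the document IDs here are invented for the example:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(question_vec, doc_vecs, top_k=3):
    # Rank documents by similarity to the question embedding,
    # then keep the top_k closest.
    scored = sorted(doc_vecs.items(),
                    key=lambda item: cosine_similarity(question_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy "embeddings" — a real embedding model assigns these automatically.
docs = {
    "hr-policy": [0.9, 0.1, 0.0],
    "benefits-faq": [0.7, 0.3, 0.1],
    "it-security": [0.0, 0.2, 0.9],
}
question = [0.8, 0.2, 0.0]
print(retrieve(question, docs, top_k=2))  # → ['hr-policy', 'benefits-faq']
```

The question vector lands closest to the two HR-related documents, so they are returned first; the semantically unrelated security document is ranked last.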
Think of it like...
A librarian who hears your question, goes into the stacks, and comes back with the three most relevant books opened to the right pages. The librarian does not answer the question — they find the material that contains the answer.
Example: retrieval in action (click to expand)
Consider a company with 10,000 internal policy documents. A user asks: “What is our parental leave policy?”
| Step | What happens |
|---|---|
| Embed the question | Convert “What is our parental leave policy?” into a vector |
| Search the knowledge base | Find the 3-5 documents whose vectors are most similar |
| Return results | HR Policy v4.2, section 12 (parental leave); Employee Handbook, chapter 8; Benefits FAQ, question 47 |

The retrieval step does not understand the content — it finds the documents most likely to contain the answer based on semantic similarity.
2. Augment — combine the question with the evidence
The augmentation stage takes the retrieved documents and combines them with the original question into a single prompt that the LLM will process. This is where the “A” in RAG happens: the model’s context is augmented with external knowledge.2
A typical augmented prompt looks like this:
Based on the following documents, answer the user's question.
[Document 1: HR Policy v4.2, Section 12...]
[Document 2: Employee Handbook, Chapter 8...]
[Document 3: Benefits FAQ, Question 47...]
Question: What is our parental leave policy?
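Assembling such a prompt is essentially string formatting over ranked documents. A minimal sketch, with placeholder document texts standing in for the real retrieved passages:

```python
def build_augmented_prompt(question, documents):
    # documents: list of (source_label, text) pairs, already ranked by relevance.
    context = "\n\n".join(f"[{label}: {text}]" for label, text in documents)
    return (
        "Based on the following documents, answer the user's question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

prompt = build_augmented_prompt(
    "What is our parental leave policy?",
    [("HR Policy v4.2, Section 12", "Parental leave is 16 weeks paid..."),
     ("Benefits FAQ, Question 47", "Leave starts from the date of birth...")],
)
print(prompt)
```

Labelling each document with its source is what later lets the model cite where a claim came from.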
The augmentation step is where context-cascading meets RAG. The retrieved documents become part of the task context — a dynamically assembled layer that changes with every question. The quality of this step depends on how well the retrieved documents are selected, ranked, and formatted.5
Think of it like...
The librarian places the relevant books on your desk, open to the right pages, and says “Here is what I found — now answer your question using these.” The books are the augmentation; your reading and answering is the generation.
Key distinction
Augmentation is not just concatenation. Good augmentation involves ranking documents by relevance, truncating to fit the model’s context window, and formatting the context so the model can easily distinguish source material from the question. Poor augmentation dumps irrelevant text into the prompt and degrades the answer.
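One concrete part of that work is fitting ranked documents into a limited context window. A minimal sketch of budget-aware selection, where word count stands in for a real tokenizer:

```python
def fit_to_budget(ranked_docs, token_budget, count_tokens=lambda t: len(t.split())):
    # Keep the highest-ranked documents that fit within the context budget.
    # Dropping whole documents is the simplest policy; a real system might
    # instead truncate the last document or summarise the overflow.
    selected, used = [], 0
    for doc in ranked_docs:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            break
        selected.append(doc)
        used += cost
    return selected

docs = ["a b c", "d e", "f g h i"]  # ranked best-first
print(fit_to_budget(docs, 5))  # → ['a b c', 'd e']
```

Because the list is ranked best-first, whatever gets cut is always the least relevant material.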
3. Generate — answer using the retrieved context
The generation stage is where the LLM produces its response. The model receives the augmented prompt — the question plus the retrieved evidence — and generates an answer grounded in that evidence. This is the “G” in RAG.2
The model is not just regurgitating the retrieved text. It is synthesising, summarising, and reasoning over the evidence to produce a coherent answer in natural language. It uses its parametric knowledge (language understanding, reasoning ability, writing skill) combined with the non-parametric knowledge (the retrieved documents) to produce an answer that is both fluent and factually grounded.3
When RAG works well, the model can cite its sources: “According to HR Policy v4.2, parental leave is 16 weeks…” This traceability is one of RAG’s most valuable properties — the user can verify the answer against the original source.1
Example: generation with and without RAG (click to expand)
Without RAG (closed-book): “Parental leave policies vary by company. Typically, companies offer 12-16 weeks…” (generic, possibly wrong for this specific company)
With RAG (open-book): “According to HR Policy v4.2, section 12, our parental leave policy provides 16 weeks of paid leave for primary caregivers and 4 weeks for secondary caregivers, effective from the date of birth or adoption.” (specific, traceable, grounded in the actual document)
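The three stages compose into one pipeline. A runnable end-to-end sketch, with two deliberate simplifications: a crude keyword-overlap retriever stands in for embedding search, and a stub callable stands in for the LLM so the example runs without an API key:

```python
def rag_answer(question, documents, generate, top_k=1):
    # Retrieve: rank documents by word overlap with the question
    # (a stand-in for vector similarity search).
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    # Augment: combine the top documents with the question.
    context = "\n".join(ranked[:top_k])
    prompt = f"Based on these documents:\n{context}\n\nQuestion: {question}"
    # Generate: `generate` is any callable wrapping an LLM.
    return generate(prompt)

echo = lambda prompt: prompt  # stub "LLM" that returns its prompt verbatim
docs = ["Parental leave policy: 16 weeks paid.",
        "Office dress code: casual."]
answer = rag_answer("What is the parental leave policy?", docs, echo)
```

Swapping the overlap ranking for embeddings and the stub for a real model client turns this skeleton into a working RAG system; the retrieve-augment-generate shape stays the same.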
Parametric vs non-parametric knowledge
This distinction is central to understanding why RAG exists and what it solves.3
| | Parametric knowledge | Non-parametric knowledge |
|---|---|---|
| Where it lives | Encoded in the model’s weights | Stored in an external source |
| When it was learned | During training (has a cutoff date) | Retrieved at query time (always current) |
| Can it be updated? | Only by retraining (expensive) | Yes, by updating the knowledge base (cheap) |
| Scope | Whatever was in the training data | Whatever you put in the knowledge base |
| Example | “Python is a programming language” | “Our Q3 revenue was CHF 2.4 million” |
RAG’s power comes from combining both. The model uses parametric knowledge to understand language, reason, and generate fluent text. It uses non-parametric knowledge to ground its answers in specific, current, verifiable facts.3
Key distinction
Retraining a model to update its knowledge is like reprinting an entire textbook to fix one fact. RAG is like giving the textbook reader access to a search engine — the textbook stays the same, but the reader can always look up the latest information.
Why RAG reduces hallucination
LLMs hallucinate when they generate plausible-sounding text that is not grounded in fact. This happens most often when the model is asked about topics outside its training data or when it lacks confidence and fills the gap with statistically likely but factually wrong text.1
RAG reduces hallucination by giving the model evidence to anchor its response. When the prompt contains the actual text of a policy document, the model is far less likely to invent a policy. The retrieved context acts as a constraint — it narrows the space of plausible answers from “anything that sounds right” to “anything supported by this evidence.”5
RAG does not eliminate hallucination entirely. The model can still misinterpret the retrieved documents, ignore relevant passages, or generate answers that go beyond what the evidence supports. But empirical results consistently show that RAG systems produce significantly fewer factual errors than models operating without retrieval.1
Why do we use it?
Key reasons
1. Access to current information. Training data has a cutoff date. RAG lets the model answer questions about events, documents, and data that did not exist when it was trained, by retrieving current information at query time.1
2. Reduced hallucination. By grounding answers in retrieved evidence, RAG significantly reduces the frequency of the model generating plausible but false information. The evidence acts as a factual constraint.5
3. Verifiable answers. RAG enables citation and source attribution. The user can trace the answer back to the original document, which builds trust and enables fact-checking.2
4. Domain-specific knowledge without retraining. Instead of fine-tuning or retraining a model on proprietary data (expensive and time-consuming), you can give it access to that data through retrieval. Update the knowledge base and the model immediately has access to the new information.4
When do we use it?
- When the LLM needs to answer questions about private or proprietary data (company policies, internal documentation, customer records)
- When factual accuracy matters and hallucination is unacceptable (legal, medical, financial contexts)
- When the required knowledge changes frequently and retraining is impractical
- When users need to verify answers against original sources
- When building documentation chatbots, enterprise search, or knowledge assistants
- When augmenting a general-purpose model with domain-specific expertise
Rule of thumb
If the answer to a question exists in a specific document or database and the LLM does not have that document in its training data, RAG is the right pattern. If the question requires general reasoning or creative generation rather than factual recall, RAG adds complexity without much benefit.
How can I think about it?
The open-book exam
RAG is an open-book exam for an LLM.
- In a closed-book exam (standard LLM), the student relies entirely on memory. If they studied the right material, they answer well. If not, they guess — and the guesses sound confident but may be wrong.
- In an open-book exam (RAG), the student can look up information in reference materials during the test. They still need to understand the question, find the right pages, and synthesise an answer — but the facts come from the book, not from memory.
- The retrieval step is flipping to the right chapter. The augmentation step is reading the relevant passages. The generation step is writing the answer in your own words based on what you read.
- The quality of the exam depends on both the student’s skill (the model) and the quality of the reference books (the knowledge base). A brilliant student with a bad textbook still struggles.
The research assistant
RAG is like having a research assistant who works alongside you.
- You ask a question: “What were the key findings of the 2025 climate report?”
- Your assistant goes to the library (retrieval), finds the report, and brings back the relevant sections
- They place the excerpts on your desk with the question at the top (augmentation)
- You read the excerpts and write a summary in your own words (generation)
- If someone challenges your summary, you can point to the original report as your source (citation and verifiability)
- Without the assistant, you would have to answer from memory — and if you never read the report, you would be guessing
The assistant does not write the summary. They ensure you have the right materials to write an accurate one. That is exactly what the retrieval step does for the LLM.
NotebookLM as a RAG system (click to expand)
Google’s NotebookLM is a consumer-facing RAG implementation. You upload documents (PDFs, notes, web pages) as your knowledge base. When you ask NotebookLM a question, it retrieves relevant passages from your uploaded documents and generates an answer grounded in that specific material — not from its general training data. It even provides citations showing which document and passage each claim comes from. This is RAG in action: your documents are the non-parametric knowledge, and the model’s language ability is the parametric knowledge.
Yiuno example: graph.json as a retrieval source (click to expand)
This knowledge system’s `graph.json` file is a structured knowledge base that could power a RAG system. It stores every concept, its relationships, prerequisites, and metadata in JSON format. An LLM-powered learning assistant could:

- Retrieve — search `graph.json` for concepts related to a user’s question
- Augment — load the relevant concept cards as context
- Generate — answer the question using the actual card content, not generic training data
The graph structure enables targeted retrieval: instead of searching every card, the system can traverse relationships (parent, children, prerequisites) to find the most relevant context.
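Relationship-based retrieval can be sketched in a few lines. The miniature graph below is an invented stand-in — the real `graph.json` schema may differ — but it shows the idea of following edges instead of scanning every card:

```python
import json

# A miniature stand-in for graph.json; the actual schema is an assumption here.
graph = json.loads("""
{
  "rag": {"parent": "llm-pipelines",
          "prerequisites": ["knowledge-graphs", "json"],
          "summary": "Retrieval-Augmented Generation"},
  "llm-pipelines": {"parent": null, "prerequisites": [],
                    "summary": "Chaining LLM stages"},
  "knowledge-graphs": {"parent": null, "prerequisites": [],
                       "summary": "Structured knowledge"},
  "json": {"parent": null, "prerequisites": [],
           "summary": "Data interchange format"}
}
""")

def related_concepts(concept_id, graph):
    # Targeted retrieval: follow the concept's edges (parent, prerequisites)
    # rather than searching the whole collection.
    node = graph[concept_id]
    neighbours = []
    if node["parent"]:
        neighbours.append(node["parent"])
    neighbours.extend(node["prerequisites"])
    return neighbours

print(related_concepts("rag", graph))
# → ['llm-pipelines', 'knowledge-graphs', 'json']
```

The cards returned by this traversal would then feed the augmentation step, exactly as retrieved documents do in a document-based RAG system.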
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| knowledge-graphs | The structured data sources that RAG systems can retrieve from | complete |
| json | The data format commonly used to store and exchange retrieved knowledge | complete |
| apis | The interfaces through which RAG systems query external data sources | complete |
| databases | The storage systems that serve as knowledge bases for retrieval | stub |
Some cards don't exist yet
A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain what RAG does and why it exists. What problem does it solve that a standard LLM cannot?
- Name the three stages of a RAG pipeline and describe what each one does.
- Distinguish between parametric knowledge and non-parametric knowledge. Why does this distinction matter for understanding when RAG is useful?
- Interpret this scenario: a RAG-powered documentation chatbot retrieves the correct policy document but generates an answer that contradicts the document. Where in the pipeline did the error occur, and what might have caused it?
- Connect RAG to the concept of knowledge graphs. How could a knowledge graph improve the retrieval step compared to a simple keyword search?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    LP[LLM Pipelines] --> PR[Prompt Routing]
    LP --> CC[Context Cascading]
    LP --> RAG[RAG]
    KG[Knowledge Graphs] -.->|prerequisite| RAG
    JSON[JSON] -.->|prerequisite| RAG
    style RAG fill:#4a9ede,color:#fff
```

Related concepts:
- apis — RAG systems use APIs to query external knowledge sources (search engines, databases, document stores)
- databases — the storage layer that holds the knowledge base a RAG system retrieves from
- machine-readable-formats — retrieved data must be in a format the system can parse and inject into the prompt
- context-cascading — the augmentation step in RAG is a dynamic form of context loading, assembling task-specific context at query time
Sources
Further reading
Resources
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) — The original RAG paper from Facebook AI Research that introduced the concept and the parametric/non-parametric distinction
- What is RAG? (AWS) — Clear, authoritative explanation from AWS with architecture diagrams and use cases
- RAG Explained: Complete Guide 2026 (AI Agents Plus) — Comprehensive 2026 guide covering architecture, implementation patterns, and common pitfalls
- RAG Architecture Explained for Engineers (AI Engineer Lab) — Production-focused guide covering vector databases, embedding strategies, and retrieval optimisation
- Retrieval-Augmented Generation for Beginners (ADevGuide) — Beginner-friendly walkthrough with step-by-step examples
Footnotes
1. AWS. (2026). What is RAG? Retrieval-Augmented Generation AI Explained. Amazon Web Services.
2. AI Agents Plus Editorial. (2026). RAG Explained: Complete Retrieval Augmented Generation Guide 2026. AI Agents Plus.
3. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W., Rocktaschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
4. AI Engineer Lab. (2026). RAG Architecture Explained for Engineers. AI Engineer Lab.
5. PE Collective. (2026). RAG Architecture: How to Build Retrieval-Augmented Generation Systems. PE Collective.