Chapter 5: Retrieval-Augmented Generation (RAG): Beyond Model Knowledge

Introduction to Retrieval-Augmented Generation (RAG)

Welcome back, future Applied AI Engineer! In the previous chapters, we laid a solid foundation in Python, system thinking, and started interacting with Large Language Models (LLMs) through APIs and prompt engineering. We learned how to guide LLMs with clever prompts and even give them tools to extend their capabilities. But what if an LLM doesn’t know about the latest company policies, your personal notes, or proprietary product documentation? That’s where its “knowledge cut-off” becomes a limitation.

This chapter introduces you to a powerful technique called Retrieval-Augmented Generation (RAG). RAG is a game-changer because it allows LLMs to access, understand, and synthesize information from external, up-to-date, and domain-specific knowledge sources beyond their initial training data. Imagine giving your LLM a dynamic, ever-expanding library to consult before answering your questions – that’s RAG in a nutshell!

By the end of this chapter, you’ll not only understand the core concepts behind RAG but also implement a basic RAG system from scratch. We’ll cover data preparation, embedding models, vector databases, and how to orchestrate these components to build an intelligent system that can answer questions based on custom data. Get ready to unlock a new level of power and accuracy for your AI applications!

Core Concepts of RAG

At its heart, RAG combines the strengths of information retrieval systems with the generative power of LLMs. It addresses two major limitations of standalone LLMs:

Hallucination: LLMs can sometimes confidently generate factually incorrect information.
Lack of Specificity: They can’t access real-time or private, domain-specific data.

RAG works by first retrieving relevant information from a knowledge base and then augmenting the LLM’s prompt with this information before generating a response. Let’s break down the key components and the workflow.

The RAG Workflow: A Journey Through Knowledge

Think of RAG as an intelligent librarian for your LLM. When you ask a question, the librarian (the retrieval system) first searches through all available books (your knowledge base) to find the most relevant passages. Then, it hands those passages to a brilliant writer (the LLM) and says, “Here’s the context, now answer the user’s question using this information.”

Here’s a visual representation of the RAG process:

graph TD UserQuery[User Query] --> EmbedQuery[Embed Query] EmbedQuery --> VectorDB[Vector Database] VectorDB --> RetrieveChunks[Retrieve Relevant Chunks] RetrieveChunks --> AugmentPrompt[Augment Prompt Chunks] AugmentPrompt --> LLM[Large Language Model] LLM --> GeneratedResponse[Generated Response] subgraph Knowledge Base Preparation SourceData[Source Data] --> Chunking[Chunking] Chunking --> EmbedChunks[Embed Chunks] EmbedChunks --> StoreInVectorDB[Store in Vector Database] end SourceData -.-> KnowledgeBasePreparation

Figure 5.1: Simplified RAG Workflow Diagram

Let’s delve into each step:

1. Knowledge Base Preparation (Offline Process)

Before your RAG system can answer questions, it needs a knowledge base. This involves:

Source Data: This is your raw information – could be PDF documents, text files, web pages, database records, etc.
Chunking: LLMs have a limited “context window” (the amount of text they can process at once). Large documents need to be broken down into smaller, manageable chunks. The size of these chunks is crucial: too small, and you lose context; too large, and you might exceed the LLM’s context window or retrieve irrelevant information. A common practice is to chunk by paragraphs or fixed token counts with some overlap.
Embedding Models: An embedding model converts text (your chunks and later, user queries) into numerical vectors. These vectors are high-dimensional representations where text with similar meanings are located closer together in the vector space. This is how a computer “understands” semantic similarity.
- Why it matters: Without embeddings, a simple keyword search might miss relevant information if the wording is slightly different. Embeddings capture meaning.
Vector Databases (Vector Stores): Once your chunks are embedded, these vectors (along with their original text chunks) are stored in a specialized database optimized for fast similarity searches – a vector database. Examples include ChromaDB, Pinecone, Qdrant, Weaviate, or even local options like FAISS.
- Why it matters: This allows the system to quickly find the most semantically similar chunks to a user’s query.

2. Query Processing & Generation (Online Process)

When a user asks a question:

Embed Query: The user’s question is also converted into a vector using the same embedding model used for the chunks. Consistency is key!
Retrieve Relevant Chunks: The query’s embedding is used to perform a similarity search in the vector database. The database returns the top-K (e.g., top 3 or 5) most similar text chunks.

Augment Prompt: These retrieved chunks are then inserted into a specially crafted prompt, alongside the original user query, before being sent to the LLM. This makes the LLM aware of the specific context.

Example Prompt Structure:

"You are an expert assistant. Use the following context to answer the question.
If you don't know the answer, state that you don't know, do not make up an answer.

Context:
[Retrieved Document Chunk 1]
[Retrieved Document Chunk 2]
[Retrieved Document Chunk 3]

Question: [User's Question]

Answer:"

LLM Generation: The LLM receives this augmented prompt and generates a response based only on the provided context and its general knowledge. The instruction “If you don’t know the answer, state that you don’t know” is crucial for preventing hallucinations.

Why RAG is a Modern Best Practice (2026)

RAG has become a cornerstone of practical LLM applications for several reasons:

Accuracy & Reliability: Reduces hallucinations by grounding responses in factual, verifiable data.
Up-to-Date Information: Allows LLMs to use the latest information without requiring expensive and frequent model retraining.
Domain-Specific Knowledge: Enables LLMs to operate effectively in specialized domains (e.g., legal, medical, internal company knowledge) where pre-trained models might lack specific expertise.
Reduced Training Costs: Avoids the need to fine-tune or pre-train LLMs on custom datasets, which is often prohibitively expensive and time-consuming.
Explainability: Because the LLM’s answer is based on retrieved sources, you can often trace back why it gave a particular answer by showing the user the source chunks. This is vital for trust and debugging.
Security & Privacy: You control the knowledge base. Sensitive data can be stored securely and only exposed to the LLM via retrieval, rather than being embedded into the model itself.

Step-by-Step Implementation of a Basic RAG System

Let’s get our hands dirty and build a simple RAG system using Python. We’ll use popular libraries like langchain for orchestration, chromadb as our local vector database, and openai for the LLM and embedding models.

Prerequisites: Before we start, ensure you have Python 3.9+ installed and an OpenAI API key. If you don’t have one, you can get it from OpenAI’s website.

Step 1: Set Up Your Environment

First, let’s create a new project directory and install the necessary libraries.

Create a project folder:
```
mkdir my_rag_app
cd my_rag_app
```
Create a virtual environment (best practice!):
```
python -m venv .venv
```
Activate the virtual environment:
- On macOS/Linux:
```
source .venv/bin/activate
```
- On Windows:
```
.venv\Scripts\activate
```
Install dependencies: As of January 2026, these are stable and widely used versions for RAG development:
```
pip install langchain==0.1.0 openai==1.10.0 chromadb==0.4.22 pypdf==4.0.0 python-dotenv==1.0.1
```
- langchain: The framework for building LLM applications. Version 0.1.0 is a recent major update simplifying API.
- openai: The official client library to interact with OpenAI’s models.
- chromadb: A lightweight, open-source vector database perfect for local development.
- pypdf: A library to parse PDF documents.
- python-dotenv: To safely load environment variables like your API key.
Set up your OpenAI API Key: Create a file named .env in your my_rag_app directory and add your API key:
```
OPENAI_API_KEY="sk-YOUR_OPENAI_API_KEY_HERE"
```
Replace "sk-YOUR_OPENAI_API_KEY_HERE" with your actual key. Remember to keep this file out of version control (e.g., add .env to your .gitignore).

Step 2: Prepare Your Knowledge Base

We’ll use a simple PDF document as our knowledge source. For this example, let’s imagine we have a short PDF about “Introduction to AI Agents”. If you don’t have one, you can quickly create a dummy PDF with some text using any word processor or online tool. Save it as ai_agents_intro.pdf in your my_rag_app directory.

Here’s some example text you could put in ai_agents_intro.pdf:

Title: An Introduction to AI Agents

AI agents are autonomous software entities designed to perceive their environment, make decisions, and take actions to achieve specific goals. They represent a significant evolution in AI, moving beyond simple task automation to more complex, goal-oriented behavior.

Key characteristics of AI agents include:
1. Autonomy: Agents can operate independently without constant human intervention.
2. Reactivity: They respond to changes in their environment.
3. Pro-activeness: Agents can initiate actions to achieve goals.
4. Social Ability: They can interact with other agents or humans.

Modern AI agent frameworks often leverage Large Language Models (LLMs) as their "brain." These LLMs provide the reasoning capabilities, allowing agents to understand natural language instructions, plan tasks, and generate responses. Tools, function calling, and memory are crucial components that extend an agent's capabilities.

Tool use allows agents to interact with external systems, such as search engines, databases, or APIs. Function calling is a mechanism for LLMs to invoke specific functions based on their understanding of a user's request. Memory enables agents to retain context and learn from past interactions, making them more effective over time.

The future of AI is increasingly agentic, with applications ranging from personal assistants that manage your schedule and emails to complex systems that automate business processes and scientific discovery.

Now, let’s write the Python code to process this PDF and set up our vector database. Create a file named rag_setup.py:

# rag_setup.py
import os
from dotenv import load_dotenv

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

print("Loading environment variables...")
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found in environment variables.")
print("Environment variables loaded.")

# --- 1. Load Data ---
print("Loading PDF document...")
# This loader handles PDF files. Langchain has many loaders for different data types.
loader = PyPDFLoader("ai_agents_intro.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages from the PDF.")

# --- 2. Chunk Data ---
print("Splitting documents into chunks...")
# RecursiveCharacterTextSplitter tries to split by paragraphs, then sentences, then words.
# We aim for chunks of about 1000 characters with a 200-character overlap to maintain context.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

# --- 3. Create Embeddings & Store in Vector Database ---
print("Creating embeddings and storing in ChromaDB...")
# We use OpenAI's text-embedding-ada-002 model for embeddings.
# This model is cost-effective and performs well.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Initialize ChromaDB. We'll store our embeddings and chunks here.
# persist_directory tells Chroma to save the database to disk, so we don't re-process every time.
# If you don't specify it, it will create an in-memory database that gets wiped after the script runs.
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db" # The directory where ChromaDB will store its data
)
print("ChromaDB initialized and populated with document embeddings.")
print("Knowledge base setup complete!")

# Optional: You can explicitly persist the client if you want to ensure writes are flushed
# Although from_documents usually handles this.
# vector_db.persist()
# print("Vector database persisted to disk.")

Run this script from your terminal:

python rag_setup.py

You should see output indicating the PDF was loaded, chunked, and stored in ChromaDB. A new directory named chroma_db will be created in your project folder. This is your persistent vector database!

Step 3: Implement the RAG Query Logic

Now that our knowledge base is ready, let’s create a script to query it and generate answers using an LLM. Create a file named rag_query.py:

# rag_query.py
import os
from dotenv import load_dotenv

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

print("Loading environment variables...")
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found in environment variables.")
print("Environment variables loaded.")

# --- 1. Load our existing Vector Database ---
print("Loading ChromaDB from disk...")
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
print("ChromaDB loaded.")

# --- 2. Create a Retriever ---
# A retriever is an interface that returns documents given an unstructured query.
# We configure it to return the top 3 most relevant documents (k=3).
retriever = vector_db.as_retriever(search_kwargs={"k": 3})
print("Retriever created.")

# --- 3. Set up the LLM ---
# We'll use OpenAI's gpt-3.5-turbo for generation.
# As of Jan 2026, gpt-3.5-turbo-0125 is a robust and cost-effective choice.
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.2) # Lower temperature for more factual responses
print("LLM initialized.")

# --- 4. Define the RAG Prompt Template ---
# This template instructs the LLM on how to use the retrieved context.
template = """You are an expert assistant. Use the following context to answer the question.
If you don't know the answer, state that you don't know, do not make up an answer.

Context:
{context}

Question: {question}

Answer:"""
prompt = ChatPromptTemplate.from_template(template)
print("Prompt template created.")

# --- 5. Build the RAG Chain ---
# Langchain's Runnable interface allows us to chain operations together cleanly.
# This chain orchestrates the RAG process:
# 1. Takes the user's question.
# 2. Passes it to the retriever to get context documents.
# 3. Formats the context and question into the prompt template.
# 4. Sends the prompt to the LLM.
# 5. Parses the LLM's output.

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("RAG chain constructed.")

# --- 6. Ask Questions! ---
print("\n--- Let's ask some questions! ---")

questions = [
    "What are the key characteristics of AI agents?",
    "What role do LLMs play in modern AI agents?",
    "What is the capital of France?", # This question is outside our document's scope
    "How does tool use benefit AI agents?",
    "What is the main topic of this document?"
]

for i, q in enumerate(questions):
    print(f"\nQuestion {i+1}: {q}")
    response = rag_chain.invoke(q)
    print(f"Answer: {response}")
    print("-" * 50)

print("\nBasic RAG system demonstration complete!")

Run this script:

python rag_query.py

Observe the output. For questions related to the ai_agents_intro.pdf document, the LLM should provide accurate answers grounded in the text. For the question about the capital of France, it should ideally state that it doesn’t know, demonstrating the “guardrail” we put in the prompt.

Code Explanation:

rag_setup.py:
- PyPDFLoader: Reads text from a PDF.
- RecursiveCharacterTextSplitter: Divides the document into chunks, prioritizing semantic boundaries. chunk_size and chunk_overlap are critical parameters for RAG performance.
- OpenAIEmbeddings: Uses OpenAI’s text-embedding-ada-002 model to convert text chunks into numerical vectors. This is a crucial step for semantic search.
- Chroma.from_documents(...): Creates and populates a ChromaDB instance with your text chunks and their corresponding embeddings. persist_directory ensures your knowledge base is saved.
rag_query.py:
- Chroma(...): Loads the previously created ChromaDB from disk.
- vector_db.as_retriever(search_kwargs={"k": 3}): Converts the vector database into a “retriever” component. When invoked, it will query the DB and return the top k most similar documents.
- ChatOpenAI(...): Initializes the LLM. We use gpt-3.5-turbo-0125 with a low temperature for more focused, less creative responses, which is generally preferred for RAG.
- ChatPromptTemplate.from_template(template): Defines the structure of the prompt sent to the LLM. Notice the {context} and {question} placeholders.
- rag_chain = ...: This is where langchain shines. We’re building a “chain” that defines the flow:
  - {"context": retriever, "question": RunnablePassthrough()}: This dictionary defines the inputs for the next step (the prompt).
    - "context": retriever: Takes the user’s question, passes it to the retriever, and the retrieved documents become the context.
    - "question": RunnablePassthrough(): Simply passes the original user question through as the question.
  - | prompt: Takes the context and question and formats them according to our prompt template.
  - | llm: Sends the formatted prompt to the llm for generation.
  - | StrOutputParser(): Extracts the string content from the LLM’s response.
- rag_chain.invoke(q): Executes the entire chain with the given question q.

Mini-Challenge: Enhance Your RAG System

You’ve built a foundational RAG system! Now, let’s make it a bit more robust.

Challenge: Modify the rag_setup.py script to include multiple PDF documents.

Create another dummy PDF, say company_policy.pdf, with some text about a fictional company policy (e.g., “Our company has a remote-first policy. Employees are encouraged to work from home unless specific team collaboration requires in-office presence.”).
Modify rag_setup.py to load both ai_agents_intro.pdf and company_policy.pdf.
Ensure all documents are chunked and stored in the same chroma_db vector store.
Run rag_setup.py again.
Then, modify rag_query.py to ask questions that require knowledge from both documents (e.g., “What are the characteristics of AI agents and what is our company’s remote work policy?”). Observe how the RAG system handles questions spanning multiple source documents.

Hint: PyPDFLoader can take a list of file paths. Or, you can loop through a list of file paths and loader.load() each, then extend your documents list. Remember to re-run rag_setup.py to rebuild your vector database with the new documents.

What to Observe/Learn:

How easy it is to scale your knowledge base by adding more documents.
The effectiveness of the retriever in pulling relevant information from different sources based on a single query.
The LLM’s ability to synthesize information from various retrieved chunks to answer complex questions.

Common Pitfalls & Troubleshooting in RAG

While powerful, RAG systems aren’t magic. Here are some common issues you might encounter and how to troubleshoot them:

Poor Retrieval Quality (Irrelevant Chunks):
- Problem: The LLM receives irrelevant or insufficient context, leading to generic, incorrect, or “I don’t know” answers even when the information exists.
- Causes:
  - Bad Chunking Strategy: Chunks are too small (losing context) or too large (containing too much noise).
  - Suboptimal Embedding Model: The embedding model doesn’t accurately capture the semantic meaning of your domain-specific text.
  - Insufficient k for Retriever: The retriever isn’t fetching enough documents (k value is too low) to provide comprehensive context.
  - Query-Document Mismatch: The user’s query is phrased very differently from how the information is presented in the documents.
- Troubleshooting:
  - Experiment with chunk_size and chunk_overlap: There’s no one-size-fits-all. Test different values.
  - Evaluate Embedding Models: While text-embedding-ada-002 is good, for highly specialized domains, consider fine-tuned or domain-specific embedding models.
  - Adjust k in retriever.as_retriever(search_kwargs={"k": X}): Increase k to retrieve more context. Be mindful of LLM context window limits.
  - Inspect Retrieved Chunks: Add print statements in your rag_query.py script to see what chunks are actually being retrieved before they hit the LLM. This is the most crucial debugging step.
  - Query Rewriting: For complex queries, you might pre-process the user’s question with another LLM call to rephrase it for better retrieval.
LLM Ignoring Context (“Disregard”):
- Problem: The LLM provides an answer based on its internal knowledge, even when contradicting or ignoring the provided context.
- Causes:
  - Weak Prompt Instructions: The prompt doesn’t clearly instruct the LLM to only use the provided context.
  - Strong LLM Priors: The LLM’s pre-trained knowledge on a topic is very strong, and it defaults to that over the new context.
  - Conflicting Information: The retrieved context might contain conflicting information, confusing the LLM.
- Troubleshooting:
  - Refine Prompt Template: Emphasize “ONLY use the provided context,” “If you don’t know, say you don’t know,” and place the context clearly at the beginning.
  - Adjust LLM Temperature: Lowering the temperature (e.g., to 0.1 or 0.2) can make the LLM more deterministic and less prone to “creativity” or hallucination.
  - Context Quality: Ensure the retrieved chunks are coherent and don’t contain contradictory information.
Cost and Latency Issues:
- Problem: Your RAG system is slow or expensive due to numerous API calls.
- Causes:
  - Too Many k Chunks: Retrieving and sending too many chunks to the LLM increases token usage and latency.
  - Expensive Embedding/LLM Models: Using higher-end models unnecessarily.
  - Inefficient Vector Database: Slow retrieval times from your vector store.
- Troubleshooting:
  - Optimize k: Find the smallest k that provides good performance.
  - Consider Cheaper Models: For embeddings, text-embedding-ada-002 is usually good. For generation, gpt-3.5-turbo is generally more cost-effective than gpt-4 for RAG.
  - Batching Embeddings: When indexing, send multiple chunks to the embedding model in a single API call if the library supports it.
  - Monitor & Cache: Implement caching for frequently asked questions or retrieved documents.

Summary

Congratulations! You’ve successfully navigated the world of Retrieval-Augmented Generation. Here’s a quick recap of what we covered:

RAG’s Purpose: It addresses LLM limitations like hallucination and lack of real-time/domain-specific knowledge by providing external context.
The RAG Workflow: Involves offline knowledge base preparation (loading, chunking, embedding, storing in a vector DB) and online query processing (embedding query, retrieving chunks, augmenting prompt, LLM generation).
Key Components: Source data, chunking, embedding models, vector databases, and LLMs.
Practical Implementation: You set up a RAG system using langchain, chromadb, and openai, demonstrating how to load documents, create a vector store, and query it effectively.
Common Pitfalls: We discussed issues like poor retrieval, LLM disregard for context, and performance concerns, along with strategies to troubleshoot them.

RAG is a foundational technique for building robust and reliable AI applications. As you progress, you’ll encounter more advanced RAG patterns, such as query rewriting, re-ranking retrieved documents, and multi-hop retrieval, but the core principles remain the same.

What’s Next? Having mastered RAG, you’ve equipped your LLMs with external knowledge. But what if your LLM needs to do things? In the next chapter, we’ll dive into Memory and State Management, exploring how AI agents can retain conversation history and learn from past interactions, making them truly conversational and capable of multi-turn reasoning. This is another crucial step towards building truly intelligent and agentic AI systems!

References

LangChain Documentation: The primary resource for building LLM applications and RAG chains.
- https://www.langchain.com/docs/
- https://www.langchain.com/docs/concepts/#retrieval-augmented-generation-rag
ChromaDB Documentation: Learn more about this popular open-source vector database.
- https://docs.trychroma.com/
OpenAI Embeddings Documentation: Details on OpenAI’s embedding models and their usage.
- https://platform.openai.com/docs/guides/embeddings
Retrieval Augmented Generation (RAG): From Theory to Practice (IBM): A good overview of RAG concepts.
- https://www.ibm.com/topics/retrieval-augmented-generation

This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.