RAG Pipeline¶

★★★★★ Intermediate

Retrieval-Augmented Generation (RAG) supplements LLM generation with external knowledge retrieved at query time. It addresses hallucination on domain-specific questions and knowledge cutoff limitations without retraining the model.

Key Facts¶

RAG = retrieve relevant documents + inject into prompt context + generate answer
RAG for knowledge updates, fine-tuning for behavior/style changes - often combined in production
Simple RAG works but has structural reliability issues - answers can change between runs and look plausible while being wrong
Signal-to-noise ratio of retrieved context directly determines output quality
Even GPT-4 with uploaded PDFs makes RAG-type errors - this is a structural problem, not a framework problem

Pipeline Architecture¶

Indexing Phase (Offline)¶

Load documents: PDFs, web pages, databases, APIs
Split into chunks: recursive character, sentence-based, or semantic chunking
Generate embeddings: OpenAI, BGE, E5, Cohere, or local models
Store in vector database: Chroma, Pinecone, Qdrant, Weaviate, FAISS, pgvector

Query Phase (Online)¶

User query arrives
Embed query using same embedding model as indexing
Retrieve top-K similar chunks from vector DB (cosine similarity)
Optionally rerank with cross-encoder
Augment prompt: retrieved chunks + user query + system instructions
Generate answer via LLM
Optionally cite sources

LangChain RAG Chain¶

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Index
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Query
llm = ChatOpenAI(model="gpt-4")
chain = create_retrieval_chain(
    retriever,
    create_stuff_documents_chain(llm, prompt)
)
result = chain.invoke({"input": "What is the company revenue?"})
print(result["answer"])
print(result["context"])  # retrieved documents

The Hallucination Problem¶

Root cause experiment: Use a niche domain where the LLM wasn't trained. Without context, the model confidently gives wrong answers. With conflicting sources, answers vary between runs. With a single authoritative source clearly marked, answers become correct and consistent.

Key insight: LLMs trust whatever input they receive. If the model doesn't know the domain, it can't distinguish authoritative from forum-quality sources unless given structural hints (headers indicating source type).

Improvement Strategies¶

Better Retrieval¶

Hybrid search: combine vector similarity + keyword/BM25 search. Reciprocal Rank Fusion (RRF) to merge results without tuning alpha.
Query expansion: LLM generates multiple search queries from user question
HyDE: LLM generates hypothetical answer, embed that for retrieval
Reranking: after initial retrieval, cross-encoder reranks by fine-grained relevance
Metadata filtering: filter by date, source, category before similarity search

Better Generation¶

Source attribution: include chunk source references in answers
Faithfulness check: verify answer is grounded in retrieved context
Structured prompting: "Only use provided context. Say 'I don't know' if insufficient."
Map-reduce for long documents: ask same question per chunk, then synthesize partial answers

Evaluation Metrics¶

Metric	What It Measures
Context Precision	Fraction of retrieved chunks that are relevant
Context Recall	Fraction of relevant chunks that were retrieved
Faithfulness	Is the answer grounded in context (no hallucination)?
Answer Relevancy	Does the answer address the user's question?

Frameworks: RAGAS (automated RAG evaluation), DeepEval, LangSmith

Production Patterns¶

Router + Specialized Agents¶

LLM router classifies question type, routes to specialized agents with curated knowledge bases for high-accuracy categories. Generic RAG handles the rest (hybrid approach).

Knowledge Base Without Vector Search¶

For structured data or limited question types: prepare data tables/documents manually, load directly into prompt. More reliable than vector search for known categories.

DIY RAG (Keyword Matching)¶

The simplest possible RAG - string matching to inject relevant context:

# Load documents into a dict: {"Lancaster": "Avery Lancaster, CEO...", "HomeElm": "..."}
context = {}
for f in os.listdir("knowledge_base/employees"):
    name = f.replace(".md", "").split("_")[-1]
    context[name] = open(f"knowledge_base/employees/{f}").read()

def get_relevant_context(message: str) -> list[str]:
    return [details for title, details in context.items()
            if title.lower() in message.lower()]

def add_context(message: str) -> str:
    relevant = get_relevant_context(message)
    if relevant:
        context_str = "\n\n".join(relevant)
        return f"{message}\n\nThe following context may help:\n{context_str}"
    return message

Why this fails in production: - Case-sensitive: "lancaster" vs "Lancaster" breaks matching - Requires exact keyword: "Avery" alone won't match if keyed by surname - No semantic understanding: "Who founded the company?" returns nothing - Fragile at scale: adding new document types requires new matching logic

This approach demonstrates why vector search exists - it replaces fragile string matching with semantic similarity. Use it only for prototyping or when the question space is fully known and small.

Strong system prompts reduce hallucination in keyword RAG:

system = """You are an expert at answering questions about InsureElm.
Give brief, accurate answers.
If you don't know the answer, say so.
Do NOT make anything up if you lack relevant context."""

Multi-Index RAG¶

Different document types in different indexes with different chunking strategies. Route queries to appropriate index based on question type.

LangChain RAG Abstractions¶

LangChain provides three key abstractions that compose into a conversational RAG pipeline:

LLM Abstraction¶

Wraps any language model (OpenAI, Anthropic, local) behind a unified interface:

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4", temperature=0)

Retriever Abstraction¶

Interface for anything that returns documents given a query. Vector stores expose .as_retriever():

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)
# retriever.invoke("query") returns list of Document objects

Memory Abstraction¶

Maintains conversation history for multi-turn RAG chat:

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# ConversationalRetrievalChain combines all three
from langchain.chains import ConversationalRetrievalChain
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)
result = chain.invoke({"question": "What is the revenue?"})

Document Indexing Pipeline¶

Full document-to-answer pipeline in LangChain:

Load: DirectoryLoader, PyPDFLoader, WebBaseLoader
Split: RecursiveCharacterTextSplitter, CharacterTextSplitter
Embed: OpenAIEmbeddings, OllamaEmbeddings
Store: Chroma.from_documents(), FAISS.from_documents()
Retrieve + Generate: chain with retriever + LLM

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load all PDFs from directory
loader = DirectoryLoader('./docs/', glob="**/*.pdf")
docs = loader.load()

# Split with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(docs)

# Index
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# Manage: add, update, delete documents
vectorstore.add_documents(new_chunks)
vectorstore.delete(ids=["doc_id_1", "doc_id_2"])

Vector Store Document Management¶

After initial indexing, vector stores need ongoing management: - Add: vectorstore.add_documents(new_docs) - incremental indexing - Delete: vectorstore.delete(ids=[...]) - remove outdated docs - Update: delete old version + add new version (no in-place update) - Inspect: vectorstore.get() to view stored documents and metadata - Similarity search with score: vectorstore.similarity_search_with_score(query) returns (doc, score) tuples

Gotchas¶

Simple RAG produces answers that change between runs and are often wrong - don't deploy without evaluation
Vector search can miss obviously present text that keyword search finds easily (cosine similarity failures)
Embedding the same text produces slightly different vectors across API calls (non-deterministic)
"Garbage in, garbage out" - if retrieval returns irrelevant chunks, the LLM will hallucinate from them confidently
Always verify documents were actually indexed (FlowWise: click "Upsert" button) - without this, RAG returns nothing
RAG doesn't eliminate hallucination, it reduces it - always validate critical outputs