RAG Pipeline¶
Retrieval-Augmented Generation (RAG) supplements LLM generation with external knowledge retrieved at query time. It addresses hallucination on domain-specific questions and knowledge cutoff limitations without retraining the model.
Key Facts¶
- RAG = retrieve relevant documents + inject into prompt context + generate answer
- RAG for knowledge updates, fine-tuning for behavior/style changes - often combined in production
- Simple RAG works but has structural reliability issues - answers can change between runs and look plausible while being wrong
- Signal-to-noise ratio of retrieved context directly determines output quality
- Even GPT-4 with uploaded PDFs makes RAG-type errors - this is a structural problem, not a framework problem
Pipeline Architecture¶
Indexing Phase (Offline)¶
- Load documents: PDFs, web pages, databases, APIs
- Split into chunks: recursive character, sentence-based, or semantic chunking
- Generate embeddings: OpenAI, BGE, E5, Cohere, or local models
- Store in vector database: Chroma, Pinecone, Qdrant, Weaviate, FAISS, pgvector
Query Phase (Online)¶
- User query arrives
- Embed query using same embedding model as indexing
- Retrieve top-K similar chunks from vector DB (cosine similarity)
- Optionally rerank with cross-encoder
- Augment prompt: retrieved chunks + user query + system instructions
- Generate answer via LLM
- Optionally cite sources
LangChain RAG Chain¶
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
# Index
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Query
llm = ChatOpenAI(model="gpt-4")
chain = create_retrieval_chain(
retriever,
create_stuff_documents_chain(llm, prompt)
)
result = chain.invoke({"input": "What is the company revenue?"})
print(result["answer"])
print(result["context"]) # retrieved documents
The Hallucination Problem¶
Root cause experiment: Use a niche domain where the LLM wasn't trained. Without context, the model confidently gives wrong answers. With conflicting sources, answers vary between runs. With a single authoritative source clearly marked, answers become correct and consistent.
Key insight: LLMs trust whatever input they receive. If the model doesn't know the domain, it can't distinguish authoritative from forum-quality sources unless given structural hints (headers indicating source type).
Improvement Strategies¶
Better Retrieval¶
- Hybrid search: combine vector similarity + keyword/BM25 search. Reciprocal Rank Fusion (RRF) to merge results without tuning alpha.
- Query expansion: LLM generates multiple search queries from user question
- HyDE: LLM generates hypothetical answer, embed that for retrieval
- Reranking: after initial retrieval, cross-encoder reranks by fine-grained relevance
- Metadata filtering: filter by date, source, category before similarity search
Better Generation¶
- Source attribution: include chunk source references in answers
- Faithfulness check: verify answer is grounded in retrieved context
- Structured prompting: "Only use provided context. Say 'I don't know' if insufficient."
- Map-reduce for long documents: ask same question per chunk, then synthesize partial answers
Evaluation Metrics¶
| Metric | What It Measures |
|---|---|
| Context Precision | Fraction of retrieved chunks that are relevant |
| Context Recall | Fraction of relevant chunks that were retrieved |
| Faithfulness | Is the answer grounded in context (no hallucination)? |
| Answer Relevancy | Does the answer address the user's question? |
Frameworks: RAGAS (automated RAG evaluation), DeepEval, LangSmith
Production Patterns¶
Router + Specialized Agents¶
LLM router classifies question type, routes to specialized agents with curated knowledge bases for high-accuracy categories. Generic RAG handles the rest (hybrid approach).
Knowledge Base Without Vector Search¶
For structured data or limited question types: prepare data tables/documents manually, load directly into prompt. More reliable than vector search for known categories.
Multi-Index RAG¶
Different document types in different indexes with different chunking strategies. Route queries to appropriate index based on question type.
Gotchas¶
- Simple RAG produces answers that change between runs and are often wrong - don't deploy without evaluation
- Vector search can miss obviously present text that keyword search finds easily (cosine similarity failures)
- Embedding the same text produces slightly different vectors across API calls (non-deterministic)
- "Garbage in, garbage out" - if retrieval returns irrelevant chunks, the LLM will hallucinate from them confidently
- Always verify documents were actually indexed (FlowWise: click "Upsert" button) - without this, RAG returns nothing
- RAG doesn't eliminate hallucination, it reduces it - always validate critical outputs
See Also¶
- chunking strategies - How to split documents for optimal retrieval
- vector databases - Storage and search infrastructure
- embeddings - How text becomes searchable vectors
- production patterns - Advanced RAG patterns for production
- llmops - Evaluating and monitoring RAG quality