Chunking Strategies¶
Chunking is the process of splitting documents into smaller pieces for embedding and retrieval in RAG systems. Chunk quality directly determines retrieval quality, which determines answer quality.
Key Facts¶
- Chunk size is the most impactful RAG parameter - too small loses context, too large dilutes relevance
- Chunk overlap prevents cutting important context at boundaries
- Semantic chunking (split on natural boundaries) preserves meaning better than fixed-size splits
- Documents should be cleaned before chunking: remove headers/footers, page numbers, watermarks
- Tables should be extracted as structured data, not as text chunks
Chunk Size Guidelines¶
| Chunk Size | Use Case | Tradeoff |
|---|---|---|
| 256-512 chars | Precise Q&A, specific facts | More chunks, more retrieval noise |
| 512-1000 chars | General purpose, balanced | Good default for most use cases |
| 1000-2000 chars | Summarization, broader context | Fewer chunks, may dilute relevance |
| Full document | Single-document Q&A | Only with large context windows |
Patterns¶
Recursive Character Text Splitter¶
Most common approach. Recursively splits by separators until chunks are within size limit:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_documents(documents)
Tries \n\n first (paragraph breaks), falls back to \n, then ., then space, then character-level as last resort.
Token-Based Splitting¶
Split by token count rather than character count. More accurate for LLM context management since LLMs have token limits:
from langchain.text_splitter import TokenTextSplitter
splitter = TokenTextSplitter(
chunk_size=500, # tokens, not chars
chunk_overlap=50
)
Semantic Chunking¶
Split on natural boundaries (paragraphs, sections, headers) rather than arbitrary counts. Better preserves meaning units.
Hierarchical Chunking¶
Parent chunks (full sections) + child chunks (paragraphs). Retrieve child for precision, return parent for context. Gives both precise matching and sufficient surrounding context.
Chunk Overlap¶
- Purpose: prevent cutting important context at chunk boundaries
- Typical: 10-20% of chunk size (e.g., 100-200 chars for 1000-char chunks)
- Too much: duplicate information, increased storage and cost
- Too little: lost context at boundaries
Document Loaders¶
| Loader | Format | Notes |
|---|---|---|
| PyPDF / PyPDF2 | Simple. Struggles with tables and complex layouts | |
| LlamaParse | PDF, DOCX | Best PDF extraction - handles tables, images, complex layouts. Cloud service |
| Unstructured | PDF, DOCX, HTML, images | Multi-format, extracts structured elements |
| BeautifulSoup | HTML | Web scraping, HTML parsing |
| Firecrawl | Web pages | Crawls websites, converts to clean markdown |
| Cheerio | HTML (Node.js) | Web scraping for FlowWise/Node.js |
Data Preparation Checklist¶
- Clean documents: remove headers/footers, page numbers, watermarks
- Preserve structure: keep headings, table formatting, list structures
- Add metadata: source file, page number, section title, date
- Handle tables separately: extract as structured data, not text chunks
- Test chunk quality: manually review chunks - do they contain meaningful units?
- Verify indexing: confirm documents were actually indexed in vector store
Gotchas¶
- The #1 mistake in FlowWise: not clicking "Upsert" after connecting documents - RAG retrieves nothing
- PDF extraction quality varies wildly - always inspect extracted text before chunking
- Tables in PDFs often turn into garbage text with simple extractors - use LlamaParse or manual extraction
- Chunk size should be tuned per use case - 1000 chars is just a starting point
- Very small chunks (< 200 chars) often lack enough context for meaningful embedding
- Overlapping chunks increase storage costs but significantly reduce boundary-related retrieval failures
Alternative: Character Text Splitter¶
Simpler than RecursiveCharacterTextSplitter - splits on a single separator:
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
separator="\n", # single separator (not a list)
chunk_size=1000,
chunk_overlap=200
)
chunks = splitter.split_documents(docs)
When to use: when you know the document structure uses consistent delimiters. Recursive splitter is better for mixed-format documents.
Practical Splitting Pipeline¶
Real-world document processing often requires multiple passes:
from langchain_community.document_loaders import PyPDFLoader
# 1. Load
loader = PyPDFLoader("handbook.pdf")
pages = loader.load() # one Document per page
# 2. Split with metadata preservation
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(pages)
# Each chunk inherits page metadata (source, page number)
print(chunks[0].metadata) # {'source': 'handbook.pdf', 'page': 0}
# 3. Verify before indexing
print(f"Total chunks: {len(chunks)}")
print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks)/len(chunks):.0f} chars")
Indexing Verification Checklist¶
After indexing, always verify: 1. Document count matches expectations: vectorstore._collection.count() 2. Sample retrieval works: vectorstore.similarity_search("test query") 3. Metadata is preserved: check result.metadata on retrieved docs 4. No empty chunks: filter chunks with len(chunk.page_content.strip()) > 50
See Also¶
- rag pipeline - How chunks flow through the RAG system
- vector databases - Where chunks are stored and searched
- embeddings - How chunks become searchable vectors
- production patterns - Knowledge base patterns that bypass chunking