Chunking Strategies¶

★★★★★ Intermediate

Chunking is the process of splitting documents into smaller pieces for embedding and retrieval in RAG systems. Chunk quality directly determines retrieval quality, which determines answer quality.

Key Facts¶

Chunk size is the most impactful RAG parameter - too small loses context, too large dilutes relevance
Chunk overlap prevents cutting important context at boundaries
Semantic chunking (split on natural boundaries) preserves meaning better than fixed-size splits
Documents should be cleaned before chunking: remove headers/footers, page numbers, watermarks
Tables should be extracted as structured data, not as text chunks

Chunk Size Guidelines¶

Chunk Size	Use Case	Tradeoff
256-512 chars	Precise Q&A, specific facts	More chunks, more retrieval noise
512-1000 chars	General purpose, balanced	Good default for most use cases
1000-2000 chars	Summarization, broader context	Fewer chunks, may dilute relevance
Full document	Single-document Q&A	Only with large context windows

Patterns¶

Recursive Character Text Splitter¶

Most common approach. Recursively splits by separators until chunks are within size limit:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_documents(documents)

Tries \n\n first (paragraph breaks), falls back to \n, then ., then space, then character-level as last resort.

Token-Based Splitting¶

Split by token count rather than character count. More accurate for LLM context management since LLMs have token limits:

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=500,    # tokens, not chars
    chunk_overlap=50
)

Semantic Chunking¶

Split on natural boundaries (paragraphs, sections, headers) rather than arbitrary counts. Better preserves meaning units.

Hierarchical Chunking¶

Parent chunks (full sections) + child chunks (paragraphs). Retrieve child for precision, return parent for context. Gives both precise matching and sufficient surrounding context.

Chunk Overlap¶

Purpose: prevent cutting important context at chunk boundaries
Typical: 10-20% of chunk size (e.g., 100-200 chars for 1000-char chunks)
Too much: duplicate information, increased storage and cost
Too little: lost context at boundaries

Document Loaders¶

Loader	Format	Notes
PyPDF / PyPDF2	PDF	Simple. Struggles with tables and complex layouts
LlamaParse	PDF, DOCX	Best PDF extraction - handles tables, images, complex layouts. Cloud service
Unstructured	PDF, DOCX, HTML, images	Multi-format, extracts structured elements
BeautifulSoup	HTML	Web scraping, HTML parsing
Firecrawl	Web pages	Crawls websites, converts to clean markdown
Cheerio	HTML (Node.js)	Web scraping for FlowWise/Node.js

Data Preparation Checklist¶

Clean documents: remove headers/footers, page numbers, watermarks
Preserve structure: keep headings, table formatting, list structures
Add metadata: source file, page number, section title, date
Handle tables separately: extract as structured data, not text chunks
Test chunk quality: manually review chunks - do they contain meaningful units?
Verify indexing: confirm documents were actually indexed in vector store

Gotchas¶

The #1 mistake in FlowWise: not clicking "Upsert" after connecting documents - RAG retrieves nothing
PDF extraction quality varies wildly - always inspect extracted text before chunking
Tables in PDFs often turn into garbage text with simple extractors - use LlamaParse or manual extraction
Chunk size should be tuned per use case - 1000 chars is just a starting point
Very small chunks (< 200 chars) often lack enough context for meaningful embedding
Overlapping chunks increase storage costs but significantly reduce boundary-related retrieval failures

Alternative: Character Text Splitter¶

Simpler than RecursiveCharacterTextSplitter - splits on a single separator:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",       # single separator (not a list)
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(docs)

When to use: when you know the document structure uses consistent delimiters. Recursive splitter is better for mixed-format documents.

Practical Splitting Pipeline¶

Real-world document processing often requires multiple passes:

from langchain_community.document_loaders import PyPDFLoader

# 1. Load
loader = PyPDFLoader("handbook.pdf")
pages = loader.load()  # one Document per page

# 2. Split with metadata preservation
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(pages)

# Each chunk inherits page metadata (source, page number)
print(chunks[0].metadata)  # {'source': 'handbook.pdf', 'page': 0}

# 3. Verify before indexing
print(f"Total chunks: {len(chunks)}")
print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks)/len(chunks):.0f} chars")

Indexing Verification Checklist¶

After indexing, always verify: 1. Document count matches expectations: vectorstore._collection.count() 2. Sample retrieval works: vectorstore.similarity_search("test query") 3. Metadata is preserved: check result.metadata on retrieved docs 4. No empty chunks: filter chunks with len(chunk.page_content.strip()) > 50