Chunking is arguably the most critical factor for RAG performance. How you split your documents affects your system’s ability to find relevant information and generate accurate answers. When a RAG system performs poorly, the issue is often not the retriever—it’s the chunks.
Even a perfect retrieval system fails if it searches over poorly prepared data. In 2025, chunking strategies have evolved from simple fixed-size splitting to sophisticated AI-driven approaches that preserve context and meaning.
This guide explores 8 production-ready chunking strategies, when to use each one, and how to implement them with LangChain and LlamaIndex.
Why Chunking Matters for RAG
Large language models have context window limits (typically 4K-128K tokens). You can’t feed entire documents to embedding models or retrievers efficiently. Chunking solves three critical problems:
Retrieval Precision: Smaller chunks enable more precise semantic matching. A 200-token chunk about “Python async/await” will rank higher for that query than a 5,000-token chapter about “Python concurrency.”
Context Preservation: Good chunking maintains semantic boundaries. Breaking mid-sentence destroys meaning. A chunk starting with “This approach reduces latency by 40%” is useless without knowing which approach.
Computational Efficiency: Embedding models process chunks faster than full documents. Smaller chunks mean lower latency and costs.
NVIDIA’s 2024 benchmark tested seven chunking strategies across five datasets. The results showed that the optimal chunk size varies by content type and query pattern. Financial documents (FinanceBench) performed best with 1,024-token chunks (57.9% accuracy), while the KG-RAG dataset did best with page-level chunking (52% accuracy).
The key insight: there is no universal chunking strategy. Your choice depends on document structure, query types, and retrieval requirements.
The 8 Chunking Strategies You Need to Know
1. Fixed-Size Chunking
How It Works: Splits text by token or character count, regardless of content boundaries.
Complexity: Low (1/5 dots)
Best For: Simple documents where speed matters more than perfect context. Meeting notes, short blog posts, emails, simple FAQs.
Implementation (LangChain):
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    chunk_size=512,      # measured in characters by default, not tokens
    chunk_overlap=50,
    separator="\n"
)
chunks = text_splitter.split_documents(documents)
Implementation (LlamaIndex):
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)
nodes = splitter.get_nodes_from_documents(documents)
Optimal Chunk Sizes:
- 128-256 tokens: Precise fact-based queries, high retrieval precision
- 512-1024 tokens: Complex reasoning, better context retention
- 1024+ tokens: Analytical queries requiring broad context
Pros:
- Fast and predictable
- Easy to implement
- Low computational overhead
- Consistent chunk sizes
Cons:
- Breaks semantic boundaries (mid-sentence, mid-paragraph)
- Loses context at chunk boundaries
- Poor performance on structured documents
When to Use: Prototyping, homogeneous content (all meeting notes, all emails), when speed is critical, or when document structure is minimal.
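To make the mechanics concrete, here is a minimal, library-free sketch of fixed-size splitting with overlap. It counts characters rather than tokens to stay self-contained, and the function name `fixed_size_chunks` and the sample string are illustrative assumptions, not part of LangChain or LlamaIndex.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_size_chunks(sample, chunk_size=10, overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each window starts `chunk_size - overlap` characters after the previous one, which is why the last two characters of one chunk reappear at the start of the next.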
2. Recursive Chunking
How It Works: Attempts multiple separators in order of priority (\n\n → \n → . → space) until chunks fit the target size.
Complexity: Low-Medium (2/5 dots)
Best For: Documents with structure that should be preserved. Research articles, product guides, short reports.
Implementation (LangChain):
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
How the Algorithm Works:
1. Try splitting by double newline (paragraphs)
2. If chunks are still too large, try single newline (lines)
3. If still too large, try period + space (sentence boundaries)
4. If still too large, try spaces (word boundaries)
5. Last resort: split by character
This preserves natural text boundaries while respecting size limits.
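The fallback order can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, it ignores overlap, and it drops the separator at each cut point.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ", "")):
    """Return pieces of at most max_len characters, preferring coarse separators."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate  # still fits: keep packing parts together
            continue
        if current:
            pieces.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # A single part is too long: retry it with the finer separators.
            pieces.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        pieces.append(current)
    return pieces


text = "First paragraph.\n\nSecond paragraph is a bit longer. It has two sentences."
pieces = recursive_split(text, max_len=40)
print(pieces)
```

In this example the first paragraph fits as is, while the second is too long for 40 characters and falls through to the sentence-level separator.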
Pros:
- Respects natural text structure
- Better context preservation than fixed-size
- Still fast and predictable
- Works well with most content
Cons:
- Doesn’t understand semantic meaning
- May still break in awkward places
- No awareness of topics or themes
When to Use: Most general-purpose RAG applications. This is LangChain’s recommended default for generic text. Use when documents have basic structure but you need speed.
3. Document-Based Chunking
How It Works: Splits only at document structure boundaries (headers, sections, paragraphs).
Complexity: Low (1/5 dots for Markdown/HTML, 3/5 for custom formats)
Best For: Collections of short, standalone documents or highly structured files. News articles, customer support tickets, Markdown files.
Implementation for Markdown (LangChain):
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = markdown_splitter.split_text(markdown_text)
Implementation for HTML:
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = html_splitter.split_text(html_text)
Pros:
- Preserves document structure completely
- Excellent for well-formatted content
- Each chunk is a complete logical unit
- Metadata automatically extracted from headers
Cons:
- Highly variable chunk sizes
- Some chunks may be too large or too small
- Requires structured input (Markdown, HTML, etc.)
- Doesn’t work well with plain text
When to Use: Documentation sites, knowledge bases with consistent formatting, content management systems, or when document structure maps perfectly to semantic boundaries.
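For intuition about what a header-based splitter does, here is a rough standard-library sketch. The function `split_by_headers` and the metadata key format are assumptions made for illustration; it does not handle `#` characters inside code fences, which a real splitter has to consider.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Return one chunk per header section, tagged with the headers above it."""
    chunks, path, body = [], {}, []

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(path.items())}
            chunks.append({"content": text, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the section that just ended
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for key in [k for k in path if k >= level]:
                del path[key]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API.\n# FAQ\nSee above."
sections = split_by_headers(doc)
for section in sections:
    print(section["metadata"], "->", section["content"])
```

Each chunk carries its full header path, so a retriever can filter or boost by section without re-parsing the document.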
4. Semantic Chunking
How It Works: Analyzes semantic similarity between adjacent sentences using embeddings. Starts a new chunk when similarity drops below a threshold.
Complexity: Medium (3/5 dots)
Best For: Technical documents, academic papers, narrative content where topics shift without clear separators. Scientific papers, textbooks, novels, whitepapers.
Implementation (LangChain):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = text_splitter.split_documents(documents)
Implementation (LlamaIndex):
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding()
)
nodes = splitter.get_nodes_from_documents(documents)
How the Algorithm Works:
1. Split text into sentences
2. Embed each sentence using an embedding model
3. Calculate the cosine distance (1 - cosine similarity) between adjacent sentences
4. Where the distance exceeds the threshold (e.g., the 95th percentile of all adjacent-sentence distances), start a new chunk
5. Keep the sentences between two breakpoints together in the same chunk
Example:
- Sentence 1: “Neural networks are inspired by biological neurons.” (Topic: Neural networks)
- Sentence 2: “Each layer transforms input data through weighted connections.” (Topic: Neural networks) → High similarity, same chunk
- Sentence 3: “Python’s asyncio library handles concurrent operations.” (Topic: Python) → Low similarity, new chunk
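The breakpoint logic can be demonstrated without an API key by swapping the embedding model for a toy one. In the sketch below, `embed` is a bag-of-words stand-in (an assumption for illustration only, so the sample sentences are chosen to share vocabulary), and a fixed similarity threshold replaces the percentile rule that the library splitters use.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; start a new group when similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks


sentences = [
    "Neural networks learn from data.",
    "Deep neural networks learn layered features from data.",
    "Asyncio schedules Python coroutines.",
]
groups = semantic_chunks(sentences)
print(groups)
```

With a real embedding model the structure stays the same; only `embed` and the way the threshold is chosen change.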
Pros:
- Chunks align with topic boundaries
- Preserves semantic coherence
- Better context for complex documents
- Reduces hallucinations from fragmented context
Cons:
- Expensive (embeddings for every sentence)
- Slower than fixed-size chunking
- Variable chunk sizes
- Threshold tuning required per domain
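Cost Considerations: For a 10,000-word document (~700 sentences, ~13K tokens), semantic chunking embeds every sentence, which means about 700 embedding inputs (these can be sent in batches). With OpenAI’s text-embedding-3-small ($0.02/1M tokens), 13K tokens cost about $0.0003 per document, or roughly $26 for 100,000 documents. Splitters that embed each sentence together with its neighbors (buffer_size=1) process about three times as many tokens, so budget closer to $80 for the same corpus, all of it spent before a single chunk is indexed.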
When to Use: High-value documents where accuracy justifies cost (legal contracts, research papers, compliance documents), or when topics shift without headers (novels, transcripts, unstructured reports).
5. LLM-Based Chunking
How It Works: Uses a language model to decide chunk boundaries based on context and meaning.
Complexity: High (4/5 dots)
Best For: Complex text where meaning-aware chunking improves downstream tasks. Long reports, legal opinions, medical records.
Implementation (OpenAI):
from openai import OpenAI
client = OpenAI()
def llm_chunk(text, max_chunk_size=1000):
    prompt = f"""Split the following text into semantic chunks.
Each chunk should:
- Be a complete thought or topic
- Not exceed {max_chunk_size} characters
- Break at natural boundaries
Return chunk boundaries as line numbers.
Text:
{text}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    # parse_chunks is a helper you supply: it turns the model's reply into chunks
    return parse_chunks(response.choices[0].message.content)
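The parsing step is left open above. One possible shape for it, assuming the model replies with 1-based line numbers marking where each chunk starts, is sketched below. The helper name `chunks_from_line_numbers` is hypothetical, and it needs the original text as well as the reply.

```python
import re


def chunks_from_line_numbers(text: str, reply: str) -> list[str]:
    """Cut `text` into chunks that start at the 1-based line numbers found in `reply`."""
    lines = text.splitlines()
    # Keep only numbers that point at a real line; ignore anything out of range.
    starts = sorted({int(n) for n in re.findall(r"\d+", reply) if 1 <= int(n) <= len(lines)})
    if not starts or starts[0] != 1:
        starts = [1] + starts  # the first chunk always begins at line 1
    bounds = starts + [len(lines) + 1]
    chunks = ["\n".join(lines[a - 1:b - 1]) for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c.strip()]


text = "Intro line\nMore intro\nTopic two starts\nStill topic two\nTopic three"
result = chunks_from_line_numbers(text, "Boundaries: 1, 3, 5")
print(result)
```

Because the regular expression picks up every number in the reply, the prompt should ask for bare line numbers only; validating the reply this way also guards against boundaries the model invents.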
Pros:
- Most intelligent chunking
- Understands nuance and context
- Can follow complex instructions (e.g., “preserve code blocks”)
- Adapts to document type
Cons:
- Extremely expensive (LLM API costs for every document)
- Very slow (LLM inference latency, especially for large documents)
- Requires LLM access (API dependency)
- Unpredictable chunk sizes
- Limited production use (cost/latency prohibitive at scale)
Cost Analysis: GPT-4o-mini costs $0.15/1M input tokens. For a 10,000-word document (~13K tokens), each chunking operation costs ~$0.002. For 100,000 documents, that’s $200. Add response tokens and the cost doubles.
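The arithmetic behind these figures is simple enough to script, which makes it easy to rerun with your own corpus size and current prices. The function name `estimate_cost` is illustrative; the inputs are the numbers quoted above (13K input tokens per document at $0.15 per million).

```python
def estimate_cost(num_docs: int, tokens_per_doc: int, usd_per_million_tokens: float) -> float:
    """Input-token cost in USD for processing every document once."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


per_doc = estimate_cost(1, 13_000, 0.15)
corpus = estimate_cost(100_000, 13_000, 0.15)
print(f"per document: ${per_doc:.5f}")
print(f"100,000 documents: ${corpus:,.2f}")
```

This covers input tokens only; output tokens are billed separately and should be added using the same formula with the output price.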
When to Use: Research prototypes, one-time processing of critical documents, or when you have budget for quality. Not recommended for production RAG systems due to cost and latency.
6. Agentic Chunking
How It Works: An AI agent analyzes document characteristics and selects the optimal chunking strategy for each document or section.
Complexity: Highest (5/5 dots)
Best For: Complex, nuanced documents that require custom strategies. Regulatory filings, multi-section contracts, corporate policies.
Conceptual Implementation:
from openai import OpenAI
client = OpenAI()
def agentic_chunk(document):
    # Agent analyzes document
    analysis_prompt = f"""Analyze this document and recommend the best chunking strategy:
- Fixed-size: Simple, uniform content
- Semantic: Topics shift without headers
- Document-based: Well-structured with headers
- Hierarchical: Multi-level structure
Document preview:
{document[:2000]}
Return: strategy name and parameters"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": analysis_prompt}]
    )
    # parse_strategy, semantic_chunk, document_chunk, and fixed_chunk are placeholders for your own helpers
    strategy = parse_strategy(response.choices[0].message.content)
    # Apply selected strategy
    if strategy == "semantic":
        return semantic_chunk(document)
    elif strategy == "document":
        return document_chunk(document)
    else:
        return fixed_chunk(document)
Real-World Example: A financial report might get:
- Page-level chunking for financial tables
- Semantic chunking for management discussion
- Hierarchical chunking for notes to financial statements
The agent decides based on document type, structure, and content density.
Pros:
- Optimal strategy per document
- Handles heterogeneous content
- Maximum accuracy potential
- Adapts to edge cases
Cons:
- Most expensive approach (LLM calls + strategy execution)
- Slowest (analysis + chunking)
- Complex implementation
- Overkill for most use cases
When to Use: Enterprise knowledge bases with diverse document types, compliance/legal systems where accuracy justifies cost, or research projects exploring state-of-the-art RAG.
Research Note: A 2025 study on Recursive Semantic Chunking found that agentic chunking was discontinued in experiments due to “high computational overhead.” The paper states: “Despite its inefficiencies, Agentic Chunking may become viable in the future as LLMs improve in speed and affordability.”
7. Late Chunking
How It Works: Embeds the entire document first, then derives chunk embeddings from the full-context embeddings. This preserves contextual information that traditional chunk-then-embed approaches lose.
Complexity: Medium (3/5 dots)
Best For: Use cases where chunks need awareness of full document context. Case studies, comprehensive manuals, long-form analysis reports.
How Traditional Chunking Loses Context:
Traditional approach:
1. Split document into chunks
2. Embed each chunk independently
3. Result: Each chunk embedding has no context from other chunks
Late chunking approach:
1. Embed entire document (all tokens)
2. Apply mean pooling to token embeddings within chunk boundaries
3. Result: Each chunk embedding includes full document context
Implementation (Jina AI Embeddings):
from transformers import AutoModel, AutoTokenizer
import torch
# Load long-context embedding model and its tokenizer
model_name = 'jinaai/jina-embeddings-v2-base-en'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
def late_chunk(text, chunk_boundaries):
    # chunk_boundaries: (start, end) token indices, not character offsets
    # 1. Embed entire document
    inputs = tokenizer(text, return_tensors='pt', truncation=False)
    with torch.no_grad():
        outputs = model(**inputs)
    token_embeddings = outputs.last_hidden_state[0]  # Shape: (num_tokens, embedding_dim)
    # 2. Apply chunking to embeddings
    chunk_embeddings = []
    for start, end in chunk_boundaries:
        # Mean pooling over tokens in this chunk
        chunk_emb = token_embeddings[start:end].mean(dim=0)
        chunk_embeddings.append(chunk_emb)
    return chunk_embeddings
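The pooling step is just a per-span average over token vectors. The toy example below uses plain Python lists with made-up two-dimensional "token embeddings" so the arithmetic is easy to follow; `mean_pool` is an illustrative name, and in a real pipeline the vectors would come from the model's output.

```python
def mean_pool(token_vectors, boundaries):
    """Average the token vectors inside each (start, end) span, end exclusive."""
    pooled = []
    for start, end in boundaries:
        span = token_vectors[start:end]
        if not span:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(span[0])
        pooled.append([sum(vec[i] for vec in span) / len(span) for i in range(dim)])
    return pooled


# Four tokens, two chunks of two tokens each.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunk_vectors = mean_pool(tokens, [(0, 2), (2, 4)])
print(chunk_vectors)  # [[2.0, 1.0], [1.0, 3.0]]
```

What makes this "late" is only the order of operations: the token vectors were produced with the whole document in view, so each average inherits that context.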
Example Benefit:
Document: “Quantum entanglement is a key concept in quantum physics. This phenomenon allows particles to be correlated. It has applications in quantum computing.”
Traditional chunking:
- Chunk 1 embedding: “Quantum entanglement is a key concept” (no context about applications)
- Chunk 2 embedding: “This phenomenon allows particles” (doesn’t know “this” refers to entanglement)
- Chunk 3 embedding: “It has applications in quantum computing” (doesn’t know “it” = entanglement)
Late chunking:
- All chunks have full document context
- Chunk 2 knows “this phenomenon” = “quantum entanglement”
- Chunk 3 knows “it” = “quantum entanglement”
Pros:
- Preserves full document context in every chunk
- Reduces hallucinations from isolated fragments
- Better handling of pronouns and references
- Improved retrieval accuracy (10-15% in benchmarks)
Cons:
- Requires long-context embedding model (8K+ tokens)
- Cannot handle documents exceeding model’s context window
- More complex implementation
- Slightly higher latency than traditional chunking
Performance: A 2025 study showed late chunking improved retrieval accuracy by 12-18% on documents with heavy cross-references (legal contracts, technical manuals).
When to Use: Documents with heavy cross-references, pronouns, or where understanding full context improves retrieval. Examples: case studies, comprehensive reports, long-form analysis.
Models Supporting Late Chunking:
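- Jina AI: jina-embeddings-v2-base-en (8,192 tokens)
- Nomic: nomic-embed-text-v1.5 (8,192 tokens)
Late chunking needs per-token embeddings, so it requires a model you can run yourself or an API that exposes them. OpenAI’s text-embedding-3-large (8,191 tokens) has a comparable context window, but its API returns a single vector per input, so the per-chunk pooling step cannot be applied to it.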
8. Hierarchical Chunking
How It Works: Breaks text into multiple levels (sections → paragraphs → sentences). Creates parent-child relationships.
Complexity: Medium (3/5 dots)
Best For: Large, structured documents where both summary and detail are needed. Employee handbooks, government regulations, software documentation.
Implementation (LlamaIndex):
from llama_index.core.node_parser import HierarchicalNodeParser
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # Parent → Child → Grandchild
    chunk_overlap=20
)
nodes = parser.get_nodes_from_documents(documents)
How It Works:
Level 1 (Parent): 2048 tokens (entire section)
- “Section 3: Security Policies” (full section)
Level 2 (Child): 512 tokens (paragraph)
- “3.1 Password Requirements: Passwords must be at least 12 characters…”
Level 3 (Grandchild): 128 tokens (detail)
- “Passwords must include uppercase, lowercase, numbers, and symbols”
Retrieval Strategy:
1. Search at granular level (128 tokens) for precision
2. Retrieve parent chunks (512-2048 tokens) for context
3. LLM receives both specific answer + surrounding context
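The "search small, return large" pattern can be sketched without a vector store. In the example below, the `Chunk` class, the `small_to_big` function, and the keyword-overlap scorer are all illustrative assumptions; a real system would rank children by embedding similarity instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None


def keyword_score(query: str, text: str) -> int:
    """Toy relevance score: how many query words appear in the text."""
    words = set(text.lower().split())
    return sum(1 for w in query.lower().split() if w in words)


def small_to_big(query, children, parents, top_k=1):
    """Rank the small chunks, then return the text of their (deduplicated) parents."""
    ranked = sorted(children, key=lambda c: keyword_score(query, c.text), reverse=True)
    results, seen = [], set()
    for child in ranked[:top_k]:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            results.append(parents[child.parent_id].text)
    return results


parents = {
    "security": Chunk("security", "Section 3: Security Policies. Passwords must be at least 12 characters and include symbols. Laptops must use disk encryption."),
    "leave": Chunk("leave", "Section 4: Leave. Employees accrue vacation days monthly. Sick leave requires notice."),
}
children = [
    Chunk("c1", "Passwords must be at least 12 characters and include symbols.", "security"),
    Chunk("c2", "Laptops must use disk encryption.", "security"),
    Chunk("c3", "Employees accrue vacation days monthly.", "leave"),
    Chunk("c4", "Sick leave requires notice.", "leave"),
]
hits = small_to_big("how long must passwords be", children, parents)
print(hits)
```

Deduplicating on `parent_id` matters in practice: several matching children often share one parent, and sending that parent to the LLM more than once wastes context.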
Implementation (Manual Approach):
from langchain_text_splitters import RecursiveCharacterTextSplitter
def hierarchical_chunk(document):
    # Level 1: Large chunks (sections)
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2048,
        chunk_overlap=20
    )
    # Level 2: Medium chunks (paragraphs)
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=20
    )
    parent_chunks = parent_splitter.split_documents([document])
    child_chunks = []
    for i, parent in enumerate(parent_chunks):
        # Give each parent an id so its children can point back to it
        parent.metadata['id'] = f"parent-{i}"
        children = child_splitter.split_documents([parent])
        for child in children:
            child.metadata['parent_id'] = parent.metadata['id']
        child_chunks.extend(children)
    return parent_chunks, child_chunks
Pros:
- Balances precision and context
- Retrieve specific details, expand to full context
- Works well for multi-level documents
- Reduces irrelevant context
Cons:
- More complex implementation
- Requires careful size tuning
- Higher storage (multiple chunk sizes)
- Retrieval logic more complex
When to Use: Documentation sites (search specific API → retrieve full section), legal/compliance (find regulation → get full context), or when queries vary between high-level and detail-oriented.
Choosing the Right Strategy: Decision Framework
By Content Type
Simple, Unstructured Text (emails, chat logs, social media):
- Use: Fixed-size chunking
- Size: 256-512 tokens
- Why: No structure to preserve, speed matters
Structured Documents (Markdown, HTML, documentation):
- Use: Document-based chunking
- Size: Natural boundaries (headers)
- Why: Structure maps to semantic boundaries
Complex Narrative (research papers, articles, reports):
- Use: Semantic chunking or Recursive
- Size: 512-1024 tokens
- Why: Topics shift without clear headers
Legal/Technical (contracts, medical records, patents):
- Use: LLM-based or Hierarchical
- Size: Variable or multi-level (2048/512/128)
- Why: Meaning-aware boundaries critical
Mixed Content (knowledge base with diverse formats):
- Use: Agentic chunking
- Why: Different strategies for different documents
By Query Pattern
Factoid Queries (“What is X?”, “When did Y happen?”):
- Optimal: 128-256 tokens
- Strategy: Fixed-size or Semantic
- Why: Precise matching more important than context
Analytical Queries (“How does X compare to Y?”, “Why did Z fail?”):
- Optimal: 1024+ tokens
- Strategy: Hierarchical or Semantic
- Why: Need broader context for reasoning
Mixed Queries (both factoid and analytical):
- Use: Hierarchical chunking (retrieve small, return large)
By Performance Requirements
Speed Critical (real-time chatbot, high QPS):
- Use: Fixed-size or Recursive
- Why: No embedding/LLM overhead
Accuracy Critical (legal, medical, compliance):
- Use: Semantic, LLM-based, or Agentic
- Why: Quality justifies cost
Balanced (most production RAG):
- Use: Recursive chunking (LangChain default)
- Why: Good context preservation, still fast
By Budget
Low Budget:
- Fixed-size → Recursive → Document-based
- Avoid: Semantic, LLM-based, Agentic
Medium Budget:
- Recursive → Semantic → Hierarchical
- Use semantic chunking selectively (high-value docs)
High Budget:
- Any strategy
- Benchmark and optimize per document type
Chunk Size Optimization: The Data
NVIDIA’s 2024 Benchmark Results
Tested across 5 datasets (FinanceBench, Earnings, KG-RAG, RAGBattlePacket, RAGChallenge):
Page-Level Chunking: 0.648 accuracy, 0.107 std dev (most consistent)
Token-Based Results:
- 128 tokens: 0.421 accuracy (worst on KG-RAG)
- 256 tokens: ~0.55 accuracy (good for factoid)
- 512 tokens: 0.681 accuracy (best for Earnings dataset)
- 1024 tokens: 0.579-0.804 accuracy (best for FinanceBench, RAGBattlePacket)
- 2048 tokens: 0.506-0.749 accuracy (underperformed 1024 on most datasets)
Key Finding: Extreme chunk sizes (very small or very large) underperformed. The “sweet spot” is 512-1024 tokens for most content.
Chunk Size by Use Case
128-256 tokens:
- Medical fact lookup: “What is the dosage for drug X?”
- Quick reference: “What’s the keyboard shortcut for Y?”
- Definitions: “Define technical term Z”
- 14.5% precision improvement with 64-token overlap (Reddit study)
512 tokens:
- Financial earnings reports: 68.1% accuracy (NVIDIA)
- Product documentation
- FAQ responses
1024 tokens:
- Financial analysis: 57.9% accuracy on FinanceBench
- Technical guides: 80.4% accuracy on RAGBattlePacket
- Research summaries
2048+ tokens:
- Long-form analysis
- Comparative reports
- When answer requires broad context
The Overlap Question
Chunk overlap maintains context between adjacent chunks. Common values: 10-20% of chunk size.
Examples:
- 512 tokens → 50-100 token overlap
- 1024 tokens → 100-200 token overlap
Impact: Reddit study showed adding 64-token overlap improved dense retrieval precision by 14.5% (0.173 → 0.198).
Trade-off: More overlap = more chunks = higher storage + costs. Find balance through experimentation.
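The storage side of this trade-off is easy to quantify. The sketch below counts how many sliding windows are needed to cover a corpus of a given size; the function name `chunk_count` and the 100,000-token corpus are assumptions chosen for illustration.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding windows of chunk_size needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap  # each new chunk advances by this many tokens
    return 1 + math.ceil((total_tokens - chunk_size) / step)


for overlap in (0, 50, 100):
    print(overlap, chunk_count(100_000, 512, overlap))
# 0 196
# 50 217
# 100 243
```

For 512-token chunks, moving from no overlap to 50 tokens adds about 11% more chunks, and 100 tokens adds about 24%, with matching increases in embedding calls and index size.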
Implementation Guide: LangChain vs LlamaIndex
LangChain Splitters Reference
Best for most use cases: RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
For code: RecursiveCharacterTextSplitter.from_language()
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50
)
Supported languages: Python, JavaScript, TypeScript, Java, C++, Go, Rust, Ruby, PHP, Swift, Kotlin, C#, SQL, HTML, Markdown, LaTeX.
For structured content:
- MarkdownHeaderTextSplitter: Split by Markdown headers
- HTMLHeaderTextSplitter: Split by HTML headers
- RecursiveJsonSplitter: Split JSON while preserving structure
For tokens: TokenTextSplitter
from langchain_text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-4 tokenizer
    chunk_size=100,
    chunk_overlap=0
)
For semantic: SemanticChunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
LlamaIndex Node Parsers Reference
Best for most use cases: SentenceSplitter
from llama_index.core.node_parser import SentenceSplitter
parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)
For semantic: SemanticSplitterNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding()
)
For hierarchical: HierarchicalNodeParser
from llama_index.core.node_parser import HierarchicalNodeParser
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
For code: Use CodeSplitter with language awareness
For context enrichment: SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceWindowNodeParser
parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,  # ±2 sentences of context
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)
This indexes at sentence granularity but retrieves ±2 surrounding sentences for context.
Advanced Techniques for 2025
1. Contextual Retrieval (Anthropic)
Prepend LLM-generated context to each chunk before embedding.
How it works:
def add_context_to_chunk(chunk, full_document):
    prompt = f"""Document:
{full_document}
Chunk:
{chunk}
Provide brief context (2-3 sentences) explaining this chunk's role in the document."""
    # llm is a placeholder for whichever LLM client you use
    context = llm.generate(prompt)
    return f"{context}\n\n{chunk}"
Benefit: Chunks become self-contained. Instead of “It reduces latency by 40%”, you get “Context: This refers to the async/await optimization technique. It reduces latency by 40%.”
Cost: LLM call per chunk (expensive for large corpora).
2. Contextual Embeddings
Embed [document_summary] + [chunk] instead of just [chunk].
def contextual_embed(chunk, document_summary):
    text_to_embed = f"Document: {document_summary}\n\nChunk: {chunk}"
    # embedding_model is a placeholder for your embedding client
    return embedding_model.embed(text_to_embed)
Benefit: Query “What’s the recommendation?” can match chunk even if chunk doesn’t contain “recommendation” keyword, because document summary does.
3. Metadata Augmentation (LlamaIndex Pattern)
Add synthetic Q&A pairs or titles to chunk metadata.
def augment_metadata(chunk):
    prompt = f"""Generate 3 questions this chunk can answer:
{chunk}"""
    questions = llm.generate(prompt)
    return {
        "text": chunk,
        "metadata": {
            "example_questions": questions,
            "auto_title": generate_title(chunk)
        }
    }
Benefit: Retrieval can match against questions in metadata, improving recall.
4. Hybrid Chunking
Combine strategies for different sections.
def hybrid_chunk(document):
    # Start with empty lists so the final concatenation works for any document
    code_chunks, table_chunks = [], []
    # Detect document structure
    if has_code_blocks(document):
        code_chunks = code_splitter.split(extract_code(document))
    if has_tables(document):
        table_chunks = table_splitter.split(extract_tables(document))
    # Regular text
    text_chunks = recursive_splitter.split(extract_text(document))
    return code_chunks + table_chunks + text_chunks
Use case: Technical documentation with code + explanations.
Common Pitfalls and How to Avoid Them
Pitfall 1: Using Default Chunk Sizes Blindly
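Problem: LangChain examples commonly use 1,000 characters, and the splitter counts characters rather than tokens unless you configure it otherwise. Your documents might need 256 tokens or 2048 tokens.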
Solution: Benchmark on your data. Test 256, 512, 1024, 2048. Measure retrieval accuracy.
Pitfall 2: Ignoring Chunk Overlap
Problem: Zero overlap creates hard boundaries. Context at chunk edges gets lost.
Solution: Use 10-20% overlap. For 512 tokens, use 50-100 overlap.
Pitfall 3: Inconsistent Chunk Sizes
Problem: Fixed-size works great until you hit a 5-word paragraph that becomes its own chunk.
Solution: Combine strategies. Use document-based for structure, then recursive for oversized chunks.
Pitfall 4: Not Considering Query Types
Problem: Using 2048-token chunks for “What is X?” queries (too much irrelevant context).
Solution: Match chunk size to query complexity. Factoid = small, analytical = large.
Pitfall 5: Embedding Overhead Ignored
Problem: Semantic chunking adds an embedding pass over every sentence in the corpus, and the ingestion budget never accounted for it.
Solution: Use semantic chunking selectively (critical docs), fixed-size for the rest.
Pitfall 6: Forgetting to Test
Problem: Assume chunking strategy works, then discover 40% accuracy in production.
Solution: Use LlamaIndex’s evaluators (FaithfulnessEvaluator, RelevancyEvaluator) or the Ragas metrics library. Test on real queries before deploying.
Evaluation Framework
How to know if your chunking strategy works?
Metrics to Track
Retrieval Metrics:
- Context Recall: How many relevant chunks were retrieved? (Higher = better)
- Context Precision: How many retrieved chunks are relevant? (Higher = better)
- Mean Reciprocal Rank (MRR): Is the best chunk ranked first?
Generation Metrics:
- Faithfulness: Does the answer match retrieved context? (Checks hallucinations)
- Answer Relevancy: Does the answer address the query?
- Context Relevancy: Is retrieved context useful for the query?
System Metrics:
- Average response time: Chunking + embedding + retrieval + generation
- Chunk count: More chunks = higher storage + slower search
- Cost per query: Embedding + LLM costs
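The retrieval metrics are straightforward to compute once you have labeled which chunks are relevant for each test query. The sketch below uses simple set-based definitions that match the one-line descriptions above; evaluation libraries may define these metrics differently (for example with rank weighting or LLM judges), and the chunk ids are made up for illustration.

```python
def context_precision(retrieved, relevant):
    """Share of retrieved chunk ids that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Share of relevant chunk ids that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["c7", "c2", "c9", "c4"], {"c2", "c4", "c5"}),  # query 1
    (["c1", "c3"], {"c1"}),                          # query 2
]
print(context_precision(*runs[0]))   # 0.5
print(context_recall(*runs[0]))
print(mean_reciprocal_rank(runs))    # 0.75
```

Running the same labeled queries against each chunking configuration gives comparable numbers without involving an LLM at all.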
Testing Workflow (LlamaIndex)
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.core import VectorStoreIndex
# Create index with your chunking strategy
nodes = your_chunking_strategy(documents)
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
# Evaluate
queries = ["What is X?", "How does Y work?", "Compare A vs B"]
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()
for query in queries:
    response = query_engine.query(query)
    # Faithfulness: is the answer supported by the retrieved context?
    faithfulness = faithfulness_evaluator.evaluate_response(response=response)
    # Relevancy: do the answer and context address the query?
    relevancy = relevancy_evaluator.evaluate_response(query=query, response=response)
    print(f"Query: {query}")
    print(f"Faithfulness: {faithfulness.score}")
    print(f"Relevancy: {relevancy.score}")
A/B Testing Chunking Strategies
# fixed_size_splitter, recursive_splitter, semantic_splitter, create_index, and evaluate
# are placeholders for your own helpers
strategies = {
    "fixed_256": fixed_size_splitter(256, 25),
    "fixed_512": fixed_size_splitter(512, 50),
    "recursive_512": recursive_splitter(512, 50),
    "semantic": semantic_splitter(threshold=95)
}
results = {}
for name, splitter in strategies.items():
    chunks = splitter.split(documents)
    index = create_index(chunks)
    # Run evaluation
    scores = evaluate(index, test_queries)
    results[name] = scores
# Compare results
best_strategy = max(results, key=lambda x: results[x]['accuracy'])
print(f"Best strategy: {best_strategy}")
Production Recommendations
For Prototyping
Start with: RecursiveCharacterTextSplitter (LangChain) or SentenceSplitter (LlamaIndex)
Chunk size: 512 tokens, 50 token overlap
Why: Fast, predictable, works for 80% of use cases. Iterate from here.
For Production RAG Systems
High-volume, cost-sensitive:
- Strategy: Fixed-size or Recursive
- Size: 512 tokens
- Overlap: 50 tokens
Accuracy-critical, moderate volume:
- Strategy: Semantic or Hierarchical
- Size: Variable (semantic) or 2048/512/128 (hierarchical)
- Overlap: 10-20%
Mixed content types:
- Strategy: Document-based + Recursive fallback
- Detect format (Markdown, HTML, plain text)
- Apply appropriate splitter per format
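A format-detection step can start out very simple. The heuristics below are illustrative assumptions rather than a robust detector, and the function returns the name of a LangChain splitter class mentioned in this guide instead of instantiating it, so the sketch runs without any dependencies.

```python
import re


def detect_format(text: str) -> str:
    """Very rough format sniffing based on a few telltale patterns."""
    if re.search(r"<(html|body|h[1-6]|p|div)\b[^>]*>", text, re.IGNORECASE):
        return "html"
    if re.search(r"^#{1,6}\s+\S", text, re.MULTILINE):
        return "markdown"
    return "plain"


SPLITTER_FOR_FORMAT = {
    "html": "HTMLHeaderTextSplitter",
    "markdown": "MarkdownHeaderTextSplitter",
    "plain": "RecursiveCharacterTextSplitter",  # the recursive fallback
}


def choose_splitter(text: str) -> str:
    """Name of the splitter to apply to this document."""
    return SPLITTER_FOR_FORMAT[detect_format(text)]


print(choose_splitter("<h1>Title</h1><p>Body</p>"))
print(choose_splitter("# Title\n\nBody text"))
print(choose_splitter("Just a plain paragraph."))
```

When file extensions or MIME types are available they are a more reliable signal than content sniffing, and oversized sections produced by a header splitter can still be passed through the recursive splitter afterwards.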
Enterprise knowledge base:
- Strategy: Agentic or Hybrid
- Cost: High, but justified by accuracy
- Use agent to select strategy per document
For Specific Domains
Legal/Compliance:
- Strategy: Hierarchical + LLM-based
- Size: Large (1024-2048 parent, 256-512 child)
- Why: Need full context + precise citations
Customer Support:
- Strategy: Document-based (ticket = chunk)
- Size: Natural boundaries
- Why: Each ticket is self-contained
Code Documentation:
- Strategy: Code-aware recursive
- Size: 500 tokens (preserves functions)
- Language: Specify (Python, JavaScript, etc.)
Scientific Research:
- Strategy: Semantic
- Size: 512-1024 tokens
- Why: Topics shift without clear headers
The Future of Chunking (2025 and Beyond)
Trend 1: Long-Context Models Reduce Chunking Need
GPT-4 Turbo (128K tokens), Claude 3.5 Sonnet (200K tokens), Gemini 1.5 Pro (1M tokens).
Implication: For documents under context limit, you can skip chunking entirely and embed full document.
Reality Check: Even with 200K context, chunking still improves:
- Retrieval precision (smaller chunks = better matching)
- Cost (pay for only relevant chunks, not full 200K tokens)
- Latency (process 5 chunks faster than 1 giant document)
Trend 2: Agentic Chunking Goes Mainstream
As LLM costs drop (GPT-4o-mini is 60× cheaper than GPT-4), agentic chunking becomes viable.
Prediction: By late 2025, production RAG systems will use agents to select chunking strategies dynamically.
Trend 3: Embeddings Become Chunking-Aware
Late chunking and contextual embeddings are early examples. Future embedding models will natively handle:
- Multi-granularity (embed at token, sentence, paragraph levels simultaneously)
- Cross-chunk awareness (embeddings reference neighboring chunks)
- Structure preservation (Markdown headers, code blocks embedded with structural metadata)
Example: Cohere’s Embed v3 (2024) already supports multi-granular embeddings.
Trend 4: Chunking-Free RAG
Research explores skipping chunking via:
- Token-level retrieval: Index individual tokens, retrieve precise spans
- Neural search: Transformer models that search full documents without chunking
- Graph-based RAG: Represent documents as knowledge graphs, retrieve subgraphs
Current state: Experimental. Traditional chunking still dominates production.
Final Recommendations
Start simple: Use RecursiveCharacterTextSplitter with 512 tokens and 50 overlap. This works for 80% of RAG applications.
Measure, don’t guess: Benchmark on your data with your queries. Chunk size that works for financial reports may fail for customer support tickets.
Match strategy to content: Structured docs → document-based. Narrative content → semantic. Mixed content → agentic or hybrid.
Watch your budget: Semantic and LLM-based chunking costs scale linearly with corpus size. Estimate tokens per document times price per token before committing to a strategy for 1M documents.
Consider query patterns: Factoid queries prefer small chunks (256). Analytical queries need large chunks (1024+).
Use overlap: 10-20% overlap prevents context loss at boundaries. 14.5% precision improvement is worth the storage cost.
Test in production: Offline metrics don’t always predict real-world performance. A/B test strategies with real users.
Future-proof: Long-context models (128K+ tokens) reduce chunking need but don’t eliminate it. Chunking still improves precision and reduces costs.
The best chunking strategy is the one that works for your documents, your queries, and your constraints. Start with recursive, measure performance, and iterate.