RAG Text Chunking Strategies

Chunking is arguably the most critical factor for RAG performance. How you split your documents affects your system’s ability to find relevant information and generate accurate answers. When a RAG system performs poorly, the issue is often not the retriever—it’s the chunks.

Even a perfect retrieval system fails if it searches over poorly prepared data. In 2025, chunking strategies have evolved from simple fixed-size splitting to sophisticated AI-driven approaches that preserve context and meaning.

This guide explores 8 production-ready chunking strategies, when to use each one, and how to implement them with LangChain and LlamaIndex.

Why Chunking Matters for RAG

Large language models have context window limits (typically 4K-128K tokens). You can’t feed entire documents to embedding models or retrievers efficiently. Chunking solves three critical problems:

Retrieval Precision: Smaller chunks enable more precise semantic matching. A 200-token chunk about “Python async/await” will rank higher for that query than a 5,000-token chapter about “Python concurrency.”

Context Preservation: Good chunking maintains semantic boundaries. Breaking mid-sentence destroys meaning. A chunk starting with “This approach reduces latency by 40%” is useless without knowing which approach.

Computational Efficiency: Embedding models process chunks faster than full documents. Smaller chunks mean lower latency and costs.

NVIDIA’s 2024 benchmark tested seven chunking strategies across five datasets. The results revealed that optimal chunk size varies by content type and query pattern. Financial documents performed best with 1,024-token chunks (57.9% accuracy), while knowledge graphs preferred page-level chunking (52% accuracy).

The key insight: there is no universal chunking strategy. Your choice depends on document structure, query types, and retrieval requirements.

The 8 Chunking Strategies You Need to Know

1. Fixed-Size Chunking

How It Works: Splits text by token or character count, regardless of content boundaries.

Complexity: Low (1/5 dots)

Best For: Simple documents where speed matters more than perfect context. Meeting notes, short blog posts, emails, simple FAQs.

Implementation (LangChain):

from langchain_text_splitters import CharacterTextSplitter

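# chunk_size and chunk_overlap are measured in characters here, not tokens
text_splitter = CharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separator="\n"  # text is cut at newlines, then pieces are merged up to chunk_size
)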

chunks = text_splitter.split_documents(documents)

Implementation (LlamaIndex):

from llama_index.core.node_parser import SentenceSplitter

# Sizes are in tokens; SentenceSplitter avoids cutting mid-sentence where it can
splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)

nodes = splitter.get_nodes_from_documents(documents)
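
Both library calls wrap the same core idea, a sliding window. Here is a minimal sketch of that window in plain Python, assuming the text has already been tokenized. The function name fixed_size_chunks and the whitespace "tokens" in the usage lines are illustrative choices, not part of LangChain or LlamaIndex.

```python
def fixed_size_chunks(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace "tokens" stand in for a real tokenizer here (an assumption).
words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each window starts chunk_size minus chunk_overlap tokens after the previous one, which is why the last token of one chunk reappears as the first token of the next.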

Optimal Chunk Sizes:

  • 128-256 tokens: Precise fact-based queries, high retrieval precision
  • 512-1024 tokens: Complex reasoning, better context retention
  • 1024+ tokens: Analytical queries requiring broad context

Pros:

  • Fast and predictable
  • Easy to implement
  • Low computational overhead
  • Consistent chunk sizes

Cons:

  • Breaks semantic boundaries (mid-sentence, mid-paragraph)
  • Loses context at chunk boundaries
  • Poor performance on structured documents

When to Use: Prototyping, homogeneous content (all meeting notes, all emails), when speed is critical, or when document structure is minimal.

2. Recursive Chunking

How It Works: Attempts multiple separators in order of priority (\n\n → \n → . → space) until chunks fit the target size.

Complexity: Low-Medium (2/5 dots)

Best For: Documents with structure that should be preserved. Research articles, product guides, short reports.

Implementation (LangChain):

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_documents(documents)

How the Algorithm Works:

  1. Try splitting by double newline (paragraphs)
  2. If chunks are still too large, try single newline (lines)
  3. If still too large, try period + space (sentence boundaries)
  4. If still too large, try spaces (word boundaries)
  5. Last resort: split by character

This preserves natural text boundaries while respecting size limits.
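
The fallback order can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: it measures size in characters, omits overlap, and the name recursive_split is made up for the example.

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ", "")):
    """Split with the coarsest separator that works, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: retry this piece with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, rest or ("",)))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaaa bbbb cccc\n\ndddd eeee", chunk_size=10))
```

In the usage line, the first paragraph is too long for one chunk, so only that paragraph falls through to word-level splitting; the second paragraph stays whole.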

Pros:

  • Respects natural text structure
  • Better context preservation than fixed-size
  • Still fast and predictable
  • Works well with most content

Cons:

  • Doesn’t understand semantic meaning
  • May still break in awkward places
  • No awareness of topics or themes

When to Use: Most general-purpose RAG applications. This is LangChain’s recommended default for generic text. Use when documents have basic structure but you need speed.

3. Document-Based Chunking

How It Works: Splits only at document structure boundaries (headers, sections, paragraphs).

Complexity: Low (1/5 dots for Markdown/HTML, 3/5 for custom formats)

Best For: Collections of short, standalone documents or highly structured files. News articles, customer support tickets, Markdown files.

Implementation for Markdown (LangChain):

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

chunks = markdown_splitter.split_text(markdown_text)

Implementation for HTML:

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

chunks = html_splitter.split_text(html_text)
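
To make the header-splitting idea concrete, here is a small standard-library sketch for Markdown. It is an approximation under stated assumptions: it only looks at #, ##, and ### headers, it does not skip "#" lines inside code fences, and the metadata layout (header level mapped to header text) is an illustrative choice rather than the output format of MarkdownHeaderTextSplitter.

```python
import re

HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown_text):
    """Return one chunk per header-delimited section, with the header path as metadata."""
    chunks, path, body = [], {}, []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"content": content, "metadata": dict(path)})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADER.match(line)
        if not match:
            body.append(line)
            continue
        flush()  # a new header closes the section collected so far
        level = len(match.group(1))
        # Drop headers at this level or deeper, then record the new one
        for key in [k for k in path if k >= level]:
            del path[key]
        path[level] = match.group(2).strip()
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun pip.\n## Usage\nCall it.\n# FAQ\nNone yet."
for chunk in split_by_headers(doc):
    print(chunk["metadata"], "->", chunk["content"])
```

Carrying the header path in metadata is what lets a retriever filter or boost by section later on.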

Pros:

  • Preserves document structure completely
  • Excellent for well-formatted content
  • Each chunk is a complete logical unit
  • Metadata automatically extracted from headers

Cons:

  • Highly variable chunk sizes
  • Some chunks may be too large or too small
  • Requires structured input (Markdown, HTML, etc.)
  • Doesn’t work well with plain text

When to Use: Documentation sites, knowledge bases with consistent formatting, content management systems, or when document structure maps perfectly to semantic boundaries.

4. Semantic Chunking

How It Works: Analyzes semantic similarity between adjacent sentences using embeddings. Starts a new chunk when similarity drops below a threshold.

Complexity: Medium (3/5 dots)

Best For: Technical documents, academic papers, narrative content where topics shift without clear separators. Scientific papers, textbooks, novels, whitepapers.

Implementation (LangChain):

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

chunks = text_splitter.split_documents(documents)

Implementation (LlamaIndex):

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding()
)

nodes = splitter.get_nodes_from_documents(documents)

How the Algorithm Works:

  1. Split text into sentences
  2. Embed each sentence using an embedding model
  3. Calculate the cosine distance (1 minus cosine similarity) between adjacent sentences
  4. Where the distance exceeds the threshold (e.g., the 95th percentile of all distances in the document), start a new chunk
  5. Keep the sentences between two breakpoints in the same chunk

Example:

  • Sentence 1: “Neural networks are inspired by biological neurons.” (Topic: Neural networks)
  • Sentence 2: “Each layer transforms input data through weighted connections.” (Topic: Neural networks) → High similarity, same chunk
  • Sentence 3: “Python’s asyncio library handles concurrent operations.” (Topic: Python) → Low similarity, new chunk
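
The same breakpoint logic can be tried without an API key. The sketch below swaps the embedding model for a toy bag-of-words vector, which is an assumption made purely so the example runs offline; with real embeddings only the embed function would change. All function names are illustrative.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, breakpoint_percentile=95):
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(a, b) for a, b in zip(vectors, vectors[1:])]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Neural networks learn weights.",
    "Neural networks stack layers.",
    "Asyncio schedules coroutines.",
]
print(semantic_chunks(sentences))
```

Because the threshold is a percentile of the document's own distances, the number of breakpoints scales with document length rather than with any absolute similarity value.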

Pros:

  • Chunks align with topic boundaries
  • Preserves semantic coherence
  • Better context for complex documents
  • Reduces hallucinations from fragmented context

Cons:

  • Expensive (embeddings for every sentence)
  • Slower than fixed-size chunking
  • Variable chunk sizes
  • Threshold tuning required per domain

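Cost Considerations: A 10,000-word document (~700 sentences, ~13K tokens) needs about 700 sentence embeddings, which can be batched into a handful of API requests. With OpenAI’s text-embedding-3-small ($0.02/1M tokens), that is roughly $0.0003 per document, or about $26 for 100,000 documents. Combining each sentence with its neighbors before embedding (buffer_size=1) roughly triples the token count. This pass comes on top of the embedding run needed to index the finished chunks, and it adds processing time for every document.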

When to Use: High-value documents where accuracy justifies cost (legal contracts, research papers, compliance documents), or when topics shift without headers (novels, transcripts, unstructured reports).

5. LLM-Based Chunking

How It Works: Uses a language model to decide chunk boundaries based on context and meaning.

Complexity: High (4/5 dots)

Best For: Complex text where meaning-aware chunking improves downstream tasks. Long reports, legal opinions, medical records.

Implementation (OpenAI):

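import re
from openai import OpenAI

client = OpenAI()

def llm_chunk(text, max_chunk_size=1000):
    lines = text.splitlines()
    # Number the lines so the model can point at boundaries unambiguously
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines, 1))
    prompt = f"""Split the following text into semantic chunks.
    Each chunk should:
    - Be a complete thought or topic
    - Not exceed {max_chunk_size} characters
    - Break at natural boundaries
    
    Return only the line numbers where a new chunk starts, comma-separated.
    
    Text:
    {numbered}"""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Parse the returned line numbers and cut the text at those lines
    content = response.choices[0].message.content or ""
    starts = sorted({int(n) for n in re.findall(r"\d+", content)} | {1})
    starts = [s for s in starts if 1 <= s <= len(lines)]
    bounds = starts + [len(lines) + 1]
    return ["\n".join(lines[a - 1:b - 1]) for a, b in zip(bounds, bounds[1:])]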

Pros:

  • Most intelligent chunking
  • Understands nuance and context
  • Can follow complex instructions (e.g., “preserve code blocks”)
  • Adapts to document type

Cons:

  • Extremely expensive (LLM API costs for every document)
  • Very slow (LLM inference latency, especially for large documents)
  • Requires LLM access (API dependency)
  • Unpredictable chunk sizes
  • Limited production use (cost/latency prohibitive at scale)

Cost Analysis: GPT-4o-mini costs $0.15/1M input tokens. For a 10,000-word document (~13K tokens), each chunking operation costs ~$0.002. For 100,000 documents, that’s $200. Add response tokens and the cost doubles.
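
The arithmetic behind these figures is easy to reproduce for your own corpus. The helper below is a rough estimate, assuming about 1.3 tokens per English word (the ratio implied by 10,000 words ≈ 13K tokens) and counting input tokens only; the function name and parameters are illustrative.

```python
def chunking_cost(words_per_doc, num_docs, price_per_million_tokens, tokens_per_word=1.3):
    """Return (cost per document, total cost) in dollars for one pass over the corpus."""
    tokens_per_doc = words_per_doc * tokens_per_word
    per_doc = tokens_per_doc * price_per_million_tokens / 1_000_000
    return per_doc, per_doc * num_docs


per_doc, total = chunking_cost(10_000, 100_000, price_per_million_tokens=0.15)
print(f"about ${per_doc:.5f} per document, about ${total:,.0f} for the corpus")
```

Swapping in a different price or average document length shows quickly whether LLM-based chunking fits a given budget.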

When to Use: Research prototypes, one-time processing of critical documents, or when you have budget for quality. Not recommended for production RAG systems due to cost and latency.

6. Agentic Chunking

How It Works: An AI agent analyzes document characteristics and selects the optimal chunking strategy for each document or section.

Complexity: Highest (5/5 dots)

Best For: Complex, nuanced documents that require custom strategies. Regulatory filings, multi-section contracts, corporate policies.

Conceptual Implementation:

from openai import OpenAI

client = OpenAI()

def agentic_chunk(document):
    # Agent analyzes a preview of the document and names a strategy
    analysis_prompt = f"""Analyze this document and recommend the best chunking strategy:
    - fixed: Simple, uniform content
    - semantic: Topics shift without headers
    - document: Well-structured with headers
    - hierarchical: Multi-level structure
    
    Document preview:
    {document[:2000]}
    
    Reply with the strategy name only."""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": analysis_prompt}]
    )
    
    strategy = response.choices[0].message.content.strip().lower()
    
    # Apply selected strategy. The *_chunk functions are placeholders for
    # the splitters shown in the earlier sections.
    if strategy == "semantic":
        return semantic_chunk(document)
    elif strategy == "document":
        return document_chunk(document)
    elif strategy == "hierarchical":
        return hierarchical_chunk(document)
    else:
        return fixed_chunk(document)  # also the fallback for unexpected replies

Real-World Example: A financial report might get:

  • Page-level chunking for financial tables
  • Semantic chunking for management discussion
  • Hierarchical chunking for notes to financial statements

The agent decides based on document type, structure, and content density.
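
To show what per-section routing looks like in code, here is a rule-based stand-in for the agent's decision. It is not agentic chunking itself: simple structural signals replace the LLM call, and the thresholds (more than half the lines look like table rows, at least two headers, more than 200 words) are arbitrary values chosen for illustration.

```python
import re


def choose_strategy(section_text):
    """Pick a chunking strategy name from simple structural signals."""
    lines = [line for line in section_text.splitlines() if line.strip()]
    if not lines:
        return "fixed"
    table_lines = sum(1 for line in lines if line.count("|") >= 2)
    header_lines = sum(1 for line in lines if re.match(r"#{1,6}\s", line))
    if table_lines / len(lines) > 0.5:
        return "page"          # keep tables intact
    if header_lines >= 2:
        return "hierarchical"  # multi-level structure
    if len(section_text.split()) > 200:
        return "semantic"      # long prose without headers
    return "fixed"


print(choose_strategy("| Revenue | 2023 |\n| 10M | 12M |"))
print(choose_strategy("# Notes\nScope.\n## Leases\nDetails."))
```

A cheap router like this can also act as a first pass, with the LLM consulted only for sections where the signals are ambiguous.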

Pros:

  • Optimal strategy per document
  • Handles heterogeneous content
  • Maximum accuracy potential
  • Adapts to edge cases

Cons:

  • Most expensive approach (LLM calls + strategy execution)
  • Slowest (analysis + chunking)
  • Complex implementation
  • Overkill for most use cases

When to Use: Enterprise knowledge bases with diverse document types, compliance/legal systems where accuracy justifies cost, or research projects exploring state-of-the-art RAG.

Research Note: A 2025 study on Recursive Semantic Chunking found that agentic chunking was discontinued in experiments due to “high computational overhead.” The paper states: “Despite its inefficiencies, Agentic Chunking may become viable in the future as LLMs improve in speed and affordability.”

7. Late Chunking

How It Works: Embeds the entire document first, then derives chunk embeddings from the full-context embeddings. This preserves contextual information that traditional chunk-then-embed approaches lose.

Complexity: Medium (3/5 dots)

Best For: Use cases where chunks need awareness of full document context. Case studies, comprehensive manuals, long-form analysis reports.

How Traditional Chunking Loses Context:

Traditional approach:

  1. Split document into chunks
  2. Embed each chunk independently
  3. Result: Each chunk embedding has no context from other chunks

Late chunking approach:

  1. Embed entire document (all tokens)
  2. Apply mean pooling to token embeddings within chunk boundaries
  3. Result: Each chunk embedding includes full document context

Implementation (Jina AI Embeddings):

from transformers import AutoModel, AutoTokenizer
import torch

# Load long-context embedding model and its tokenizer
model_name = 'jinaai/jina-embeddings-v2-base-en'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def late_chunk(text, chunk_boundaries):
    # chunk_boundaries: (start, end) token index pairs, end exclusive
    # 1. Embed entire document
    inputs = tokenizer(text, return_tensors='pt', truncation=False)
    with torch.no_grad():
        outputs = model(**inputs)
        token_embeddings = outputs.last_hidden_state[0]  # Shape: (num_tokens, embedding_dim)
    
    # 2. Apply chunking to embeddings
    chunk_embeddings = []
    for start, end in chunk_boundaries:
        # Mean pooling over tokens in this chunk
        chunk_emb = token_embeddings[start:end].mean(dim=0)
        chunk_embeddings.append(chunk_emb)
    
    return chunk_embeddings
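
The chunk_boundaries argument is expressed in token indices, while most splitters produce character spans. One way to bridge the two is the per-token character offsets that Hugging Face fast tokenizers return when called with return_offsets_mapping=True. The helper below is a sketch of that mapping in plain Python; its name is illustrative, and the offsets in the usage lines are hand-written to stand in for real tokenizer output.

```python
def char_spans_to_token_spans(offsets, char_spans):
    """Map character spans to [start, end) token index spans.

    offsets: per-token (char_start, char_end) pairs, in document order.
    char_spans: per-chunk (char_start, char_end) pairs.
    """
    token_spans = []
    for chunk_start, chunk_end in char_spans:
        inside = [
            i for i, (tok_start, tok_end) in enumerate(offsets)
            # keep tokens that overlap the chunk; special tokens have empty spans
            if tok_start < chunk_end and tok_end > chunk_start and tok_end > tok_start
        ]
        if inside:
            token_spans.append((inside[0], inside[-1] + 1))
    return token_spans


# Hand-written offsets for "Hello world. Bye now." with special tokens at both ends
offsets = [(0, 0), (0, 5), (6, 11), (11, 12), (13, 16), (17, 20), (20, 21), (0, 0)]
print(char_spans_to_token_spans(offsets, [(0, 12), (13, 21)]))
```

The resulting pairs can be passed as chunk_boundaries so that pooling covers exactly the tokens belonging to each chunk.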

Example Benefit:

Document: “Quantum entanglement is a key concept in quantum physics. This phenomenon allows particles to be correlated. It has applications in quantum computing.”

Traditional chunking:

  • Chunk 1 embedding: “Quantum entanglement is a key concept” (no context about applications)
  • Chunk 2 embedding: “This phenomenon allows particles” (doesn’t know “this” refers to entanglement)
  • Chunk 3 embedding: “It has applications in quantum computing” (doesn’t know “it” = entanglement)

Late chunking:

  • All chunks have full document context
  • Chunk 2 knows “this phenomenon” = “quantum entanglement”
  • Chunk 3 knows “it” = “quantum entanglement”

Pros:

  • Preserves full document context in every chunk
  • Reduces hallucinations from isolated fragments
  • Better handling of pronouns and references
  • Improved retrieval accuracy on documents with many cross-references

Cons:

  • Requires long-context embedding model (8K+ tokens)
  • Cannot handle documents exceeding model’s context window
  • More complex implementation
  • Slightly higher latency than traditional chunking

Performance: A 2025 study showed late chunking improved retrieval accuracy by 12-18% on documents with heavy cross-references (legal contracts, technical manuals).

When to Use: Documents with heavy cross-references, pronouns, or where understanding full context improves retrieval. Examples: case studies, comprehensive reports, long-form analysis.

Models Supporting Late Chunking:

  • Jina AI: jina-embeddings-v2-base-en (8,192 tokens)
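  • Nomic: nomic-embed-text-v1.5 (8,192 tokens)

Late chunking needs the per-token embeddings, so the model has to run where you can read its hidden states. Hosted APIs that return only one pooled vector per input, such as OpenAI’s text-embedding-3-large (8,191 tokens), cannot be used this way.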

8. Hierarchical Chunking

How It Works: Breaks text into multiple levels (sections → paragraphs → sentences). Creates parent-child relationships.

Complexity: Medium (3/5 dots)

Best For: Large, structured documents where both summary and detail are needed. Employee handbooks, government regulations, software documentation.

Implementation (LlamaIndex):

from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # Parent → Child → Grandchild
    chunk_overlap=20
)

nodes = parser.get_nodes_from_documents(documents)

How It Works:

Level 1 (Parent): 2048 tokens (entire section)

  • “Section 3: Security Policies” (full section)

Level 2 (Child): 512 tokens (paragraph)

  • “3.1 Password Requirements: Passwords must be at least 12 characters…”

Level 3 (Grandchild): 128 tokens (detail)

  • “Passwords must include uppercase, lowercase, numbers, and symbols”

Retrieval Strategy:

  1. Search at granular level (128 tokens) for precision
  2. Retrieve parent chunks (512-2048 tokens) for context
  3. LLM receives both specific answer + surrounding context
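
The "search small, return large" flow can be sketched in a few lines. In this illustration keyword counting stands in for vector similarity, and the chunk dictionaries with text and parent_id fields are an assumed data layout, not a LlamaIndex structure.

```python
def retrieve_with_parents(query_terms, children, parents, top_k=2):
    """Score small child chunks, then return the distinct parent chunks they belong to."""
    def score(text):
        words = text.lower().split()
        return sum(words.count(term.lower()) for term in query_terms)

    ranked = sorted(children, key=lambda c: score(c["text"]), reverse=True)
    results, seen = [], set()
    for child in ranked[:top_k]:
        if score(child["text"]) == 0 or child["parent_id"] in seen:
            continue  # skip non-matches and parents that were already returned
        seen.add(child["parent_id"])
        results.append(parents[child["parent_id"]])
    return results


parents = {"p1": "Section 3: Security Policies (full text)", "p2": "Section 4: Leave (full text)"}
children = [
    {"text": "Passwords must be at least 12 characters", "parent_id": "p1"},
    {"text": "Passwords must include symbols", "parent_id": "p1"},
    {"text": "Annual leave is 25 days", "parent_id": "p2"},
]
print(retrieve_with_parents(["passwords"], children, parents))
```

Deduplicating by parent matters: two matching children from the same section should send that section to the LLM once, not twice.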

Implementation (Manual Approach):

from langchain_text_splitters import RecursiveCharacterTextSplitter

def hierarchical_chunk(document):
    # Sizes are in characters for RecursiveCharacterTextSplitter
    # Level 1: Large chunks (sections)
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2048,
        chunk_overlap=20
    )
    # Level 2: Medium chunks (paragraphs)
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=20
    )
    parent_chunks = parent_splitter.split_documents([document])
    
    child_chunks = []
    for i, parent in enumerate(parent_chunks):
        # Give each parent an ID so its children can point back to it
        parent.metadata['id'] = f"parent-{i}"
        children = child_splitter.split_documents([parent])
        for child in children:
            child.metadata['parent_id'] = parent.metadata['id']
        child_chunks.extend(children)
    
    return parent_chunks, child_chunks

Pros:

  • Balances precision and context
  • Retrieve specific details, expand to full context
  • Works well for multi-level documents
  • Reduces irrelevant context

Cons:

  • More complex implementation
  • Requires careful size tuning
  • Higher storage (multiple chunk sizes)
  • Retrieval logic more complex

When to Use: Documentation sites (search specific API → retrieve full section), legal/compliance (find regulation → get full context), or when queries vary between high-level and detail-oriented.

Choosing the Right Strategy: Decision Framework

By Content Type

Simple, Unstructured Text (emails, chat logs, social media):

  • Use: Fixed-size chunking
  • Size: 256-512 tokens
  • Why: No structure to preserve, speed matters

Structured Documents (Markdown, HTML, documentation):

  • Use: Document-based chunking
  • Size: Natural boundaries (headers)
  • Why: Structure maps to semantic boundaries

Complex Narrative (research papers, articles, reports):

  • Use: Semantic chunking or Recursive
  • Size: 512-1024 tokens
  • Why: Topics shift without clear headers

Legal/Technical (contracts, medical records, patents):

  • Use: LLM-based or Hierarchical
  • Size: Variable or multi-level (2048/512/128)
  • Why: Meaning-aware boundaries critical

Mixed Content (knowledge base with diverse formats):

  • Use: Agentic chunking
  • Why: Different strategies for different documents

By Query Pattern

Factoid Queries (“What is X?”, “When did Y happen?”):

  • Optimal: 128-256 tokens
  • Strategy: Fixed-size or Semantic
  • Why: Precise matching more important than context

Analytical Queries (“How does X compare to Y?”, “Why did Z fail?”):

  • Optimal: 1024+ tokens
  • Strategy: Hierarchical or Semantic
  • Why: Need broader context for reasoning

Mixed Queries (both factoid and analytical):

  • Use: Hierarchical chunking (retrieve small, return large)

By Performance Requirements

Speed Critical (real-time chatbot, high QPS):

  • Use: Fixed-size or Recursive
  • Why: No embedding/LLM overhead

Accuracy Critical (legal, medical, compliance):

  • Use: Semantic, LLM-based, or Agentic
  • Why: Quality justifies cost

Balanced (most production RAG):

  • Use: Recursive chunking (LangChain default)
  • Why: Good context preservation, still fast

By Budget

Low Budget:

  • Fixed-size → Recursive → Document-based
  • Avoid: Semantic, LLM-based, Agentic

Medium Budget:

  • Recursive → Semantic → Hierarchical
  • Use semantic chunking selectively (high-value docs)

High Budget:

  • Any strategy
  • Benchmark and optimize per document type

Chunk Size Optimization: The Data

NVIDIA’s 2024 Benchmark Results

Tested across 5 datasets (FinanceBench, Earnings, KG-RAG, RAGBattlePacket, RAGChallenge):

Page-Level Chunking: 0.648 accuracy, 0.107 std dev (most consistent)

Token-Based Results:

  • 128 tokens: 0.421 accuracy (worst on KG-RAG)
  • 256 tokens: ~0.55 accuracy (good for factoid)
  • 512 tokens: 0.681 accuracy (best for Earnings dataset)
  • 1024 tokens: 0.579-0.804 accuracy (best for FinanceBench, RAGBattlePacket)
  • 2048 tokens: 0.506-0.749 accuracy (underperformed 1024 on most datasets)

Key Finding: Extreme chunk sizes (very small or very large) underperformed. The “sweet spot” is 512-1024 tokens for most content.

Chunk Size by Use Case

128-256 tokens:

  • Medical fact lookup: “What is the dosage for drug X?”
  • Quick reference: “What’s the keyboard shortcut for Y?”
  • Definitions: “Define technical term Z”
  • 14.5% precision improvement with 64-token overlap (Reddit study)

512 tokens:

  • Financial earnings reports: 68.1% accuracy (NVIDIA)
  • Product documentation
  • FAQ responses

1024 tokens:

  • Financial analysis: 57.9% accuracy on FinanceBench
  • Technical guides: 80.4% accuracy on RAGBattlePacket
  • Research summaries

2048+ tokens:

  • Long-form analysis
  • Comparative reports
  • When answer requires broad context

The Overlap Question

Chunk overlap maintains context between adjacent chunks. Common values: 10-20% of chunk size.

Examples:

  • 512 tokens → 50-100 token overlap
  • 1024 tokens → 100-200 token overlap

Impact: Reddit study showed adding 64-token overlap improved dense retrieval precision by 14.5% (0.173 → 0.198).

Trade-off: More overlap = more chunks = higher storage + costs. Find balance through experimentation.
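
The storage side of this trade-off is simple arithmetic: each chunk after the first advances by chunk size minus overlap. A small sketch, with an illustrative function name and a 10,000-token document chosen as an example:

```python
import math


def chunk_count(doc_tokens, chunk_size, overlap):
    """Number of sliding-window chunks needed to cover doc_tokens tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if doc_tokens <= chunk_size:
        return 1 if doc_tokens > 0 else 0
    return math.ceil((doc_tokens - overlap) / (chunk_size - overlap))


for overlap in (0, 50, 100):
    print(overlap, chunk_count(10_000, 512, overlap))
```

For this document, 512-token chunks with 0, 50, and 100 tokens of overlap give 20, 22, and 25 chunks, so a 20% overlap costs about 25% more vectors to store and search.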

Implementation Guide: LangChain vs LlamaIndex

LangChain Splitters Reference

Best for most use cases: RecursiveCharacterTextSplitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

For code: RecursiveCharacterTextSplitter.from_language()

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50
)

Supported languages include Python, JavaScript, TypeScript, Java, C++, Go, Rust, Ruby, PHP, Swift, Kotlin, C#, HTML, Markdown, and LaTeX. The Language enum lists the full set for your installed version.

For structured content:

  • MarkdownHeaderTextSplitter: Split by Markdown headers
  • HTMLHeaderTextSplitter: Split by HTML headers
  • RecursiveJsonSplitter: Split JSON while preserving structure

For tokens: TokenTextSplitter

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-4 tokenizer
    chunk_size=100,
    chunk_overlap=0
)

For semantic: SemanticChunker

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)

LlamaIndex Node Parsers Reference

Best for most use cases: SentenceSplitter

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)

For semantic: SemanticSplitterNodeParser

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding()
)

For hierarchical: HierarchicalNodeParser

from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)

For code: Use CodeSplitter with language awareness

For context enrichment: SentenceWindowNodeParser

from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,  # ±2 sentences of context
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)

This indexes at sentence granularity but retrieves ±2 surrounding sentences for context.
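
The mechanism is small enough to sketch directly. The version below uses a naive regex sentence splitter, which is an assumption for the sake of a dependency-free example, and reuses the two metadata key names from the parser configuration above.

```python
import re


def sentence_windows(text, window_size=2):
    """Return one record per sentence plus its surrounding window of neighbours."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append({
            "original_text": sentence,             # what gets embedded and searched
            "window": " ".join(sentences[lo:hi]),  # what gets passed to the LLM
        })
    return records


for record in sentence_windows("A one. B two. C three. D four.", window_size=1):
    print(record)
```

Embedding only the single sentence keeps matching precise, while swapping in the window at answer time restores the context the sentence alone lacks.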

Advanced Techniques for 2025

1. Contextual Retrieval (Anthropic)

Prepend LLM-generated context to each chunk before embedding.

How it works:

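from openai import OpenAI

client = OpenAI()

def add_context_to_chunk(chunk, full_document):
    prompt = f"""Document:
    {full_document}
    
    Chunk:
    {chunk}
    
    Provide brief context (2-3 sentences) explaining this chunk's role in the document."""
    
    # One LLM call per chunk; the model name is only an example
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    context = response.choices[0].message.content
    return f"{context}\n\n{chunk}"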

Benefit: Chunks become self-contained. Instead of “It reduces latency by 40%”, you get “Context: This refers to the async/await optimization technique. It reduces latency by 40%.”

Cost: LLM call per chunk (expensive for large corpora).

2. Contextual Embeddings

Embed [document_summary] + [chunk] instead of just [chunk].

from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings()

def contextual_embed(chunk, document_summary):
    text_to_embed = f"Document: {document_summary}\n\nChunk: {chunk}"
    return embedding_model.embed_query(text_to_embed)

Benefit: Query “What’s the recommendation?” can match chunk even if chunk doesn’t contain “recommendation” keyword, because document summary does.

3. Metadata Augmentation (LlamaIndex Pattern)

Add synthetic Q&A pairs or titles to chunk metadata.

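from openai import OpenAI

client = OpenAI()

def ask(prompt):
    # Single-turn helper; the model name is only an example
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def augment_metadata(chunk):
    questions = ask(f"Generate 3 questions this chunk can answer:\n{chunk}")
    title = ask(f"Write a short title for this chunk:\n{chunk}")
    
    return {
        "text": chunk,
        "metadata": {
            "example_questions": questions,
            "auto_title": title
        }
    }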

Benefit: Retrieval can match against questions in metadata, improving recall.

4. Hybrid Chunking

Combine strategies for different sections.

def hybrid_chunk(document):
    # has_*, extract_*, and the *_splitter objects are placeholders for your own helpers
    code_chunks, table_chunks = [], []
    
    # Detect document structure
    if has_code_blocks(document):
        code_chunks = code_splitter.split(extract_code(document))
    
    if has_tables(document):
        table_chunks = table_splitter.split(extract_tables(document))
    
    # Regular text
    text_chunks = recursive_splitter.split(extract_text(document))
    
    return code_chunks + table_chunks + text_chunks

Use case: Technical documentation with code + explanations.
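
For the code-plus-explanations case, a concrete version might look like the sketch below. It assumes Markdown input in which code lives in triple-backtick fences, keeps each fenced block whole, and splits the remaining prose on blank lines. Table handling is left out, and the function name is illustrative.

```python
import re

FENCE_MARK = "`" * 3  # the Markdown code fence marker
FENCED_BLOCK = re.compile(FENCE_MARK + r".*?" + FENCE_MARK, re.DOTALL)


def split_code_and_prose(document):
    """Return (code_chunks, text_chunks): fenced blocks stay whole, prose splits on blank lines."""
    code_chunks = FENCED_BLOCK.findall(document)
    prose = FENCED_BLOCK.sub("\n\n", document)
    text_chunks = [p.strip() for p in re.split(r"\n\s*\n", prose) if p.strip()]
    return code_chunks, text_chunks


doc = "Intro.\n\n" + FENCE_MARK + "python\nprint(1)\n" + FENCE_MARK + "\n\nOutro."
code, text = split_code_and_prose(doc)
print(len(code), "code chunk(s);", text)
```

Keeping code blocks intact avoids the common failure where a function body is cut in half and neither half is useful on retrieval.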

Common Pitfalls and How to Avoid Them

Pitfall 1: Using Default Chunk Sizes Blindly

Problem: Defaults and tutorial values (1,000 characters is common in LangChain examples) are not tuned for your data. Your documents might need 256 tokens or 2048 tokens.

Solution: Benchmark on your data. Test 256, 512, 1024, 2048. Measure retrieval accuracy.

Pitfall 2: Ignoring Chunk Overlap

Problem: Zero overlap creates hard boundaries. Context at chunk edges gets lost.

Solution: Use 10-20% overlap. For 512 tokens, use 50-100 overlap.

Pitfall 3: Inconsistent Chunk Sizes

Problem: Structure-based splitting works great until a 5-word paragraph becomes its own chunk, or a very long section becomes one oversized chunk.

Solution: Combine strategies. Use document-based for structure, then recursive for oversized chunks.

Pitfall 4: Not Considering Query Types

Problem: Using 2048-token chunks for “What is X?” queries (too much irrelevant context).

Solution: Match chunk size to query complexity. Factoid = small, analytical = large.

Pitfall 5: Embedding Overhead Ignored

Problem: Semantic chunking embeds every sentence of every document before indexing even starts. That extra pass adds cost and processing time that were never budgeted.

Solution: Use semantic chunking selectively (critical docs), fixed-size for the rest.

Pitfall 6: Forgetting to Test

Problem: Assume chunking strategy works, then discover 40% accuracy in production.

Solution: Use LlamaIndex’s FaithfulnessEvaluator and RelevancyEvaluator, or the Ragas library, which integrates with LangChain. Test on real queries before deploying.

Evaluation Framework

How to know if your chunking strategy works?

Metrics to Track

Retrieval Metrics:

  • Context Recall: How many relevant chunks were retrieved? (Higher = better)
  • Context Precision: How many retrieved chunks are relevant? (Higher = better)
  • Mean Reciprocal Rank (MRR): Is the best chunk ranked first?

Generation Metrics:

  • Faithfulness: Does the answer match retrieved context? (Checks hallucinations)
  • Answer Relevancy: Does the answer address the query?
  • Context Relevancy: Is retrieved context useful for the query?

System Metrics:

  • Average response time: Chunking + embedding + retrieval + generation
  • Chunk count: More chunks = higher storage + slower search
  • Cost per query: Embedding + LLM costs
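
The three retrieval metrics listed above can be computed from chunk IDs alone once you have a labeled set of relevant chunks per query. The sketch below uses the plain set-based definitions given in the list; evaluation libraries may define the same names differently (for example with rank weighting), so treat it as a baseline rather than a drop-in replacement.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunk IDs that are relevant."""
    return sum(1 for c in retrieved if c in relevant) / len(retrieved) if retrieved else 0.0


def context_recall(retrieved, relevant):
    """Fraction of relevant chunk IDs that were retrieved."""
    return sum(1 for c in relevant if c in retrieved) / len(relevant) if relevant else 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids_in_rank_order, relevant_id_set) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in runs:
        rank = next((i for i, c in enumerate(retrieved, 1) if c in relevant), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs) if runs else 0.0


retrieved, relevant = ["a", "b", "c", "d"], {"b", "d", "e"}
print(context_precision(retrieved, relevant), context_recall(retrieved, relevant))
```

Tracking these per chunking strategy makes it clear whether a change helped the retriever or only changed what the generator happened to say.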

Testing Workflow (LlamaIndex)

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.core import VectorStoreIndex

# Create index with your chunking strategy
nodes = your_chunking_strategy(documents)
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()

# Evaluate (both evaluators call the LLM configured in Settings)
queries = ["What is X?", "How does Y work?", "Compare A vs B"]

faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

for query in queries:
    response = query_engine.query(query)
    
    faithfulness = faithfulness_evaluator.evaluate_response(response=response)
    relevancy = relevancy_evaluator.evaluate_response(query=query, response=response)
    
    print(f"Query: {query}")
    print(f"Faithfulness: {faithfulness.score}")
    print(f"Relevancy: {relevancy.score}")

A/B Testing Chunking Strategies

strategies = {
    "fixed_256": fixed_size_splitter(256, 25),
    "fixed_512": fixed_size_splitter(512, 50),
    "recursive_512": recursive_splitter(512, 50),
    "semantic": semantic_splitter(threshold=95)
}

results = {}

for name, splitter in strategies.items():
    chunks = splitter.split(documents)
    index = create_index(chunks)
    
    # Run evaluation
    scores = evaluate(index, test_queries)
    results[name] = scores

# Compare results
best_strategy = max(results, key=lambda x: results[x]['accuracy'])
print(f"Best strategy: {best_strategy}")
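
The loop above leaves its helpers undefined, so here is a self-contained toy version that runs as-is. It makes several simplifying assumptions: chunks are word windows, retrieval is keyword overlap rather than vector search, and a query counts as a hit when the top chunk contains the expected answer string. All names are illustrative.

```python
def split_words(text, size, overlap):
    """Word-window chunks of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def top_chunk(query, chunks):
    """Keyword-overlap retrieval; a stand-in for vector search."""
    terms = set(query.lower().split())
    return max(chunks, key=lambda c: len(terms & set(c.lower().split())))


def hit_rate(chunks, test_cases):
    """Share of queries whose top chunk contains the expected answer string."""
    hits = sum(1 for query, answer in test_cases if answer in top_chunk(query, chunks))
    return hits / len(test_cases)


def compare(text, test_cases, configs):
    """configs maps a strategy name to a (size, overlap) pair."""
    return {name: hit_rate(split_words(text, size, overlap), test_cases)
            for name, (size, overlap) in configs.items()}


text = "the sky is blue today and the grass is green always while snow is white"
cases = [("sky color", "blue"), ("grass color", "green"), ("snow color", "white")]
results = compare(text, cases, {"words_5": (5, 0), "words_15": (15, 0)})
print(max(results, key=results.get), results)
```

Replacing split_words and top_chunk with your real splitter and retriever keeps the comparison loop unchanged.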

Production Recommendations

For Prototyping

Start with: RecursiveCharacterTextSplitter (LangChain) or SentenceSplitter (LlamaIndex)

Chunk size: 512 tokens, 50 token overlap

Why: Fast, predictable, and a reasonable baseline for most use cases. Iterate from here.

For Production RAG Systems

High-volume, cost-sensitive:

  • Strategy: Fixed-size or Recursive
  • Size: 512 tokens
  • Overlap: 50 tokens

Accuracy-critical, moderate volume:

  • Strategy: Semantic or Hierarchical
  • Size: Variable (semantic) or 2048/512/128 (hierarchical)
  • Overlap: 10-20%

Mixed content types:

  • Strategy: Document-based + Recursive fallback
  • Detect format (Markdown, HTML, plain text)
  • Apply appropriate splitter per format

Enterprise knowledge base:

  • Strategy: Agentic or Hybrid
  • Cost: High, but justified by accuracy
  • Use agent to select strategy per document

For Specific Domains

Legal/Compliance:

  • Strategy: Hierarchical + LLM-based
  • Size: Large (1024-2048 parent, 256-512 child)
  • Why: Need full context + precise citations

Customer Support:

  • Strategy: Document-based (ticket = chunk)
  • Size: Natural boundaries
  • Why: Each ticket is self-contained

Code Documentation:

  • Strategy: Code-aware recursive
  • Size: 500 tokens (preserves functions)
  • Language: Specify (Python, JavaScript, etc.)

Scientific Research:

  • Strategy: Semantic
  • Size: 512-1024 tokens
  • Why: Topics shift without clear headers

The Future of Chunking (2025 and Beyond)

Trend 1: Long-Context Models Reduce Chunking Need

Context windows keep growing: GPT-4 Turbo accepts 128K tokens, Claude 3.5 Sonnet 200K, and Gemini 1.5 Pro 1M.

Implication: For documents under the context limit, you can skip chunking and retrieval entirely and pass the full document to the model.

Reality Check: Even with 200K context, chunking still improves:

  • Retrieval precision (smaller chunks = better matching)
  • Cost (pay for only relevant chunks, not full 200K tokens)
  • Latency (process 5 chunks faster than 1 giant document)

Trend 2: Agentic Chunking Goes Mainstream

As LLM costs drop (GPT-4o-mini costs a small fraction of GPT-4’s per-token price), agentic chunking becomes viable.

Prediction: By late 2025, production RAG systems will use agents to select chunking strategies dynamically.

Trend 3: Embeddings Become Chunking-Aware

Late chunking and contextual embeddings are early examples. Future embedding models will natively handle:

  • Multi-granularity (embed at token, sentence, paragraph levels simultaneously)
  • Cross-chunk awareness (embeddings reference neighboring chunks)
  • Structure preservation (Markdown headers, code blocks embedded with structural metadata)

Example: Cohere’s Embed v3 (2024) already supports multi-granular embeddings.

Trend 4: Chunking-Free RAG

Research explores skipping chunking via:

  • Token-level retrieval: Index individual tokens, retrieve precise spans
  • Neural search: Transformer models that search full documents without chunking
  • Graph-based RAG: Represent documents as knowledge graphs, retrieve subgraphs

Current state: Experimental. Traditional chunking still dominates production.

Final Recommendations

Start simple: Use RecursiveCharacterTextSplitter with chunks of about 512 tokens and a 50-token overlap. This is a reasonable baseline for most RAG applications.

Measure, don’t guess: Benchmark on your data with your queries. Chunk size that works for financial reports may fail for customer support tickets.

Match strategy to content: Structured docs → document-based. Narrative content → semantic. Mixed content → agentic or hybrid.

Watch your budget: Semantic chunking costs scale linearly with document count. Estimate the embedding bill for the full corpus before committing.

Consider query patterns: Factoid queries prefer small chunks (256). Analytical queries need large chunks (1024+).

Use overlap: 10-20% overlap prevents context loss at boundaries. 14.5% precision improvement is worth the storage cost.

Test in production: Offline metrics don’t always predict real-world performance. A/B test strategies with real users.

Future-proof: Long-context models (128K+ tokens) reduce chunking need but don’t eliminate it. Chunking still improves precision and reduces costs.

The best chunking strategy is the one that works for your documents, your queries, and your constraints. Start with recursive, measure performance, and iterate.