Retrieval-Augmented Generation (RAG) has become the standard architecture for LLM applications that need accurate, up-to-date information. However, naive RAG implementations often fail in production. Here’s how to build systems that actually work.
Why Basic RAG Fails
Simple vector similarity search has critical limitations:
- Poor performance on exact matches (product codes, names)
- Struggles with low-frequency terms and acronyms
- No understanding of document structure or metadata
- Sensitive to query phrasing variations
Hybrid search solves these issues by combining dense (semantic) and sparse (keyword) retrieval.
Architecture Overview
```
User Query
    ↓
Query Enhancement (expansion, rewriting)
    ↓
Parallel Retrieval
    ├→ Dense Search (embeddings)
    └→ Sparse Search (BM25)
    ↓
Result Fusion (RRF)
    ↓
Reranking (cross-encoder)
    ↓
Context Construction
    ↓
LLM Generation
```
Vector Database Selection
| Database | Best For | Hybrid Search |
|---|---|---|
| Qdrant | High performance, Rust-based | Native |
| Weaviate | Rich features, GraphQL API | Native |
| Typesense | Typo tolerance, faceting | Excellent |
| Milvus | Massive scale (>1B vectors) | Via plugin |
| pgvector | PostgreSQL integration | Manual |

For most applications, Qdrant or Weaviate provide the best balance of features and performance.
Implementing Dense Search
```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create a collection with one named dense vector
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(
            size=1024,  # must match the `dimensions` requested from the embedding model
            distance=Distance.COSINE,
        )
    },
)

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_text(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1024,  # the model returns 3072 dimensions by default; shorten to match the collection
    )
    return response.data[0].embedding

# Index documents: `documents` is a list of dicts, each with a "content" key
points = [
    PointStruct(
        id=doc_id,
        vector={"dense": embed_text(doc["content"])},
        payload=doc,
    )
    for doc_id, doc in enumerate(documents)
]
client.upsert(collection_name="documents", points=points)
```
Adding Sparse Search
```python
import torch
from qdrant_client.models import Distance, SparseVector, SparseVectorParams, VectorParams
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Configure dense and sparse vectors (BM25-like) in one collection.
# Use this in place of the dense-only definition: "documents" can only be created once.
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(size=1024, distance=Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()
    },
)

# Use SPLADE or BM25 for sparse encoding
tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")
model = AutoModelForMaskedLM.from_pretrained("naver/splade-cocondenser-ensembledistil")
model.eval()

def sparse_encode(text: str) -> SparseVector:
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output = model(**tokens)
    # Extract sparse vector (token activations): log-saturated ReLU, masked, max-pooled over the sequence
    weights = torch.log1p(torch.relu(output.logits)) * tokens["attention_mask"].unsqueeze(-1)
    vec = weights.max(dim=1).values.squeeze(0)  # shape: (vocab_size,)
    indices = vec.nonzero().squeeze(1)
    return SparseVector(indices=indices.tolist(), values=vec[indices].tolist())
```
Hybrid Search Query
```python
from qdrant_client.models import Fusion, FusionQuery, Prefetch

def hybrid_search(query: str, limit: int = 10) -> list[dict]:
    dense_vec = embed_text(query)
    sparse_vec = sparse_encode(query)
    response = client.query_points(
        collection_name="documents",
        prefetch=[
            # Dense search
            Prefetch(query=dense_vec, using="dense", limit=20),  # over-fetch
            # Sparse search
            Prefetch(query=sparse_vec, using="sparse", limit=20),
        ],
        query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
        limit=limit,
    )
    # Return the stored payloads so later steps can read doc["content"]
    return [point.payload for point in response.points]
```
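To make the fusion step less of a black box, here is a minimal pure-Python sketch of Reciprocal Rank Fusion. The function name `reciprocal_rank_fusion`, the document IDs, and the constant `k=60` are illustrative assumptions; 60 is a commonly used default, not a statement about the constant Qdrant uses internally.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings of document IDs into one scored list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents ranked well in several lists accumulate score
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the dense and sparse retrievers
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because RRF uses only ranks, it needs no score normalization between cosine similarities and sparse scores, which is the main reason it is a convenient default for hybrid search.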
Reranking for Precision
After retrieval, rerank with a cross-encoder for maximum relevance:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list, top_k: int = 5):
    pairs = [[query, doc["content"]] for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

# Full pipeline
results = hybrid_search(query, limit=20)
final_docs = rerank(query, results, top_k=5)
```
Query Enhancement Techniques
1. Query Expansion:
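```python
import re

def parse_expansions(response: str) -> list[str]:
    # Drop list numbering such as "2." and skip empty lines
    lines = [re.sub(r"^\s*\d+\.\s*", "", line).strip() for line in response.splitlines()]
    return [line for line in lines if line]

def expand_query(query: str) -> str:
    prompt = f"""Generate 2 alternative phrasings of this query:
Query: {query}
Alternative phrasings:
1."""
    # `llm` stands for whichever LLM client you use; generate() should return the completion text
    response = llm.generate(prompt)
    expansions = parse_expansions(response)
    return query + " " + " ".join(expansions)
```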
2. Query Decomposition:
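```python
import re

def decompose_query(complex_query: str) -> list[str]:
    """Break complex queries into sub-queries."""
    prompt = f"""Break this complex question into 2-3 simpler sub-questions:
Question: {complex_query}
Sub-questions:
1."""
    # The prompt ends with "1.", so the completion continues the numbered list
    lines = llm.generate(prompt).strip().split("\n")
    sub_queries = [re.sub(r"^\s*\d+\.\s*", "", line).strip() for line in lines]
    return [q for q in sub_queries if q]

# Retrieve for each sub-query and combine
```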
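One way to combine the per-sub-query results is a round-robin merge that drops duplicates. This is a sketch under assumptions: `retrieve_for_sub_queries` is a hypothetical helper, `search_fn` is any function that returns a best-first list of dicts (for example a hybrid search function), and each result is assumed to carry a unique `"id"` field.

```python
from typing import Callable

def retrieve_for_sub_queries(
    sub_queries: list[str],
    search_fn: Callable[[str], list[dict]],
    id_key: str = "id",
    limit: int = 10,
) -> list[dict]:
    """Run search_fn per sub-query and interleave the results, dropping duplicates."""
    per_query = [search_fn(q) for q in sub_queries]
    merged: list[dict] = []
    seen: set = set()
    longest = max((len(hits) for hits in per_query), default=0)
    # Take rank 1 from every sub-query, then rank 2, and so on, so no sub-query dominates
    for rank in range(longest):
        for hits in per_query:
            if rank < len(hits) and hits[rank][id_key] not in seen:
                seen.add(hits[rank][id_key])
                merged.append(hits[rank])
    return merged[:limit]

# Stand-in search function backed by a fixed lookup table (illustrative data only)
fake_index = {
    "q1": [{"id": 1}, {"id": 2}],
    "q2": [{"id": 2}, {"id": 3}],
}
merged = retrieve_for_sub_queries(["q1", "q2"], lambda q: fake_index[q])
```

Interleaving keeps each sub-question represented in the final context; rank fusion over the sub-query lists is a reasonable alternative when some documents answer several sub-questions.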
Chunking Strategies
Chunk size dramatically impacts retrieval quality:
| Strategy | Chunk Size | Use Case |
|---|---|---|
| Fixed | 512 tokens | Simple, fast |
| Sentence | Variable | Preserves meaning |
| Semantic | Variable | Topic coherence |
| Recursive | Hierarchical | Long documents |

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # measured in characters by default, not tokens
    chunk_overlap=50,  # maintain context
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)
```
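For comparison, the fixed-size strategy needs no library at all. The sketch below is illustrative only: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for tokens, which a real tokenizer would count differently.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows of words, with `overlap` words shared between neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
windows = chunk_words(sample, chunk_size=4, overlap=1)
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.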
Metadata Filtering
Combine vector search with metadata filters for precise results:
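```python
from datetime import datetime, timezone

from qdrant_client.models import DatetimeRange, FieldCondition, Filter, MatchValue

results = client.query_points(
    collection_name="documents",
    query=query_vector,
    using="dense",  # name of the vector to search, since the collection uses named vectors
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="technical"),
            ),
            FieldCondition(
                key="date",
                range=DatetimeRange(gte=datetime(2025, 1, 1, tzinfo=timezone.utc)),
            ),
        ]
    ),
    limit=10,
)
```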
Context Construction
Optimize how you present retrieved chunks to the LLM:
```python
def build_context(query: str, docs: list) -> str:
    context_parts = []
    for i, doc in enumerate(docs, 1):
        # Include metadata for provenance
        context_parts.append(f"""
Document {i} (Source: {doc["source"]}, Date: {doc["date"]}):
{doc["content"]}
---
""")
    return "\n".join(context_parts)

context = build_context(query, docs)
prompt = f"""Use the following documents to answer the question.
If the answer is not in the documents, say so.
{context}
Question: {query}
Answer:"""
```
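Part of presenting chunks well is not overflowing the prompt. The sketch below keeps the highest-ranked documents that fit a token budget. Both helper names are hypothetical, and the four-characters-per-token estimate is a rough assumption for English text, not a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text
    return max(1, len(text) // 4)

def select_within_budget(docs: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Keep best-first docs whose combined estimated size stays within max_tokens."""
    selected, used = [], 0
    for doc in docs:  # docs are assumed to be ordered best-first, e.g. after reranking
        cost = estimate_tokens(doc["content"])
        if used + cost > max_tokens:
            continue  # skip a doc that does not fit; a shorter one further down may still fit
        selected.append(doc)
        used += cost
    return selected

# Illustrative documents with estimated sizes of 10, 100, and 5 tokens
candidates = [
    {"content": "a" * 40},
    {"content": "b" * 400},
    {"content": "c" * 20},
]
kept = select_within_budget(candidates, max_tokens=20)
```

Swapping `estimate_tokens` for the tokenizer of the target model would make the budget exact, at the cost of an extra dependency.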
Caching for Performance
```python
import hashlib
import json

import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def cached_search(query: str, ttl: int = 3600):
    # Cache key from query hash
    cache_key = f"rag:{hashlib.sha256(query.encode()).hexdigest()}"
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # Perform search
    results = hybrid_search(query)
    # Cache results (they must be JSON-serializable, e.g. a list of payload dicts)
    redis_client.setex(cache_key, ttl, json.dumps(results))
    return results
```
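Hashing the raw query means "What is RAG?" and "what is  rag?" miss each other in the cache. A possible refinement, sketched here with a hypothetical `cache_key` helper, is to normalize whitespace and case before hashing. Whether lowercasing is safe is an assumption to check against your data, since it also merges case-sensitive identifiers such as product codes.

```python
import hashlib
import re

def cache_key(query: str, prefix: str = "rag") -> str:
    """Build a cache key that ignores case and redundant whitespace."""
    normalized = re.sub(r"\s+", " ", query).strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"

key_a = cache_key("  What is  RAG? ")
key_b = cache_key("what is rag?")
```

Both calls produce the same key, so the second query is served from the cache entry written by the first.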
Monitoring and Evaluation
Track key metrics in production:
- Retrieval metrics: Recall@k, MRR, nDCG
- Generation metrics: Faithfulness, answer relevance
- System metrics: Latency p95, cache hit rate
- User metrics: Thumbs up/down, follow-up questions
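The retrieval metrics in the list above are short to compute once each query has a set of known relevant document IDs. This is a minimal sketch with binary relevance; the function names and sample IDs are illustrative. MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Illustrative data: the relevant documents sit at ranks 2 and 4
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
recall = recall_at_k(retrieved_ids, relevant_ids, k=3)
rr = reciprocal_rank(retrieved_ids, relevant_ids)
ndcg = ndcg_at_k(retrieved_ids, relevant_ids, k=4)
```

Averaging these per-query values over a fixed evaluation set gives numbers you can compare across changes to embeddings, chunking, or fusion settings.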
Cost Optimization
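```
# Costs for 1M queries/month

# Embeddings (priced per token, so the cost scales with query length T)
1M queries × T tokens/query × $0.13/1M tokens = $0.13 × T

# Vector DB (Qdrant Cloud)
Standard tier: $99/month

# LLM (Claude 3.5 Sonnet, input tokens only)
1M queries × 1K tokens × $3/1M tokens = $3,000

Total: ~$3,100/month
```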
The LLM is by far the largest cost. Optimize by caching, using smaller models when possible, and efficient prompt engineering.
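The same arithmetic can be parameterized to see how caching changes the bill. This is a sketch under stated assumptions: `monthly_cost` is a hypothetical helper, the 20 tokens per query is an illustrative figure, only LLM input tokens are counted, and a cache hit is assumed to skip both the embedding call and the LLM call.

```python
def monthly_cost(
    queries: int,
    embed_tokens_per_query: int,
    llm_tokens_per_query: int,
    embed_price_per_million: float,
    llm_price_per_million: float,
    vector_db_fee: float,
    cache_hit_rate: float = 0.0,
) -> dict[str, float]:
    """Estimate monthly spend in dollars; prices are per 1M tokens."""
    billable = queries * (1.0 - cache_hit_rate)  # cached queries skip embedding and generation
    embeddings = billable * embed_tokens_per_query * embed_price_per_million / 1_000_000
    llm = billable * llm_tokens_per_query * llm_price_per_million / 1_000_000
    return {
        "embeddings": embeddings,
        "vector_db": vector_db_fee,
        "llm": llm,
        "total": embeddings + vector_db_fee + llm,
    }

baseline = monthly_cost(1_000_000, 20, 1_000, 0.13, 3.0, 99.0)
with_cache = monthly_cost(1_000_000, 20, 1_000, 0.13, 3.0, 99.0, cache_hit_rate=0.3)
```

With these inputs the LLM term dominates in both scenarios, which is why the cache hit rate moves the total far more than any change to the embedding model.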
Frequently Asked Questions
Why is hybrid search better than pure vector search?
Vector search captures semantic meaning but misses exact-match cues like product codes, names, and rare terms. BM25 catches those. Combining the two with reciprocal rank fusion or a learned reranker can give 30-50% better recall on production queries, though the gain depends on your corpus and query mix.
Do I need a reranker for RAG?
For high-stakes retrieval, yes. Rerankers like Cohere or BGE rerank the top 50 candidates and push the genuinely relevant chunks to the top. They add latency and cost, but improve answer quality more than tweaking embeddings.
Which embedding model should I choose?
OpenAI text-embedding-3-large is a safe default. For open-source, BGE-large-en-v1.5 or NV-Embed are competitive. For non-English, consider multilingual-e5. Always evaluate on your own data, not on the leaderboard.
How do I evaluate a RAG system?
Build a golden set of 50-200 question-answer pairs from real users. Measure retrieval recall@k (did the right chunk appear?) and answer quality with an LLM judge. Track both as you change embeddings, chunking, and prompts.
What’s the biggest failure mode in RAG?
Bad chunking. Cutting documents at the wrong boundaries splits the right answer across several chunks, none of which retrieves well on its own. Spend more time on chunking strategy than on swapping embedding models.