Production RAG Systems with Hybrid Search

Retrieval-Augmented Generation (RAG) has become the standard architecture for LLM applications that need accurate, up-to-date information. However, naive RAG implementations often fail in production. Here’s how to build systems that actually work.

Why Basic RAG Fails

Simple vector similarity search has critical limitations:

Poor performance on exact matches (product codes, names)
Struggles with low-frequency terms and acronyms
No understanding of document structure or metadata
Sensitive to query phrasing variations

Hybrid search solves these issues by combining dense (semantic) and sparse (keyword) retrieval.
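To make the exact-match point concrete, here is a minimal BM25 scorer in plain Python. It is a sketch under simple assumptions: whitespace tokenization, the commonly used parameters k1 = 1.5 and b = 0.75, and a made-up three-document corpus with a hypothetical product code. The function names are illustrative and do not come from any library.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer; real systems use a proper analyzer.
    return text.lower().split()

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical corpus: only the first document contains the product code.
docs = [
    "Replacement filter for model XJ-4420 vacuum",
    "How to choose a vacuum cleaner filter",
    "Warranty terms for kitchen appliances",
]
scores = bm25_scores("XJ-4420 filter", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 0
```

The rare token "xj-4420" appears in a single document, so its IDF weight is high and that document wins by a clear margin. An embedding model has no such guarantee for an identifier it has never seen.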

Architecture Overview

User Query
    ↓
Query Enhancement (expansion, rewriting)
    ↓
Parallel Retrieval
    ├→ Dense Search (embeddings)
    └→ Sparse Search (BM25)
    ↓
Result Fusion (RRF)
    ↓
Reranking (cross-encoder)
    ↓
Context Construction
    ↓
LLM Generation
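The Result Fusion step can be illustrated with a small stand-alone version of Reciprocal Rank Fusion. Each document receives 1 / (k + rank) from every list it appears in, and the contributions are summed. This is a sketch for intuition, not the code a vector database runs internally; k = 60 is a commonly used constant and is an assumption here, as are the document ids.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # hypothetical sparse ranking
fused = reciprocal_rank_fusion([dense, sparse])
print([doc_id for doc_id, _ in fused])  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because only ranks are used, the raw dense and sparse scores never need to be put on a common scale, which is the main practical appeal of RRF.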

Vector Database Selection

| Database | Best For | Hybrid Search |
|---|---|---|
| Qdrant | High performance, Rust-based | Native |
| Weaviate | Rich features, GraphQL API | Native |
| Typesense | Typo tolerance, faceting | Excellent |
| Milvus | Massive scale (>1B vectors) | Via plugin |
| pgvector | PostgreSQL integration | Manual |

For most applications, Qdrant or Weaviate provide the best balance of features and performance.

Implementing Dense Search

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

EMBEDDING_DIM = 1024  # must match the `dimensions` requested from the embedding API

# Create collection with a named dense vector
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(
            size=EMBEDDING_DIM,
            distance=Distance.COSINE
        )
    }
)

# Index documents
from openai import OpenAI
openai = OpenAI()

def embed_text(text: str) -> list[float]:
    # text-embedding-3-large returns 3072 dimensions by default;
    # `dimensions` shortens the vector to match the collection size.
    response = openai.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=EMBEDDING_DIM
    )
    return response.data[0].embedding

# Batch insert; `documents` is a list of dicts with at least a "content" key
points = [
    PointStruct(
        id=doc_id,
        vector={"dense": embed_text(doc["content"])},
        payload=doc
    )
    for doc_id, doc in enumerate(documents)
]

client.upsert(collection_name="documents", points=points)
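The indexing loop above makes one embedding request per document. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so a batching helper along these lines can cut the number of round trips. This is a sketch: `batched`, `embed_in_batches`, and the batch size of 64 are assumptions, and the embedder is passed in as a callable so the example runs offline with a stand-in.

```python
from typing import Callable, Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in embedder so the sketch runs without network access.
def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the real client, the callable could wrap `openai.embeddings.create(model=..., input=batch)` and return `[item.embedding for item in response.data]`.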

Adding Sparse Search

import torch
from qdrant_client.models import SparseVector, SparseVectorParams

# Configure dense and sparse vectors together. This call replaces the
# dense-only create_collection above; a collection name can only be created once.
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(size=1024, distance=Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()
    }
)

# Use SPLADE or BM25 for sparse encoding
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")
model = AutoModelForMaskedLM.from_pretrained("naver/splade-cocondenser-ensembledistil")

def sparse_encode(text: str) -> SparseVector:
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**tokens)
    # SPLADE weights: log-saturated ReLU of the logits, max-pooled over the
    # sequence dimension, giving one weight per vocabulary entry.
    vec = torch.max(torch.log1p(torch.relu(output.logits)), dim=1).values.squeeze(0)
    indices = vec.nonzero().flatten()
    return SparseVector(indices=indices.tolist(), values=vec[indices].tolist())
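When a learned encoder such as SPLADE is too heavy, a keyword-style sparse vector can be built from plain term counts. The sketch below is a swapped-in simplification, not SPLADE and not full BM25 (it has no IDF or length normalization). It hashes each token to an index with CRC32 and uses the term frequency as the value. The function name and the 2**20 index space are assumptions.

```python
import zlib
from collections import Counter

def tf_sparse_encode(text: str, dim: int = 2**20) -> tuple[list[int], list[float]]:
    """Return (indices, values) for a hashed term-frequency sparse vector."""
    counts: Counter = Counter()
    for token in text.lower().split():
        # CRC32 is stable across processes, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % dim
        counts[index] += 1.0
    indices = sorted(counts)  # unique, ascending indices
    values = [counts[i] for i in indices]
    return indices, values

indices, values = tf_sparse_encode("sku-991 filter filter")
```

The two lists can be passed to `SparseVector(indices=indices, values=values)`. Hash collisions merge unrelated tokens into one index, which is the price of skipping a vocabulary.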

Hybrid Search Query

from qdrant_client.models import Fusion, FusionQuery, Prefetch

def hybrid_search(query: str, limit: int = 10) -> list[dict]:
    dense_vec = embed_text(query)
    sparse_vec = sparse_encode(query)

    results = client.query_points(
        collection_name="documents",
        prefetch=[
            # Dense search
            Prefetch(
                query=dense_vec,
                using="dense",
                limit=20  # over-fetch
            ),
            # Sparse search
            Prefetch(
                query=sparse_vec,
                using="sparse",
                limit=20
            )
        ],
        query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
        limit=limit
    )

    # query_points returns a response object; hand back the stored payloads
    # so downstream code can read doc["content"] directly.
    return [point.payload for point in results.points]

Reranking for Precision

After retrieval, rerank with a cross-encoder for maximum relevance:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list, top_k: int = 5):
    pairs = [[query, doc["content"]] for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

# Full pipeline
results = hybrid_search(query, limit=20)
final_docs = rerank(query, results, top_k=5)

Query Enhancement Techniques

1. Query Expansion:

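import re

def expand_query(query: str) -> str:
    # `llm` stands for any client exposing generate(prompt) -> str
    prompt = f"""Generate 2 alternative phrasings of this query:

Query: {query}

Alternative phrasings:
1."""

    response = llm.generate(prompt)
    # One phrasing per line; strip list numbering such as "2."
    expansions = [
        re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        for line in response.splitlines()
        if line.strip()
    ]
    return query + " " + " ".join(expansions)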

2. Query Decomposition:

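import re

def decompose_query(complex_query: str) -> list[str]:
    """Break complex queries into sub-queries"""
    prompt = f"""Break this complex question into 2-3 simpler sub-questions:

Question: {complex_query}

Sub-questions:
1."""

    lines = llm.generate(prompt).strip().split("\n")
    # Drop blank lines and list numbering such as "2."
    sub_queries = [
        re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        for line in lines
        if line.strip()
    ]
    return sub_queries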

# Retrieve for each sub-query and combine
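One way to retrieve for each sub-query and combine the results is sketched below. It is an illustration under assumptions: `retrieve_for_subqueries` is a hypothetical helper, every document dict is assumed to carry an "id" key, and the search function is injected so that a stand-in can replace the real hybrid search.

```python
from typing import Callable

def retrieve_for_subqueries(
    sub_queries: list[str],
    search_fn: Callable[[str, int], list[dict]],
    per_query_limit: int = 5,
    total_limit: int = 10,
) -> list[dict]:
    """Run each sub-query, deduplicate by id, and order by best rank seen."""
    best_rank: dict = {}   # doc id -> best (lowest) rank across sub-queries
    docs_by_id: dict = {}  # doc id -> first copy of the document
    for sub_query in sub_queries:
        for rank, doc in enumerate(search_fn(sub_query, per_query_limit), start=1):
            doc_id = doc["id"]
            docs_by_id.setdefault(doc_id, doc)
            best_rank[doc_id] = min(rank, best_rank.get(doc_id, rank))
    # sorted() is stable, so ties keep first-seen order.
    ordered = sorted(best_rank, key=best_rank.get)
    return [docs_by_id[doc_id] for doc_id in ordered[:total_limit]]

# Stand-in search results for two hypothetical sub-queries.
fake_index = {
    "q1": [{"id": 1}, {"id": 2}],
    "q2": [{"id": 3}, {"id": 1}],
}

def fake_search(query: str, limit: int) -> list[dict]:
    return fake_index[query][:limit]

merged = retrieve_for_subqueries(["q1", "q2"], fake_search)
print([doc["id"] for doc in merged])  # [1, 3, 2]
```

Ordering by best rank keeps the top hit of every sub-query near the front, so no single sub-question dominates the final context.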

Chunking Strategies

Chunk size dramatically impacts retrieval quality:

| Strategy | Chunk Size | Use Case |
|---|---|---|
| Fixed | 512 tokens | Simple, fast |
| Sentence | Variable | Preserves meaning |
| Semantic | Variable | Topic coherence |
| Recursive | Hierarchical | Long documents |

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,  # maintain context
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document)
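To see what the overlap setting does, here is a small fixed-size chunker written from scratch. It is a sketch, not the LangChain implementation: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for tokens.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
print(chunks)
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.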


Metadata Filtering

Combine vector search with metadata filters for precise results:

from qdrant_client.models import DatetimeRange, FieldCondition, Filter, MatchValue

# query_vector is a dense embedding of the query, e.g. embed_text(query)
results = client.query_points(
    collection_name="documents",
    query=query_vector,
    using="dense",  # the collection uses named vectors
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="technical")
            ),
            FieldCondition(
                key="date",
                range=DatetimeRange(gte="2025-01-01T00:00:00Z")
            )
        ]
    ),
    limit=10
)

Context Construction

Optimize how you present retrieved chunks to the LLM:

def build_context(query: str, docs: list) -> str:
    context_parts = []

    for i, doc in enumerate(docs, 1):
        # Include metadata for provenance
        context_parts.append(f"""
Document {i} (Source: {doc["source"]}, Date: {doc["date"]}):
{doc["content"]}
---
""")

    return "\n".join(context_parts)

# final_docs is the reranked list from the pipeline above
context = build_context(query, final_docs)

prompt = f"""Use the following documents to answer the question.
If the answer is not in the documents, say so.

{context}

Question: {query}
Answer:"""
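Retrieved chunks also have to fit the model's context window. A simple guard is to keep documents in ranked order until a size budget is used up. The sketch below measures the budget in characters as a rough proxy for tokens, which is an assumption, and `fit_to_budget` is a hypothetical helper.

```python
def fit_to_budget(docs: list[dict], max_chars: int) -> list[dict]:
    """Keep documents in ranked order until the character budget is exhausted."""
    selected = []
    used = 0
    for doc in docs:
        size = len(doc["content"])
        if used + size > max_chars:
            break  # stop at the first document that does not fit
        selected.append(doc)
        used += size
    return selected

ranked = [
    {"content": "x" * 5},
    {"content": "y" * 10},
    {"content": "z" * 3},
]
kept = fit_to_budget(ranked, max_chars=16)
print(len(kept))  # 2
```

Stopping at the first overflow, instead of skipping ahead to smaller documents, keeps the context aligned with the reranker's ordering.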

Caching for Performance

import hashlib
import json

import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def cached_search(query: str, ttl: int = 3600):
    # Cache key from query hash
    cache_key = f"rag:{hashlib.sha256(query.encode()).hexdigest()}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Perform search
    results = hybrid_search(query)

    # Cache results
    redis_client.setex(cache_key, ttl, json.dumps(results))

    return results
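Hashing the raw query means that "What is RAG?" and " what is  rag? " land in different cache entries. Normalizing before hashing can raise the hit rate. The following is a sketch with assumed helper names. Lowercasing is a judgment call and may be wrong for case-sensitive identifiers such as product codes.

```python
import hashlib
import re

def normalize_query(query: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", query.strip().lower())

def cache_key(query: str, prefix: str = "rag") -> str:
    """Build a cache key from the normalized query."""
    digest = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"

same = cache_key("  What is  RAG? ") == cache_key("what is rag?")
print(same)  # True
```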

Monitoring and Evaluation

Track key metrics in production:

Retrieval metrics: Recall@k, MRR, nDCG
Generation metrics: Faithfulness, answer relevance
System metrics: Latency p95, cache hit rate
User metrics: Thumbs up/down, follow-up questions
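The retrieval metrics in the list above can be computed per query with a few lines of Python. This sketch assumes binary relevance (a document is relevant or it is not) and a retrieved list without duplicates. The function names and document ids are illustrative. MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```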

Cost Optimization

# Costs for 1M queries/month

# Embeddings (assuming ~20 tokens per query)
1M queries × 20 tokens × $0.13/1M tokens = $2.60

# Vector DB (Qdrant Cloud)
Standard tier: $99/month

# LLM (Claude 3.5 Sonnet, input tokens only)
1M queries × 1K tokens × $3/1M tokens = $3,000

Total: ~$3,100/month

The LLM is by far the largest cost. Optimize by caching, using smaller models when possible, and efficient prompt engineering.
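The arithmetic above is easy to turn into a small calculator for testing different assumptions. In this sketch the 20 embedding tokens per query and the cache hit rate are assumptions, the prices are the ones quoted in this section, and a cache hit is assumed to skip the LLM call entirely (a response cache, which is a separate layer from the retrieval cache shown earlier).

```python
def monthly_cost(
    queries: int,
    embed_tokens_per_query: int,
    embed_price_per_m: float,
    llm_tokens_per_query: int,
    llm_price_per_m: float,
    vector_db_flat: float,
    cache_hit_rate: float = 0.0,
) -> dict[str, float]:
    """Estimate monthly cost in dollars; prices are per one million tokens."""
    embeddings = queries * embed_tokens_per_query * embed_price_per_m / 1_000_000
    llm = queries * llm_tokens_per_query * llm_price_per_m / 1_000_000
    llm *= 1.0 - cache_hit_rate  # cached answers skip the LLM call
    return {
        "embeddings": embeddings,
        "vector_db": vector_db_flat,
        "llm": llm,
        "total": embeddings + vector_db_flat + llm,
    }

baseline = monthly_cost(1_000_000, 20, 0.13, 1000, 3.0, 99.0)
with_cache = monthly_cost(1_000_000, 20, 0.13, 1000, 3.0, 99.0, cache_hit_rate=0.3)
print(round(baseline["total"]), round(with_cache["total"]))  # 3102 2202
```

Under these assumptions a 30% hit rate removes about $900 a month, far more than any saving available on the embedding or vector database lines.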

Frequently Asked Questions

Why is hybrid search better than pure vector search?

Vector search captures semantic meaning but misses exact-match cues like product codes, names, and rare terms. BM25 catches those. Combining the two with reciprocal rank fusion, optionally followed by a learned reranker, can give 30-50% better recall on production queries, though the gain depends heavily on how many of your queries contain exact-match terms.

Do I need a reranker for RAG?

For high-stakes retrieval, yes. Rerankers such as Cohere Rerank or the BGE rerankers re-score the top candidates (for example, the top 50) and push the genuinely relevant chunks to the top. They add latency and cost, but they usually improve answer quality more than tweaking embeddings does.

Which embedding model should I choose?

OpenAI text-embedding-3-large is a safe default. For open-source, BGE-large-en-v1.5 or NV-Embed are competitive. For non-English, consider multilingual-e5. Always evaluate on your own data, not on the leaderboard.

How do I evaluate a RAG system?

Build a golden set of 50-200 question-answer pairs from real users. Measure retrieval recall@k (did the right chunk appear?) and answer quality with an LLM judge. Track both as you change embeddings, chunking, and prompts.

What’s the biggest failure mode in RAG?

Bad chunking. Cutting documents at the wrong boundaries splits the right answer across several chunks, none of which retrieves well on its own. Spend more time on chunking strategy than on swapping embedding models.