In a previous guide, we built a basic PDF question answering system in thirty minutes. Many developers noticed that while it handles straightforward queries well, it struggles when encountering specialized terminology, lengthy corporate documents, or phrasing that deviates from the source text. The system can pull the wrong data, miss critical context, or completely fail to surface relevant answers.
This behavior is not a flaw in your large language model. It is the natural limitation of using a single retrieval strategy. Whether you rely solely on dense retrieval using semantic vectors or sparse retrieval using keyword matching, both methods have distinct blind spots. Dense retrieval handles abstract meaning well but misses exact technical terms. Sparse retrieval nails specific keywords but fails if the user words the question differently.
This guide resolves that core pain point by upgrading your RAG architecture with advanced hybrid search combining sparse search, dense search, and cross encoder reranking to maximize precision. We will walk through the concepts and provide production ready code so your pipeline can simultaneously grasp abstract intent and lock onto exact keywords.
Why Single Retrieval Strategies Fall Short
Relying on one retrieval method leaves clear gaps in your data architecture.
Dense retrieval converts text into vector embeddings to measure semantic similarity. Think of it as searching for a document by its overall theme rather than its exact wording. This approach shines when users ask conversational questions using synonyms. If your document states that the company offers five days of annual leave and a user asks about vacation time, dense retrieval easily bridges that gap. However, dense retrieval struggles with rare technical jargon, specific product codes, or medical terminology. These unique terms often get diluted during vector encoding. It is also highly vulnerable to typos or text abbreviations, and its accuracy drops if key information sits right on the edge of a text chunk boundary.
Conversely, sparse retrieval tools like the BM25 algorithm evaluate keyword frequency and document statistics. This is like searching for information by matching exact words in a book index. It is incredibly stable and highly sensitive to unique terms. Even if a complex legal term appears only once in your document, sparse retrieval will find it instantly. It executes rapidly because it bypasses heavy vector computations. Yet, sparse retrieval lacks semantic awareness. If a document discusses compensation packages but the user asks about salary details, sparse retrieval will flag it as irrelevant because the exact word does not match.
| Dimension | Dense Retrieval (Vector Search) | Sparse Retrieval (BM25) |
|---|---|---|
| Core Logic | Semantic similarity matching | Keyword frequency and weight matching |
| Key Strengths | Grasps context and handles synonyms | Locks onto specific terms and offers high speed |
| Key Weaknesses | Misses rare technical terms; relies on text quality | Lacks semantic understanding; fails with altered phrasing |
| Best Use Case | Conversational queries and conceptual matching | Exact product codes and specialized jargon |
| Standard Tools | FAISS, Milvus, BGE Embedding Models | BM25, Elasticsearch |
Combining these two methodologies creates a hybrid search pipeline that blends semantic comprehension with keyword precision.
Mechanics of Contemporary Hybrid Search and Reranking
Modern hybrid retrieval goes beyond simply stitching two lists of results together. It runs a parallel execution pipeline followed by score normalization and a secondary cross encoder validation step.
The process begins with text preprocessing. Both the user query and the document chunks undergo tokenization and stop word removal to strip out meaningless terms. This step ensures that both retrieval engines analyze clean, standardized text inputs.
Next, the system executes parallel dual engine retrieval. The query is sent simultaneously to the dense vector engine to fetch the top semantic matches and to the BM25 engine to pull the top keyword matches. Running these searches concurrently preserves low latency.
The critical phase is result fusion and normalization. Because vector similarity scores and BM25 scores use completely different scales, we must normalize them before combining them. A common production strategy uses min max scaling to bring both sets of scores into a zero to one range. We then apply relative weights, such as sixty percent emphasis on dense scores and forty percent on sparse scores, to calculate a unified hybrid score.
The final layer introduces a cross encoder reranking model. While the initial hybrid search is fast and acts as an effective filter, it evaluates queries and documents independently. A reranker inspects the query and the retrieved text chunks together, analyzing deep contextual relationships to output an ultra precise relevance score. The system then sorts the chunks based on this final score and passes the most relevant pieces to the language model.
Production Implementation: Code Walkthrough
We will now rebuild the RAG retrieval module. We will integrate BM25 sparse search, implement normalized hybrid fusion, and add a cross encoder reranker. The code uses modern standards, including the official OpenAI client package configured for DeepSeek models to ensure optimal compatibility.
First, update your environment with the required packages.
pip install rank_bm25 jieba faiss-cpu sentence-transformers openai pypdf
Below is the complete implementation of the upgraded pipeline.
import os
import jieba
import numpy as np
import faiss
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
from openai import OpenAI
# Global pipeline configurations
PDF_PATH = "your_document.pdf"
DENSE_WEIGHT = 0.6
SPARSE_WEIGHT = 0.4
TOP_K_RETRIEVAL = 10
TOP_K_FINAL = 4
def preprocess_chinese_text(text):
"""Tokenizes text and removes common stop words for sparse indexing."""
stop_words = {"的", "了", "是", "我", "你", "他", "在", "有", "就", "和", "也", "都", "以", "而"}
tokens = jieba.lcut(text.strip().lower())
return [token for token in tokens if token not in stop_words and token.strip()]
def extract_and_chunk_pdf(pdf_path):
"""Extracts raw text from PDF and splits it into manageable chunks."""
reader = PdfReader(pdf_path)
raw_text = ""
for page in reader.pages:
raw_text += page.extract_text() or ""
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
length_function=len
)
return [chunk for chunk in splitter.split_text(raw_text) if chunk.strip()]
class AdvancedHybridRetriever:
"""Manages dual engine search and cross encoder reranking."""
def __init__(self, corpus):
self.corpus = corpus
# Initialize modern open source embedding and reranking models
self.dense_model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
self.reranker = CrossEncoder("BAAI/bge-reranker-large")
# Build the dense index using normalized inner product for cosine similarity
embeddings = self.dense_model.encode(corpus, show_progress_bar=False)
dimension = embeddings.shape[1]
self.faiss_index = faiss.IndexFlatIP(dimension)
faiss.normalize_L2(embeddings)
self.faiss_index.add(embeddings)
# Build the sparse index using BM25
tokenized_corpus = [preprocess_chinese_text(doc) for doc in corpus]
self.bm25 = BM25Okapi(tokenized_corpus)
def retrieve(self, query):
# 1. Dense Semantic Search
query_vector = self.dense_model.encode([query])
faiss.normalize_L2(query_vector)
dense_scores, dense_indices = self.faiss_index.search(query_vector, TOP_K_RETRIEVAL)
dense_results = {idx: float(score) for idx, score in zip(dense_indices[0], dense_scores[0]) if idx != -1}
# 2. Sparse Keyword Search
tokenized_query = preprocess_chinese_text(query)
sparse_scores = self.bm25.get_scores(tokenized_query)
# Normalize sparse scores using min max logic to match dense distribution
max_sparse = np.max(sparse_scores) if np.max(sparse_scores) > 0 else 1.0
min_sparse = np.min(sparse_scores)
sparse_range = max_sparse - min_sparse if max_sparse != min_sparse else 1.0
# 3. Score Fusion
combined_scores = {}
candidate_indices = set(dense_results.keys()).union(set(np.argsort(sparse_scores)[-TOP_K_RETRIEVAL:]))
for idx in candidate_indices:
d_score = dense_results.get(idx, 0.0)
s_score = (sparse_scores[idx] - min_sparse) / sparse_range if max_sparse > 0 else 0.0
combined_scores[idx] = (d_score * DENSE_WEIGHT) + (s_score * SPARSE_WEIGHT)
# Extract top candidates based on hybrid score
sorted_candidates = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:TOP_K_RETRIEVAL]
retrieved_chunks = [self.corpus[idx] for idx, _ in sorted_candidates]
# 4. Cross Encoder Reranking
reranker_pairs = [[query, chunk] for chunk in retrieved_chunks]
rerank_scores = self.reranker.predict(reranker_pairs)
# Sort candidates by final contextual relevance scores
final_rankings = np.argsort(rerank_scores)[::-1][:TOP_K_FINAL]
return [retrieved_chunks[i] for i in final_rankings]
def ask_llm(query, context_chunks):
"""Sends the context and query to DeepSeek via standard OpenAI client integration."""
if not context_chunks:
return "No relevant documentation found."
client = OpenAI(
api_key=os.getenv("DEEPSEEK_API_KEY", "your_api_key_here"),
base_url="https://api.deepseek.com/v1"
)
context_text = "\n\n".join(context_chunks)
prompt = f"Using the following documentation, answer the question accurately.\n\nContext:\n{context_text}\n\nQuestion: {query}"
response = client.chat.completions.create(
model="deepseek-chat",
messages=[{"role": "user", "content": prompt}],
temperature=0.1
)
return response.choices[0].message.content
if __name__ == "__main__":
print("Extracting text and chunking document...")
document_chunks = extract_and_chunk_pdf(PDF_PATH)
print(f"Created {len(document_chunks)} discrete text units.")
print("Initializing hybrid retrieval engines and reranker...")
search_engine = AdvancedHybridRetriever(document_chunks)
print("Pipeline ready for queries.")
user_query = "What are the specific penalty terms for contract violations?"
matched_contexts = search_engine.retrieve(user_query)
final_answer = ask_llm(user_query, matched_contexts)
print(f"\nQuery: {user_query}")
print(f"Answer:\n{final_answer}")
Optimization Strategies for Advanced Deployments
Dynamic Parameter Tuning
Instead of keeping your dense and sparse weights static, you can adjust them based on the incoming query structure. If the query contains highly specific technical identifiers, alphanumeric product codes, or exact serial numbers, programmatic detection can shift the sparse weight higher. If the user query is long and conversational, the system can favor the dense retrieval engine.
Advanced Semantic Chunking
Fixed size token chunking often breaks paragraphs mid sentence, destroying the semantic unity of the text. Upgrading to semantic chunking allows your system to monitor embedding drift between sentences. The document breaks only when a meaningful shift in topic occurs, ensuring that the dense vector representations remain pure and contextually complete.
Cross Encoder Reranking Scaling
While cross encoders provide exceptional accuracy, they introduce higher computational latency compared to bi encoders. To maintain high performance in production environments, implement a two tier architecture. Use the hybrid retrieval layer to filter thousands of documents down to a small candidate pool of perhaps fifteen or twenty blocks, then pass only that small pool to the reranker.
Resolving Common Deployment Challenges
Empty Keyword Tokens
When a query consists purely of common stop words, the tokenization process can return an empty list, forcing the sparse engine to output flat zeros. To resolve this, implement a fallback mechanism that bypasses the sparse scoring stage completely if no valid keywords remain after filtering.
Latency Bottlenecks
If your database scales to millions of chunks, running unindexed keyword searches alongside dense calculations will slow down response times. Ensure that your sparse index leverages an optimized inverted index framework like Elasticsearch or Milvus sparse vectors, and limit the initial retrieval count to a tight window.
Mismatched Reranker Scales
Using a reranker model trained on general web text might downgrade the relevance of highly specialized internal corporate documentation. Always verify that your cross encoder matches the primary language and domain of your dataset, or consider fine tuning a lightweight open source model on your specific documentation history.
Modern RAG success relies heavily on high quality data retrieval. Moving to a hybrid architecture backed by dual engines and cross encoder validation removes the structural blind spots of single strategy systems. By implementing these patterns, you bridge the gap between abstract understanding and exact keyword tracking, giving your enterprise application production level precision.