Retriever Architecture
Now let’s talk about one of the most important parts of any production RAG system:
Retriever Architecture
If embeddings are the language…
and vector databases are the storage…
then the retriever is the brain.
Because this is the part that decides:
What information should the model actually read?
And that decision changes everything.
Retrieval Is Usually Not One Step
Many beginners imagine retrieval like this:
user asks a question → system finds the answer
But in real-world RAG systems…
retrieval is usually much smarter.
It typically happens in:
Two Stages
And this dual-stage design is what gives us both:
speed and accuracy
Because production systems need both.
Fast systems without accuracy are useless.
Accurate systems without speed are unusable.
So we combine both.
Let’s break it down.
Stage 1 — Fast Recall
The first stage is all about:
Speed
Its only job is to quickly fetch a large pool of potentially relevant chunks.
This is called:
The Recall Stage
At this point, we do not need perfect ranking.
We just need one thing:
Don’t miss useful chunks
That is the goal.
How This Stage Works
This stage usually uses something called a:
Bi-Encoder Retriever
Here’s the idea:
- the user query is converted into an embedding
- all document chunks already have embeddings
Both are encoded independently.
Then the system performs:
ANN (Approximate Nearest Neighbor) Search
to find the nearest vectors.
Because the chunk embeddings are computed ahead of time, and the index does not compare the query against every vector, this is extremely fast.
This is exactly where vector databases like:
- Pinecone
- Milvus
- Qdrant
come into play.
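Here is a minimal sketch of the recall stage, assuming the embeddings already exist. The three-dimensional vectors and chunk ids are made-up toy values, and the exhaustive scan stands in for the ANN index a vector database would use. It returns the same kind of result, just without the speed-up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_stage(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks whose embeddings are closest to the query."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy, hand-written embeddings (a real system would get these from an embedding model).
chunk_vecs = {
    "chunk_ann": [0.9, 0.1, 0.0],
    "chunk_hr": [0.0, 0.2, 0.9],
    "chunk_hnsw": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

candidates = recall_stage(query_vec, chunk_vecs, k=2)
```

Notice that the query and the chunks never see each other during encoding. They only meet in the similarity calculation, which is what makes the bi-encoder design cheap.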
What Gets Retrieved?
Typically, this stage retrieves something like:
- top 50 chunks
- top 100 chunks
- sometimes even top 200 chunks
Why so many?
Because the idea is simple:
retrieve a superset of potentially useful information
At this stage, it is okay if some irrelevant chunks sneak in.
Because speed matters most.
Precision comes later.
Stage 2 — Accurate Reranking
Now Accuracy Takes Over
Once we have the candidate chunks…
the next question becomes:
Which ones are actually the best?
This is where the second stage begins.
And this stage is all about:
Accuracy
The Role of the Reranker
This stage uses something called a:
Reranker
Its job is to take the top candidate chunks and rank them properly.
Because not every retrieved chunk is equally useful.
Some are much better than others.
We need the best ones.
The most common reranker is a:
Cross-Encoder
This scores relevance more accurately than a bi-encoder, at a higher compute cost.
Bi-Encoder vs Cross-Encoder
Bi-Encoder
Query and document are encoded separately
→ fast
Cross-Encoder
Query and document are processed together
→ slower, but much more accurate
Because now the model can deeply understand:
the actual relationship between the query and the chunk
Not just vector similarity.
But real contextual relevance.
That makes reranking much stronger.
Why We Don’t Use Cross-Encoder Everywhere
Because it is slower.
Much slower.
Running it across millions of documents would be too expensive.
So we only use it on:
the candidates returned by Stage 1
That is dozens or a few hundred chunks, not the whole corpus.
That gives us:
fast recall + accurate ranking
And that is exactly what production RAG systems need.
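The two stages can be sketched as one small pipeline. This is an illustration only: both scoring functions are hypothetical word-overlap stand-ins, where a real system would use embedding similarity for Stage 1 and a cross-encoder model for Stage 2.

```python
def tokens(text):
    return set(text.lower().replace("?", "").split())

def cheap_score(query, chunk):
    """Stage 1 stand-in: count shared words (a real system would compare embeddings)."""
    return len(tokens(query) & tokens(chunk))

def careful_score(query, chunk):
    """Stage 2 stand-in: Jaccard overlap (a real system would call a cross-encoder)."""
    q, c = tokens(query), tokens(chunk)
    return len(q & c) / len(q | c)

def retrieve_then_rerank(query, chunks, recall_k, final_k):
    # Stage 1: score every chunk cheaply and keep a generous candidate pool.
    pool = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:recall_k]
    # Stage 2: run the expensive scorer on the pool only, then keep the best few.
    return sorted(pool, key=lambda c: careful_score(query, c), reverse=True)[:final_k]

chunks = [
    "ANN search finds nearest vectors quickly",
    "how ANN search works in a vector database with HNSW graphs",
    "leave policy for employees",
    "how ANN search works",
]
best = retrieve_then_rerank("how does ANN search work", chunks, recall_k=3, final_k=2)
```

The expensive scorer runs on recall_k chunks, never on the full collection. That is the whole point of the design.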
Multi-Query Retrieval
Now some advanced systems go one step further.
They use something called:
Multi-Query Retrieval
This is incredibly useful for complex questions.
A Real Example
Suppose the user asks:
“How does RAG improve accuracy and reduce hallucinations?”
This is not really one question.
It contains multiple ideas.
For example:
- how RAG improves accuracy
- how RAG reduces hallucination
- how retrieval quality affects final answers
If we search using only one query, we may miss important information.
So instead…
the system breaks it into multiple sub-queries.
Each one retrieves different chunks.
That improves:
- retrieval coverage
- context diversity
- final answer quality
Especially in enterprise search systems.
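A minimal sketch of the merge step, assuming the sub-queries have already been produced (in practice an LLM usually writes them). The lookup table and chunk ids are hypothetical stand-ins for a real vector search call.

```python
def multi_query_retrieve(sub_queries, retrieve, k):
    """Run each sub-query and merge the results, dropping duplicates but keeping order."""
    merged, seen = [], set()
    for sub_query in sub_queries:
        for chunk_id in retrieve(sub_query, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged

# Hypothetical retriever: a lookup table standing in for a vector search call.
FAKE_INDEX = {
    "how RAG improves accuracy": ["c1", "c2"],
    "how RAG reduces hallucination": ["c2", "c3"],
    "how retrieval quality affects final answers": ["c4", "c1"],
}

def fake_retrieve(query, k):
    return FAKE_INDEX.get(query, [])[:k]

chunks = multi_query_retrieve(list(FAKE_INDEX), fake_retrieve, k=2)
```

Deduplication matters here because different sub-queries often pull back the same chunk, and sending it twice only wastes context.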
Top-K Retrieval & Thresholding
Now let’s talk about a very important retrieval question:
How many chunks should we actually retrieve?
Because once the vector database finds similar embeddings…
the job is still not finished.
We still need to decide:
How much context should be sent to the LLM?
And this decision matters a lot.
Too little context…
and the model may miss important information.
Too much context…
and the prompt becomes noisy and expensive, and the answer less accurate.
This is where two important strategies come in:
- Top-K Retrieval
- Threshold-Based Retrieval
These are small decisions…
but they have a huge impact on answer quality.
What Is Top-K Retrieval?
This simply means:
Return the top K most similar chunks
That’s it.
Very simple.
Very common.
A Few Examples
For example:
- Top 5
- Top 10
- Top 20
If we set:
k=10
the system returns the:
10 nearest neighbors
based on similarity score.
These are the chunks considered most semantically relevant to the user’s query.
This is the default approach in most RAG systems.
Because it is:
- simple
- predictable
- easy to tune
Let’s say a user asks:
“What is semantic chunking?”
The system may retrieve:
the top 10 most relevant chunks
related to chunking, retrieval, and semantic segmentation.
Those chunks are then added to the prompt.
And the LLM uses them to generate the final answer.
That is standard Top-K retrieval.
Why Choosing K Matters
Now here comes the important part.
Choosing the value of K is not random.
It should be based on real engineering decisions.
And usually, two major factors matter most.
Factor 1 — LLM Context Budget
The first factor is:
Context Window Size
Every LLM has a limit on how much text it can process in a single prompt.
This is called the:
context window
If you retrieve too many chunks…
you may:
- exceed the model’s limit
- increase token cost
- flood the prompt with irrelevant content
And surprisingly…
more context does not always mean better answers.
Sometimes too much context actually reduces quality.
Because the important signal gets buried inside noise.
That is why K must be chosen carefully.
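One simple way to respect the context budget is to add chunks in rank order until the budget runs out. The sketch below assumes the chunks are already ranked, and it counts whitespace-separated words as a rough proxy for tokens. A real tokenizer will give different numbers.

```python
def fit_to_budget(ranked_chunks, max_tokens):
    """Keep chunks in rank order until the next one would exceed the token budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # rough proxy; real tokenizers count differently
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

ranked = ["a b c", "d e", "f g h i", "j"]
kept = fit_to_budget(ranked, max_tokens=6)
```

This version stops at the first chunk that does not fit instead of skipping ahead to a smaller one. That keeps the selection strictly in rank order, which is a design choice, not a rule.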
Factor 2 — Query Complexity
The second factor is:
Query Complexity
Not every question needs the same amount of context.
A simple question may only need:
3 to 5 chunks
But a complex, multi-part question may need much more.
For example:
“Compare Pinecone, Weaviate, and Milvus for enterprise RAG systems”
This question touches:
- scalability
- cost
- deployment
- filtering
- infrastructure choices
That likely requires a larger retrieval set.
So sometimes:
bigger question = bigger K
And that makes perfect sense.
Threshold-Based Retrieval
Now let’s look at another strategy.
Instead of choosing a fixed K…
some systems use:
Similarity Thresholding
This works differently.
How Thresholding Works
Instead of saying:
Always return 10 chunks
we say:
Return every chunk above a certain similarity score
For example:
cosine similarity > 0.85
That means:
Only retrieve chunks that are truly relevant.
No unnecessary filler.
This creates a much more dynamic retrieval system.
Why This Is Powerful
Sometimes only:
3 chunks
are highly relevant.
And sometimes:
15 chunks
deserve to be included.
Thresholding adapts automatically.
That flexibility can improve retrieval quality significantly.
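Especially when similarity scores are reliable.
Keep in mind that score ranges differ between embedding models, so a cutoff like 0.85 has to be tuned for the model you use.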
This makes the system smarter than fixed Top-K alone.
A Simple Way to Remember It
Think of it like this:
Top-K says:
Always give me the best 10
Thresholding says:
Give me everything that is truly relevant
Both are valid.
Both are useful.
The best choice depends on your use case.
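The difference between the two strategies is easy to see in code. This is a small sketch with made-up similarity scores; the chunk ids and the 0.85 cutoff are only illustrative.

```python
def top_k(scored_chunks, k):
    """Return the k highest-scoring (chunk_id, score) pairs."""
    return sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:k]

def above_threshold(scored_chunks, threshold):
    """Return every pair whose score is strictly above the threshold, best first."""
    kept = [pair for pair in scored_chunks if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Made-up scores for illustration.
scored = [("c3", 0.79), ("c1", 0.91), ("c4", 0.62), ("c2", 0.88)]

fixed = top_k(scored, k=3)
dynamic = above_threshold(scored, threshold=0.85)
```

With the same scores, Top-K returns three chunks no matter what, while thresholding returns only the two that clear the bar. Many systems combine them: apply the threshold, then cap the result at K.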
Multi-Hop Retrieval
So far, most of our retrieval examples looked simple:
user asks a question → system retrieves relevant chunks → LLM generates the answer
Clean.
Fast.
Straightforward.
But in real-world systems…
not every question is that simple.
Some questions cannot be answered in a single retrieval step.
They require:
multiple connected retrievals
And this is where we enter a much more advanced concept:
Multi-Hop Retrieval
This is one of the most powerful ideas in advanced RAG systems.
Think of It Like Solving a Puzzle
Imagine solving a puzzle.
You don’t get the final answer from one piece.
You first find:
- piece one
which helps you locate:
- piece two
and only after connecting both…
you understand the full picture.
That is exactly how multi-hop retrieval works.
Instead of one retrieval…
the system performs retrieval in stages.
One answer leads to the next search.
That is why it is called:
multi-hop
Because the system “hops” from one piece of knowledge to another.
A Simple Example
Let’s take a question like:
“What is the population of the capital of Karnataka?”
At first glance, this looks like one question.
But actually…
it requires two steps.
First Hop
The system first needs to retrieve:
What is the capital of Karnataka?
The answer is:
Bengaluru
Second Hop
Now, using that answer, the system performs the next retrieval:
What is the population of Bengaluru?
The second query cannot even be written until the first hop returns.
Only after both steps are connected…
can the final answer be complete.
That is multi-hop retrieval.
How It Works Internally
So how does the system actually do this?
There are usually two major approaches.
Approach 1 — Iterative Vector DB Queries
The first method is:
Iterative Querying
This is the most intuitive approach.
How It Works
- Query the vector database once
- Use that result to generate the next query
- Query again
- Repeat until enough context is gathered
It works like following breadcrumbs.
Each answer leads to the next search.
This approach is simple and very effective.
Especially for structured reasoning tasks.
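The breadcrumb loop can be sketched in a few lines. Everything in the toy data is fictional, and the two helper functions are hypothetical stand-ins: toy_retrieve replaces a vector database query, and toy_next_query replaces whatever logic (rules or an LLM) turns a retrieved fact into the next search.

```python
def multi_hop(first_query, retrieve, next_query, max_hops=3):
    """Retrieve, derive the next query from what was found, and repeat."""
    facts, query = [], first_query
    for _ in range(max_hops):
        fact = retrieve(query)
        if fact is None:
            break  # nothing found, stop hopping
        facts.append(fact)
        query = next_query(fact)
        if query is None:
            break  # nothing left to ask
    return facts

# Fictional toy data standing in for a vector database.
TOY_STORE = {
    "owner of billing service": "Team Falcon",
    "lead of Team Falcon": "R. Mehta",
}

def toy_retrieve(query):
    return TOY_STORE.get(query)

def toy_next_query(fact):
    # Hypothetical rule: once we know a team, ask who leads it.
    return f"lead of {fact}" if fact.startswith("Team") else None

facts = multi_hop("owner of billing service", toy_retrieve, toy_next_query)
```

The max_hops limit is worth keeping in any real version. Without it, a loop like this can keep searching forever.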
Approach 2 — Chained LLM Calls
The second method is more advanced.
Here we use:
Chained LLM Calls
This often involves:
Chain-of-Thought Reasoning
Instead of manually deciding the next retrieval step…
the LLM helps plan it.
How This Works
- The first retrieved chunk is passed to the LLM
- The LLM decides what information is missing
- It generates the next retrieval query
- Retrieval happens again
- The cycle continues step by step
This is very common in:
Agentic RAG Systems
Here, the model actively decides its own retrieval path.
That makes the system far more powerful.
Especially for complex enterprise workflows.
Why Multi-Hop Retrieval Matters
This becomes extremely important for:
- reasoning-heavy tasks
- fact chaining
- enterprise document linking
- legal research
- knowledge graph style questions
- investigative search systems
Because many real-world questions are not answerable from a single chunk.
They require connected reasoning across multiple documents.
And simple one-step retrieval is not enough.
Query Rewriting & Expansion
Now here’s something incredibly important in RAG that many beginners completely miss.
Sometimes…
the problem is not:
- the vector database
- the embeddings
- the retriever
- the chunking strategy
Sometimes…
the real problem is simply:
The Query Itself
And this happens a lot.
Because users rarely ask perfect, search-friendly questions.
They ask things like:
“How does it work?”
“Which one is better?”
“Can you explain that thing again?”
For humans, this makes sense.
Because we remember the conversation.
But for a retriever…
this can be a disaster.
And this is where:
- Query Rewriting
- Query Expansion
become incredibly powerful.
These techniques help improve the question before retrieval even begins.
And that often improves recall dramatically.
Think of It Like This
Before the system searches the vector database…
it first asks:
“Is this actually the best version of the question?”
If not…
it improves it.
That is query rewriting.
It turns a vague question into a search-friendly one.
And that small step can completely change retrieval quality.
Why This Matters
Let’s take a simple example.
Suppose the user asks:
“How does it work?”
As humans, we understand this from context.
But for the retriever…
this is too vague.
What exactly is:
“it”
Does it mean:
- embeddings?
- chunking?
- ANN search?
- HNSW indexing?
The retriever has no idea.
So before retrieval, the system rewrites it into something clearer.
For example:
“How does ANN search work in a vector database?”
Now retrieval becomes much stronger.
Because the search is precise.
That is the power of:
Query Rewriting
Query Expansion
Now let’s look at another powerful technique:
Query Expansion
This works a little differently.
Instead of rewriting the question…
we expand it.
What Does Expansion Mean?
It means adding:
- related words
- synonyms
- alternate phrasing
- domain-specific variations
to improve retrieval coverage.
Because sometimes the exact words do not match…
but the meaning does.
And retrieval should still work.
A Simple Example
Suppose the user searches:
“salary hike”
The system may expand that into:
- pay raise
- compensation increase
- salary increment
Now the retriever can match documents using different language.
Even if the exact words are different.
That dramatically improves:
Recall
Because now fewer useful documents are missed.
And in RAG:
better recall = better answers
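A minimal sketch of dictionary-based expansion, using the same "salary hike" example. The synonym table is hand-written for illustration; real systems might use a domain thesaurus or ask an LLM for alternate phrasings.

```python
# Hand-written synonym table (an assumption for this example).
SYNONYMS = {
    "salary hike": ["pay raise", "compensation increase", "salary increment"],
}

def expand_query(query, synonyms):
    """Return the original query followed by any known alternate phrasings."""
    variants = [query]
    lowered = query.lower()
    for phrase, alternates in synonyms.items():
        if phrase in lowered:
            variants.extend(lowered.replace(phrase, alt) for alt in alternates)
    return variants

variants = expand_query("salary hike", SYNONYMS)
```

Each variant is then sent to the retriever, and the results are merged. The original query always stays first so nothing is lost if no synonym matches.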
Clarifying Missing Context
Sometimes the system also needs to add missing context.
For example, the user asks:
“best one among these”
That sounds natural in conversation.
But for retrieval…
it is almost useless.
Best what?
Among which options?
So the system uses conversation history to clarify.
It may rewrite it as:
“Best vector database among Pinecone, Milvus, and Qdrant”
Now the retriever understands exactly what needs to be searched.
That makes retrieval dramatically more precise.
This is incredibly important in:
- chatbots
- enterprise copilots
- conversational RAG systems
where users naturally ask follow-up questions.
LLM-Based Query Rewriting
Now here’s where things get really interesting.
Modern RAG systems often use the LLM itself to rewrite the query.
Yes—
the same model used for answering…
can also help improve retrieval.
And this works surprisingly well.
A Common Prompt
The system may prompt the model like this:
Rewrite the following conversational question into a clear search query
For example:
“Can you tell me how this indexing thing works?”
becomes:
“How does HNSW indexing work in vector databases?”
That rewritten query is then sent for retrieval.
This simple step can significantly improve recall.
Because now the retriever is searching with clarity.
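Here is one way the rewriting step might be wired up. The prompt layout is an assumption, and llm is any callable that takes a prompt string and returns text. The stub below returns a canned answer so the example runs without a real model.

```python
REWRITE_INSTRUCTION = (
    "Rewrite the following conversational question into a clear search query."
)

def build_rewrite_prompt(history, question):
    """Combine the instruction, recent turns, and the vague question into one prompt."""
    lines = [REWRITE_INSTRUCTION, "", "Conversation so far:"]
    lines += [f"- {turn}" for turn in history]
    lines += ["", f"Question: {question}", "Search query:"]
    return "\n".join(lines)

def rewrite_query(history, question, llm):
    """`llm` is any callable that maps a prompt string to a completion string."""
    return llm(build_rewrite_prompt(history, question)).strip()

def stub_llm(prompt):
    # Stand-in for a real model call; returns a canned answer.
    return " How does HNSW indexing work in vector databases? "

history = ["We were discussing HNSW indexes in vector databases."]
search_query = rewrite_query(history, "Can you tell me how this indexing thing works?", stub_llm)
```

Passing the conversation history is what lets the model resolve words like "this" or "it". Without it, the rewrite has nothing to work from.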
Self-Querying & Metadata-Aware Retrieval
So far, we’ve looked at retrieval as something fairly straightforward:
- user asks a question
- retriever finds relevant chunks
- LLM generates an answer
Simple.
But modern RAG systems are becoming much smarter than that.
This is where retrieval starts becoming:
Agentic
And one of the most exciting ideas in this space is:
Self-Querying
This is one of the biggest shifts happening in RAG today.
Because now…
the model is not just answering.
It is actively deciding:
what information it still needs
And that is incredibly powerful.
What Is Self-Querying?
Let’s imagine a user asks:
“Tell me about RAG performance optimization.”
The system performs retrieval.
It gets some chunks.
It starts generating an answer.
But while answering, the LLM realizes:
“Wait… I still need more information about reranking or chunking strategies.”
Instead of stopping there…
it creates its own follow-up question.
Something like:
“What are common reranking techniques in RAG?”
That new query is then sent back into the retrieval system.
More chunks are fetched.
And the answer becomes better.
That is:
Self-Querying
The Big Idea
The model is essentially asking itself:
What information am I still missing?
And then retrieving that information on its own.
This process can continue iteratively.
Step by step.
Like the model is planning its own research path.
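This is closely related to ideas known as:
- Self-RAG
- agentic retrieval
- autonomous retrieval planning
(Some frameworks use “self-querying” more narrowly, for an LLM that turns the question into a metadata filter.)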
Because the model is no longer passive.
It is actively thinking.
That is a major leap forward.
A Simple Real-World Analogy
Think of doing research yourself.
You start with one question.
Then while reading…
you realize:
“I need to understand this part better.”
So you search again.
Then again.
And again.
That is exactly what the model is doing.
It is not just answering.
It is investigating.
That makes responses:
- richer
- deeper
- more complete
Especially for complex multi-step problems.
Metadata-Aware Retrieval
Now let’s move to another extremely important concept:
Metadata-Aware Retrieval
In real enterprise RAG systems…
we rarely store just plain text.
Every chunk usually comes with metadata.
And that metadata can be just as important as the content itself.
What Metadata Looks Like
Metadata can include things like:
- source document
- publication date
- author
- category
- department
- access permissions
- confidence level
- version number
This helps the system answer not just:
“What is relevant?”
but also:
“What is trustworthy?”
“What is latest?”
“What is allowed?”
That changes everything.
Why Metadata Matters
Let’s take a simple example.
Suppose the user asks:
“What is the latest company leave policy?”
Now imagine your database contains:
- one version from 2022
- one from 2024
- one from 2026
Clearly…
we want the most recent one.
Not the outdated policy.
This is where metadata-aware retrieval becomes critical.
The retriever can prioritize:
newer documents first
Or even apply a strict rule like:
only retrieve documents after 2025
That dramatically improves answer quality.
Especially in enterprise systems.
Boosting & Filtering
Most modern vector databases support something very powerful:
Boolean Metadata Filters
For example:
category = HR
date > 2025
source = internal docs
This allows retrieval to become highly precise.
Instead of searching everything…
the system searches only what matters.
That improves:
- trust
- compliance
- answer relevance
- production safety
This is especially critical in:
- legal systems
- finance
- healthcare
- enterprise knowledge platforms
where wrong information can be expensive.
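The three example filters can be sketched as a plain boolean AND over each chunk's metadata. The chunk records are invented for illustration, and "date > 2025" is read here as "dated after the end of 2025". In a real deployment the filter is typically passed to the vector database as part of the query, not applied in a Python loop.

```python
from datetime import date

def matches(chunk, category, after, source):
    """True only if the chunk's metadata satisfies all three conditions (boolean AND)."""
    meta = chunk["metadata"]
    return (
        meta["category"] == category
        and meta["date"] > after
        and meta["source"] == source
    )

# Invented records for illustration.
chunks = [
    {"id": "leave-2022", "metadata": {"category": "HR", "date": date(2022, 3, 1), "source": "internal docs"}},
    {"id": "leave-2026", "metadata": {"category": "HR", "date": date(2026, 1, 15), "source": "internal docs"}},
    {"id": "pricing-2026", "metadata": {"category": "Sales", "date": date(2026, 2, 1), "source": "internal docs"}},
]

allowed = [c["id"] for c in chunks if matches(c, "HR", date(2025, 12, 31), "internal docs")]
```

Only the 2026 leave policy survives. The outdated version and the unrelated Sales document never reach the similarity search at all.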
Hybrid Scoring
Some advanced systems go even further.
They combine multiple scores.
For example:
- similarity score
- freshness score
- trust score
- source priority score
This creates:
Hybrid Scoring
Example:
highly relevant + recently updated should rank higher
while:
relevant but outdated should rank lower
This helps prevent one of the biggest enterprise problems:
outdated answers
And that is incredibly important for production AI.
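A minimal sketch of hybrid scoring with two signals, similarity and freshness. The weights (0.7 and 0.3), the one-year half-life, and the exponential decay are all assumptions chosen for illustration. Real systems tune these values and often add trust or source-priority terms.

```python
def freshness(age_days, half_life_days=365.0):
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)

def hybrid_score(similarity, age_days, w_similarity=0.7, w_freshness=0.3):
    """Weighted sum of semantic similarity and freshness (weights are illustrative)."""
    return w_similarity * similarity + w_freshness * freshness(age_days)

# A slightly less similar but recent chunk versus a more similar four-year-old one.
recent = hybrid_score(similarity=0.80, age_days=30)
outdated = hybrid_score(similarity=0.85, age_days=1460)
```

With these settings the recent chunk outranks the older one even though its raw similarity is lower. That is exactly the "relevant but outdated should rank lower" behavior.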
Hybrid Retrieval & Reranking
Now let’s talk about how modern RAG systems improve retrieval quality even further.
Because in production…
relying on just one retrieval method is often not enough.
And this is exactly where:
- Hybrid Retrieval
- Reranking
come in.
Honestly…
this is where average RAG systems become exceptional.
Because retrieval quality decides answer quality.
And these two techniques make that retrieval dramatically stronger.
Part 1 — Hybrid Retrieval
The idea is beautifully simple:
Instead of using only one search strategy… use the strengths of multiple ones
Because no single retrieval method is perfect.
And production systems know that.
Sparse Retrieval
The first half of hybrid retrieval is:
Sparse Retrieval
This is the classic keyword-based approach.
The most common example is:
BM25
This method is excellent at:
exact word matching
That makes it extremely useful for things like:
- product IDs
- invoice numbers
- error codes
- exact dates
- legal references
- precise names
For example, if the user searches:
error code 503
keyword search can often find the exact document instantly.
That is its superpower.
Dense Retrieval
The second half is embedding-based retrieval.
Instead of exact words…
it uses:
vector similarity
This is also called:
Semantic Search
Because it focuses on meaning.
Not exact phrasing.
Why We Combine Both
Now here’s the problem.
Keyword Search Can Miss Meaning
It struggles with:
- synonyms
- paraphrases
- natural language variation
Vector Search Can Miss Exact Terms
It sometimes struggles with:
- IDs
- dates
- exact product names
- specific codes
So instead of choosing one…
we combine both.
That is:
Hybrid Retrieval
How Hybrid Retrieval Works
The system runs:
- BM25 search
- Vector search
at the same time.
In parallel.
Then it merges the results.
This gives us both:
exact matching + semantic matching
That combination is incredibly powerful.
Because now retrieval becomes much more complete.
Reciprocal Rank Fusion (RRF)
Now comes the next important question:
How do we merge results from two different retrieval systems?
This is where:
Reciprocal Rank Fusion (RRF)
comes in.
And this is one of the smartest ranking tricks in production retrieval.
How RRF Works
Instead of relying only on raw similarity scores…
RRF focuses on:
rank positions
If a chunk ranks highly in:
- BM25
- vector search
- reranking systems
its final importance increases.
Because multiple systems agree it is valuable.
That makes ranking stronger.
And more reliable.
This improves:
- diversity
- relevance
- retrieval quality
without needing a perfect single retriever.
That is why RRF is widely used in high-quality search systems.
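In its usual form, RRF gives each chunk a score equal to the sum, over every ranked list it appears in, of 1 / (k + rank), where rank starts at 1. The constant k is commonly set to 60. Here is a short sketch; the chunk ids and the two rankings are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a chunk's score (rank starts at 1)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Made-up rankings from two retrievers.
bm25_ranking = ["c3", "c1", "c7"]
vector_ranking = ["c1", "c5", "c3"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Chunks c1 and c3 appear in both lists, so they rise above c5 and c7, which appear in only one. The raw BM25 and cosine scores are never compared, which is useful because they live on different scales.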
Part 2 — Reranking
Now after retrieval…
we move to the next powerful step:
Reranking
This is where precision becomes even stronger.
What Happens First?
Let’s say the initial retriever fetches:
- top 50 chunks
- top 100 chunks
These are good candidates.
But not all of them are equally useful.
Some are much better than others.
Now we need a smarter filter.
That is the job of the:
Reranker
How Reranking Works
The reranker evaluates each pair:
query + chunk
and scores:
how truly relevant this chunk is
This is much more accurate than simple vector similarity.
Because now the model looks at deeper contextual meaning.
Not just embedding distance.
Cross-Encoder Rerankers
A very common reranker is a:
Cross-Encoder
This is one of the most powerful retrieval upgrades in RAG.
Common Models Used
Many rerankers use models based on:
BERT-family encoders such as RoBERTa
And in modern systems…
even small LLMs can act as rerankers.
That flexibility is very powerful.
Why Reranking Matters So Much
This final filter step dramatically improves:
the quality of context sent to the LLM
And that directly improves:
final answer quality
Because in RAG:
the model can only answer based on what gets retrieved
Better retrieval → Better prompts → Better answers
Almost always.
Advanced Retrieval Techniques
So far, we’ve explored:
- vector search
- hybrid retrieval
- reranking
- multi-hop retrieval
- self-querying
And all of these are powerful.
But now we move into something even more exciting.
This is where retrieval goes beyond simple search…
and starts becoming:
Intelligent Reasoning
These are the techniques used in cutting-edge RAG systems.
The kinds of systems powering:
- enterprise copilots
- autonomous AI agents
- research assistants
- complex decision-support platforms
Let’s explore them.
1. Graph RAG
Let’s start with one of the most powerful approaches:
Graph RAG
This is especially useful for:
multi-hop reasoning
and complex fact chaining.
Because sometimes the answer is not inside one chunk.
It exists across connected relationships.
The Core Idea
Traditional RAG treats documents like isolated chunks.
Each chunk is retrieved mostly through similarity.
But Graph RAG works differently.
It transforms information into:
a graph of entities and relationships
Instead of isolated text…
we now have connected knowledge.
Think of It Like a Network
For example:
- RAG → uses → embeddings
- embeddings → stored in → vector database
- vector database → uses → ANN search
This creates a connected web of knowledge.
Now the system can do more than search.
It can:
traverse relationships
Almost like following links inside a knowledge network.
That is incredibly powerful.
Why This Matters
Let’s take a question like:
“Who is the CEO of the company that acquired GitHub?”
This requires multiple connected facts.
The system must first know:
- who acquired GitHub
Then:
- who leads that company
This is hard for simple vector retrieval.
But Graph RAG handles it beautifully.
Because it retrieves through:
entity connections
Not just semantic similarity.
That is a major upgrade.
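Relationship traversal can be sketched with a plain list of triples. The example below reuses the small network from earlier (RAG, embeddings, vector database, ANN search) and is only an illustration. Production systems typically store the graph in a graph database and extract the triples with an LLM or an entity pipeline.

```python
# (head, relation, tail) triples from the network example.
TRIPLES = [
    ("RAG", "uses", "embeddings"),
    ("embeddings", "stored in", "vector database"),
    ("vector database", "uses", "ANN search"),
]

def follow(entity, relation, triples):
    """Return every entity reachable from `entity` over one edge labeled `relation`."""
    return [tail for head, rel, tail in triples if head == entity and rel == relation]

def follow_path(start, relations, triples):
    """Walk a chain of relations, one hop per relation, starting from `start`."""
    frontier = [start]
    for relation in relations:
        frontier = [tail for entity in frontier for tail in follow(entity, relation, triples)]
    return frontier

answer = follow_path("RAG", ["uses", "stored in", "uses"], TRIPLES)
```

The GitHub question has the same shape: one hop along an "acquired by" edge, then one hop along a "CEO" edge. Each hop is an exact lookup on an entity, not a similarity guess.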
2. Agentic Retrieval
Now let’s talk about one of the most futuristic concepts in RAG:
Agentic Retrieval
This is where retrieval starts feeling like real autonomous research.
The Big Shift
Instead of only querying a vector database…
the system can decide:
Which tool should I use next?
That changes everything.
Because now retrieval becomes:
planning + decision-making
not just search.
Multiple Specialized Agents
For example:
- one agent performs web search
- one queries an internal database
- one checks a knowledge graph
- one retrieves from company documents
Each tool has a different strength.
The AI decides which one is best for the current problem.
That makes the system feel much more like a researcher.
Not just a chatbot.
This is becoming extremely important in:
- enterprise copilots
- autonomous workflows
- advanced AI assistants
because real-world questions rarely live in one place.
3. Recursive Retrieval
Next comes:
Recursive Retrieval
This is one of the smartest ideas in advanced RAG.
Because here…
the LLM uses its own output to improve the next retrieval step.
How It Works
Let’s say the system gives an initial answer.
While generating that answer, it realizes:
“I still need more information about this specific part.”
Instead of stopping…
it creates a new search query based on its own response.
That new query is sent back to the retriever.
More context is fetched.
And the answer improves.
This process can repeat multiple times.
Like the model is refining its own research.
That is recursive retrieval.
Why It Feels Powerful
It feels almost like the model is thinking:
“I’m not done yet. I need one more piece of evidence.”
That creates much deeper and more complete answers.
This is closely related to:
- multi-hop retrieval
- self-querying
- agentic RAG
And it is extremely powerful for complex reasoning tasks.
4. Knowledge Graph Integration
Now let’s look at one of the strongest hybrid architectures in modern RAG.
Some advanced systems combine:
- Vector Databases
- Graph Databases
This gives the best of both worlds.
How This Architecture Works
Step 1
Vector search retrieves the most relevant entities or chunks.
For example:
retrieve entity → RAG
Step 2
A graph database then explores connected facts.
For example:
- uses → embeddings
- depends on → vector DB
- improves → LLM accuracy
This creates:
semantic retrieval + relational reasoning
And that is incredibly powerful.
Because now the system understands both:
- similarity
- relationships
Not just one.
Advanced Querying
Some systems even use query languages like:
SPARQL
This allows highly structured graph queries.
Especially useful for:
- enterprise knowledge systems
- compliance platforms
- healthcare retrieval
- legal research systems
where relationships matter deeply.