Retrieval System

Course: Everything about Retrieval Augmented Generation (RAG)

Retriever Architecture

Now let’s talk about one of the most important parts of any production RAG system:

Retriever Architecture

If embeddings are the language…

and vector databases are the storage…

then the retriever is the brain.

Because this is the part that decides:

What information should the model actually read?

And that decision changes everything.

Retrieval Is Usually Not One Step

Many beginners imagine retrieval like this:

user asks a question → system finds the answer

But in real-world RAG systems…

retrieval is usually much smarter.

It typically happens in:

Two Stages

And this dual-stage design is what gives us both:

speed and accuracy

Because production systems need both.

Fast systems without accuracy are useless.

Accurate systems without speed are unusable.

So we combine both.

Let’s break it down.

Stage 1 — Fast Recall

The first stage is all about:

Speed

Its only job is to quickly fetch a large pool of potentially relevant chunks.

This is called:

The Recall Stage

At this point, we do not need perfect ranking.

We just need one thing:

Don’t miss useful chunks

That is the goal.

How This Stage Works

This stage usually uses something called a:

Bi-Encoder Retriever

Here’s the idea:

  • the user query is converted into an embedding
  • all document chunks already have embeddings

Both are encoded independently.

Then the system performs:

Approximate Nearest Neighbor (ANN) Search

to find the nearest vectors.

Because chunk embeddings are computed ahead of time, and ANN search avoids comparing the query against every stored vector, this stage is extremely fast.

This is exactly where vector databases like:

  • Pinecone
  • Milvus
  • Qdrant

come into play.
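
Here is a minimal sketch of the recall stage in Python. It is not how any particular vector database works. The `embed` function is a toy bag-of-words stand-in for a real bi-encoder model, and the search is brute force rather than a real ANN index. What it does show is the key property: the query and the chunks are encoded independently, and chunk vectors are built once, before any query arrives.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recall_stage(query: str, index: dict[str, Counter], top_n: int = 100) -> list[tuple[str, float]]:
    """Brute-force nearest-neighbor search; a real system would use an ANN index."""
    q = embed(query)  # the query is encoded on its own, without seeing any chunk
    scored = [(chunk, cosine(q, vec)) for chunk, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

chunks = [
    "ANN search finds the nearest vectors quickly",
    "Rerankers score query and chunk together",
    "Vector databases store chunk embeddings",
]
# Chunk embeddings are computed once, ahead of time.
index = {chunk: embed(chunk) for chunk in chunks}
candidates = recall_stage("how does ANN search work", index, top_n=2)
```

In a real pipeline, `top_n` would be the 50 to 200 range discussed next, and the sort would be replaced by the database's own index lookup.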

What Gets Retrieved?

Typically, this stage retrieves something like:

  • top 50 chunks
  • top 100 chunks
  • sometimes even top 200 chunks

Why so many?

Because the idea is simple:

retrieve a superset of potentially useful information

At this stage, it is okay if some irrelevant chunks sneak in.

Because speed matters most.

Precision comes later.

Stage 2 — Accurate Reranking

Now Accuracy Takes Over

Once we have the candidate chunks…

the next question becomes:

Which ones are actually the best?

This is where the second stage begins.

And this stage is all about:

Accuracy

The Role of the Reranker

This stage uses something called a:

Reranker

Its job is to take the top candidate chunks and rank them properly.

Because not every retrieved chunk is equally useful.

Some are much better than others.

We need the best ones.

The most common reranker is a:

Cross-Encoder

This is typically more accurate than a bi-encoder, but also more expensive to run.

Bi-Encoder vs Cross-Encoder

Bi-Encoder

Query and document are encoded separately

→ fast

Cross-Encoder

Query and document are processed together

→ slower, but much more accurate

Because now the model can deeply understand:

the actual relationship between the query and the chunk

Not just vector similarity.

But real contextual relevance.

That makes reranking much stronger.

Why We Don’t Use Cross-Encoder Everywhere

Because it is slower.

Much slower.

Running it across millions of documents would be too expensive.

So we only use it on:

the candidates from Stage 1

That is usually a few dozen to a couple of hundred chunks, not millions.

That gives us:

fast recall + accurate ranking

And that is exactly what production RAG systems need.
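
The two stages can be wired together roughly like this. Treat it as a sketch under stated assumptions: `two_stage_retrieve`, `toy_recall`, and `toy_score_pair` are hypothetical names, and the toy scorer uses simple word overlap where a real system would call a cross-encoder model on each (query, chunk) pair.

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    recall: Callable[[str, int], list[str]],
    score_pair: Callable[[str, str], float],
    recall_n: int = 100,
    final_k: int = 5,
) -> list[str]:
    # Stage 1: fast recall returns a wide pool of candidates.
    candidates = recall(query, recall_n)
    # Stage 2: the slower pairwise scorer runs only on that pool.
    reranked = sorted(candidates, key=lambda chunk: score_pair(query, chunk), reverse=True)
    return reranked[:final_k]

# --- Toy stand-ins so the sketch runs without any model ---
corpus = [
    "rerankers improve ranking accuracy",
    "a cross-encoder reranker reads the query and chunk together",
    "vector databases store embeddings",
]

def toy_recall(query: str, n: int) -> list[str]:
    # Keeps any chunk sharing at least one word with the query.
    words = set(query.lower().split())
    return [c for c in corpus if words & set(c.lower().split())][:n]

def toy_score_pair(query: str, chunk: str) -> float:
    # Jaccard overlap; a real cross-encoder would read both texts jointly.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c)

best = two_stage_retrieve(
    "cross-encoder reranker accuracy", toy_recall, toy_score_pair, final_k=1
)
```

The design point is the cost profile: `score_pair` is called once per candidate, so its cost scales with `recall_n`, not with the size of the whole corpus.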

Multi-Query Retrieval

Now some advanced systems go one step further.

They use something called:

Multi-Query Retrieval

This is incredibly useful for complex questions.

A Real Example

Suppose the user asks:

“How does RAG improve accuracy and reduce hallucinations?”

This is not really one question.

It contains multiple ideas.

For example:

  • how RAG improves accuracy
  • how RAG reduces hallucination
  • how retrieval quality affects final answers

If we search using only one query, we may miss important information.

So instead…

the system breaks it into multiple sub-queries.

Each one retrieves different chunks.

That improves:

  • retrieval coverage
  • context diversity
  • final answer quality

Especially in enterprise search systems.
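
A rough sketch of the merge step, assuming the sub-queries have already been produced (for example by an LLM, which is not shown here). The `search` callable and the `fake_index` data are hypothetical placeholders for a real retriever.

```python
from typing import Callable

def multi_query_retrieve(
    sub_queries: list[str],
    search: Callable[[str, int], list[str]],
    k_per_query: int = 5,
) -> list[str]:
    """Run every sub-query and merge the results, dropping duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for sub_query in sub_queries:
        for chunk_id in search(sub_query, k_per_query):
            if chunk_id not in seen:  # the same chunk often matches several sub-queries
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged

# Hypothetical retriever: maps each sub-query to a ranked list of chunk IDs.
fake_index = {
    "how RAG improves accuracy": ["c1", "c2"],
    "how RAG reduces hallucination": ["c2", "c3"],
}

def fake_search(query: str, k: int) -> list[str]:
    return fake_index.get(query, [])[:k]

merged = multi_query_retrieve(list(fake_index), fake_search)
```

Here the merged list is simply first-seen order. A production system would usually rerank the merged pool before sending it to the LLM.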

Top-K Retrieval & Thresholding

Now let’s talk about a very important retrieval question:

How many chunks should we actually retrieve?

Because once the vector database finds similar embeddings…

the job is still not finished.

We still need to decide:

How much context should be sent to the LLM?

And this decision matters a lot.

Too little context…

and the model may miss important information.

Too much context…

and the prompt becomes noisy, expensive, and less accurate.

This is where two important strategies come in:

  • Top-K Retrieval
  • Threshold-Based Retrieval

These are small decisions…

but they have a huge impact on answer quality.

What Is Top-K Retrieval?

This simply means:

Return the top K most similar chunks

That’s it.

Very simple.

Very common.

A Few Examples

For example:

  • Top 5
  • Top 10
  • Top 20

If we set:

k=10

the system returns the:

10 nearest neighbors

based on similarity score.

These are the chunks considered most semantically relevant to the user’s query.

This is the default approach in most RAG systems.

Because it is:

  • simple
  • predictable
  • easy to tune

Let’s say a user asks:

“What is semantic chunking?”

The system may retrieve:

the top 10 most relevant chunks

related to chunking, retrieval, and semantic segmentation.

Those chunks are then added to the prompt.

And the LLM uses them to generate the final answer.

That is standard Top-K retrieval.

Why Choosing K Matters

Now here comes the important part.

Choosing the value of K is not random.

It should be based on real engineering decisions.

And usually, two major factors matter most.

Factor 1 — LLM Context Budget

The first factor is:

Context Window Size

Every LLM has a limit on how much text it can process in a single prompt.

This is called the:

context window

If you retrieve too many chunks…

you may:

  • exceed the model’s limit
  • increase token cost
  • flood the prompt with irrelevant content

And surprisingly…

more context does not always mean better answers.

Sometimes too much context actually reduces quality.

Because the important signal gets buried inside noise.

That is why K must be chosen carefully.
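
One simple way to respect the context budget is to add chunks in rank order until the budget runs out. The sketch below assumes a hypothetical `pack_context` helper and counts whitespace-separated words as a rough proxy for tokens. A real system would use the tokenizer of the model it calls.

```python
from typing import Callable

def pack_context(
    ranked_chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep the best-ranked chunks that fit inside the token budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:  # assumed to be sorted best-first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected

packed = pack_context(["a b c", "d e", "f g h i"], max_tokens=5)
```

With this approach K is no longer a fixed number. It becomes whatever fits, which ties the retrieval size directly to the budget.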

Factor 2 — Query Complexity

The second factor is:

Query Complexity

Not every question needs the same amount of context.

A simple question may only need:

3 to 5 chunks

But a complex, multi-part question may need much more.

For example:

“Compare Pinecone, Weaviate, and Milvus for enterprise RAG systems”

This question touches:

  • scalability
  • cost
  • deployment
  • filtering
  • infrastructure choices

That likely requires a larger retrieval set.

So sometimes:

bigger question = bigger K

And that makes perfect sense.

Threshold-Based Retrieval

Now let’s look at another strategy.

Instead of choosing a fixed K…

some systems use:

Similarity Thresholding

This works differently.

How Thresholding Works

Instead of saying:

Always return 10 chunks

we say:

Return every chunk above a certain similarity score

For example:

cosine similarity > 0.85

That means:

Only retrieve chunks that are truly relevant.

No unnecessary filler.

This creates a much more dynamic retrieval system.

Why This Is Powerful

Sometimes only:

3 chunks

are highly relevant.

And sometimes:

15 chunks

deserve to be included.

Thresholding adapts automatically.

That flexibility can improve retrieval quality significantly.

Especially when similarity scores are reliable.

This makes the system smarter than fixed Top-K alone.
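
A small sketch of thresholding, with a cap added so a very broad query cannot flood the prompt. The function name and the `max_chunks` cap are illustrative assumptions. The 0.85 cutoff is just the example value from above, and a sensible threshold depends on the embedding model, because different models produce different score ranges.

```python
def select_by_threshold(
    scored: list[tuple[str, float]],
    threshold: float = 0.85,
    max_chunks: int = 15,
) -> list[str]:
    """Return chunk IDs scoring above the threshold, best first, capped at max_chunks."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = [chunk_id for chunk_id, score in ranked if score > threshold]
    return kept[:max_chunks]

scored = [("a", 0.91), ("b", 0.80), ("c", 0.88), ("d", 0.86)]
relevant = select_by_threshold(scored)
capped = select_by_threshold(scored, max_chunks=2)
```

Note that the result can be empty when nothing clears the bar. That is often useful: it lets the system say "I don't know" instead of answering from weak context.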

A Simple Way to Remember It

Think of it like this:

Top-K says:

Always give me the best 10

Thresholding says:

Give me everything that is truly relevant

Both are valid.

Both are useful.

The best choice depends on your use case.

Multi-Hop Retrieval

So far, most of our retrieval examples looked simple:

user asks a question → system retrieves relevant chunks → LLM generates the answer

Clean.

Fast.

Straightforward.

But in real-world systems…

not every question is that simple.

Some questions cannot be answered in a single retrieval step.

They require:

multiple connected retrievals

And this is where we enter a much more advanced concept:

Multi-Hop Retrieval

This is one of the most powerful ideas in advanced RAG systems.

Think of It Like Solving a Puzzle

Imagine solving a puzzle.

You don’t get the final answer from one piece.

You first find:

  • piece one

which helps you locate:

  • piece two

and only after connecting both…

you understand the full picture.

That is exactly how multi-hop retrieval works.

Instead of one retrieval…

the system performs retrieval in stages.

One answer leads to the next search.

That is why it is called:

multi-hop

Because the system “hops” from one piece of knowledge to another.

A Simple Example

Let’s take a question like:

“Where is Bengaluru, and what is its population?”

At first glance, this looks like one question.

But actually…

it requires two steps.

First Hop

The system first needs to retrieve:

Where is Bengaluru located?

The answer might be:

Karnataka, India

Second Hop

Now, using that context, the system performs the next retrieval:

What is the population of Bengaluru, Karnataka?

Only after both steps are connected…

can the final answer be complete.

That is multi-hop retrieval.

How It Works Internally

So how does the system actually do this?

There are usually two major approaches.

Approach 1 — Iterative Vector DB Queries

The first method is:

Iterative Querying

This is the most intuitive approach.

How It Works

  1. Query the vector database once
  2. Use that result to generate the next query
  3. Query again
  4. Repeat until enough context is gathered

It works like following breadcrumbs.

Each answer leads to the next search.

This approach is simple and very effective.

Especially for structured reasoning tasks.
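
The four steps above can be sketched as a small loop. Everything here is a hypothetical stand-in: `search` represents the vector database call, and `next_query` represents whatever decides the follow-up query (a rule, as in this toy, or an LLM, as in the next approach). The chunk texts are placeholders, not real data.

```python
from typing import Callable, Optional

def iterative_retrieve(
    first_query: str,
    search: Callable[[str], list[str]],
    next_query: Callable[[list[str]], Optional[str]],
    max_hops: int = 3,
) -> list[str]:
    """Query, derive a follow-up query from the results, and repeat."""
    context: list[str] = []
    query: Optional[str] = first_query
    for _ in range(max_hops):  # hard cap so the loop always terminates
        if query is None:
            break  # enough context gathered
        for chunk in search(query):
            if chunk not in context:
                context.append(chunk)
        query = next_query(context)
    return context

# Toy knowledge base keyed by a trigger word (placeholder text, not real data).
docs = {
    "where": "Bengaluru is a city in Karnataka, India.",
    "population": "Census table: population of Bengaluru, Karnataka.",
}

def toy_search(query: str) -> list[str]:
    return [text for key, text in docs.items() if key in query.lower()]

def toy_next_query(context: list[str]) -> Optional[str]:
    # Ask for the population only if no retrieved chunk covers it yet.
    if any("population" in chunk.lower() for chunk in context):
        return None
    return "population of Bengaluru, Karnataka"

hops = iterative_retrieve("where is Bengaluru located", toy_search, toy_next_query)
```

The `max_hops` limit matters in practice. Without it, a planner that never feels "done" would keep querying forever.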

Approach 2 — Chained LLM Calls

The second method is more advanced.

Here we use:

Chained LLM Calls

This often involves:

Chain-of-Thought Reasoning

Instead of manually deciding the next retrieval step…

the LLM helps plan it.

How This Works

  1. First retrieved chunk is passed to the LLM
  2. The LLM decides what information is missing
  3. It generates the next retrieval query
  4. Retrieval happens again
  5. The cycle continues step by step

This is very common in:

Agentic RAG Systems

Here, the model actively decides its own retrieval path.

That makes the system far more powerful.

Especially for complex enterprise workflows.

Why Multi-Hop Retrieval Matters

This becomes extremely important for:

  • reasoning-heavy tasks
  • fact chaining
  • enterprise document linking
  • legal research
  • knowledge graph style questions
  • investigative search systems

Because many real-world questions are not answerable from a single chunk.

They require connected reasoning across multiple documents.

And simple one-step retrieval is not enough.

Query Rewriting & Expansion

Now here’s something incredibly important in RAG that many beginners completely miss.

Sometimes…

the problem is not:

  • the vector database
  • the embeddings
  • the retriever
  • the chunking strategy

Sometimes…

the real problem is simply:

The Query Itself

And this happens a lot.

Because users rarely ask perfect, search-friendly questions.

They ask things like:

“How does it work?”

“Which one is better?”

“Can you explain that thing again?”

For humans, this makes sense.

Because we remember the conversation.

But for a retriever…

this can be a disaster.

And this is where:

  • Query Rewriting
  • Query Expansion

become incredibly powerful.

These techniques help improve the question before retrieval even begins.

And that often improves recall dramatically.

Think of It Like This

Before the system searches the vector database…

it first asks:

“Is this actually the best version of the question?”

If not…

it improves it.

That is query rewriting.

It turns a vague question into a search-friendly one.

And that small step can completely change retrieval quality.

Why This Matters

Let’s take a simple example.

Suppose the user asks:

“How does it work?”

As humans, we understand this from context.

But for the retriever…

this is too vague.

What exactly is:

“it”

Does it mean:

  • embeddings?
  • chunking?
  • ANN search?
  • HNSW indexing?

The retriever has no idea.

So before retrieval, the system rewrites it into something clearer.

For example:

“How does ANN search work in a vector database?”

Now retrieval becomes much stronger.

Because the search is precise.

That is the power of:

Query Rewriting

Query Expansion

Now let’s look at another powerful technique:

Query Expansion

This works a little differently.

Instead of rewriting the question…

we expand it.

What Does Expansion Mean?

It means adding:

  • related words
  • synonyms
  • alternate phrasing
  • domain-specific variations

to improve retrieval coverage.

Because sometimes the exact words do not match…

but the meaning does.

And retrieval should still work.

A Simple Example

Suppose the user searches:

“salary hike”

The system may expand that into:

  • pay raise
  • compensation increase
  • salary increment

Now the retriever can match documents using different language.

Even if the exact words are different.

That dramatically improves:

Recall

Because now fewer useful documents are missed.

And in RAG:

better recall = better answers
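
A minimal sketch of dictionary-based expansion, using the synonyms from the example above. The `SYNONYMS` table and `expand_query` function are illustrative assumptions. Real systems may get expansions from a domain thesaurus or from an LLM instead.

```python
SYNONYMS = {
    "salary hike": ["pay raise", "compensation increase", "salary increment"],
}

def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the original query plus one variant per known alternative phrase."""
    variants = [query]
    lowered = query.lower()
    for phrase, alternatives in synonyms.items():
        if phrase in lowered:
            variants.extend(lowered.replace(phrase, alt) for alt in alternatives)
    return variants

variants = expand_query("salary hike policy", SYNONYMS)
```

Each variant can then be searched separately and the results merged, or the variants can be joined into one longer query for keyword search.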

Clarifying Missing Context

Sometimes the system also needs to add missing context.

For example, the user asks:

“best one among these”

That sounds natural in conversation.

But for retrieval…

it is almost useless.

Best what?

Among which options?

So the system uses conversation history to clarify.

It may rewrite it as:

“Best vector database among Pinecone, Milvus, and Qdrant”

Now the retriever understands exactly what needs to be searched.

That makes retrieval dramatically more precise.

This is incredibly important in:

  • chatbots
  • enterprise copilots
  • conversational RAG systems

where users naturally ask follow-up questions.

LLM-Based Query Rewriting

Now here’s where things get really interesting.

Modern RAG systems often use the LLM itself to rewrite the query.

Yes—

the same model used for answering…

can also help improve retrieval.

And this works surprisingly well.

A Common Prompt

The system may prompt the model like this:

Rewrite the following conversational question into a clear search query

For example:

“Can you tell me how this indexing thing works?”

becomes:

“How does HNSW indexing work in vector databases?”

That rewritten query is then sent for retrieval.

This simple step can significantly improve recall.

Because now the retriever is searching with clarity.
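
Here is one way that rewriting step might be wired up. This is a sketch, not a specific library's API: `llm` is any callable that takes a prompt string and returns text, and the prompt layout beyond the instruction sentence is an assumption.

```python
from typing import Callable

REWRITE_INSTRUCTION = (
    "Rewrite the following conversational question into a clear search query."
)

def build_rewrite_prompt(question: str, history: list[str]) -> str:
    """Combine the instruction, the conversation so far, and the new question."""
    lines = [REWRITE_INSTRUCTION, "", "Conversation so far:"]
    lines += [f"- {turn}" for turn in history]
    lines += ["", f"Question: {question}", "Search query:"]
    return "\n".join(lines)

def rewrite_query(question: str, history: list[str], llm: Callable[[str], str]) -> str:
    rewritten = llm(build_rewrite_prompt(question, history)).strip()
    # Fall back to the original question if the model returns nothing.
    return rewritten or question

# Stub model for demonstration; a real system would call an actual LLM here.
def stub_llm(prompt: str) -> str:
    return "How does HNSW indexing work in vector databases?"

query = rewrite_query(
    "Can you tell me how this indexing thing works?",
    ["We were discussing HNSW in vector databases."],
    stub_llm,
)
```

Passing the conversation history is what lets the model resolve words like "this" and "it". Without it, the rewrite can only guess.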

Self-Querying & Metadata-Aware Retrieval

So far, we’ve looked at retrieval as something fairly straightforward:

  • user asks a question
  • retriever finds relevant chunks
  • LLM generates an answer

Simple.

But modern RAG systems are becoming much smarter than that.

This is where retrieval starts becoming:

Agentic

And one of the most exciting ideas in this space is:

Self-Querying

This is one of the biggest shifts happening in RAG today.

Because now…

the model is not just answering.

It is actively deciding:

what information it still needs

And that is incredibly powerful.

What Is Self-Querying?

Let’s imagine a user asks:

“Tell me about RAG performance optimization.”

The system performs retrieval.

It gets some chunks.

It starts generating an answer.

But while answering, the LLM realizes:

“Wait… I still need more information about reranking or chunking strategies.”

Instead of stopping there…

it creates its own follow-up question.

Something like:

“What are common reranking techniques in RAG?”

That new query is then sent back into the retrieval system.

More chunks are fetched.

And the answer becomes better.

That is:

Self-Querying

The Big Idea

The model is essentially asking itself:

What information am I still missing?

And then retrieving that information on its own.

This process can continue iteratively.

Step by step.

Like the model is planning its own research path.

This idea is closely related to:

  • Self-RAG
  • agentic retrieval
  • autonomous retrieval planning

Because the model is no longer passive.

It is actively thinking.

That is a major leap forward.

A Simple Real-World Analogy

Think of doing research yourself.

You start with one question.

Then while reading…

you realize:

“I need to understand this part better.”

So you search again.

Then again.

And again.

That is exactly what the model is doing.

It is not just answering.

It is investigating.

That makes responses:

  • richer
  • deeper
  • more complete

Especially for complex multi-step problems.

Metadata-Aware Retrieval

Now let’s move to another extremely important concept:

Metadata-Aware Retrieval

In real enterprise RAG systems…

we rarely store just plain text.

Every chunk usually comes with metadata.

And that metadata can be just as important as the content itself.

What Metadata Looks Like

Metadata can include things like:

  • source document
  • publication date
  • author
  • category
  • department
  • access permissions
  • confidence level
  • version number

This helps the system answer not just:

“What is relevant?”

but also:

“What is trustworthy?”

“What is latest?”

“What is allowed?”

That changes everything.

Why Metadata Matters

Let’s take a simple example.

Suppose the user asks:

“What is the latest company leave policy?”

Now imagine your database contains:

  • one version from 2022
  • one from 2024
  • one from 2026

Clearly…

we want the most recent one.

Not the outdated policy.

This is where metadata-aware retrieval becomes critical.

The retriever can prioritize:

newer documents first

Or even apply a strict rule like:

only retrieve documents after 2025

That dramatically improves answer quality.

Especially in enterprise systems.

Boosting & Filtering

Most modern vector databases support something very powerful:

Boolean Metadata Filters

For example:

category = HR
date > 2025
source = internal docs

This allows retrieval to become highly precise.

Instead of searching everything…

the system searches only what matters.
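
To make the filter logic concrete, here is a plain-Python sketch using the leave-policy example. The `Chunk` fields and `apply_filters` helper are hypothetical. Real vector databases apply filters like these inside the search itself, each with its own filter syntax, which is not shown here.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    category: str
    year: int
    source: str

def apply_filters(chunks: list[Chunk], category: str, after_year: int, source: str) -> list[Chunk]:
    """Keep only chunks matching every condition (a boolean AND of the filters)."""
    return [
        c for c in chunks
        if c.category == category and c.year > after_year and c.source == source
    ]

policies = [
    Chunk("Leave policy, 2022 version", "HR", 2022, "internal docs"),
    Chunk("Leave policy, 2024 version", "HR", 2024, "internal docs"),
    Chunk("Leave policy, 2026 version", "HR", 2026, "internal docs"),
]
# category = HR, date > 2025, source = internal docs
current = apply_filters(policies, category="HR", after_year=2025, source="internal docs")
```

Only the 2026 version survives the filter, so the similarity search never even sees the outdated policies.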

That improves:

  • trust
  • compliance
  • answer relevance
  • production safety

This is especially critical in:

  • legal systems
  • finance
  • healthcare
  • enterprise knowledge platforms

where wrong information can be expensive.

Hybrid Scoring

Some advanced systems go even further.

They combine multiple scores.

For example:

  • similarity score
  • freshness score
  • trust score
  • source priority score

This creates:

Hybrid Scoring

Example:

highly relevant + recently updated should rank higher

while:

relevant but outdated should rank lower

This helps prevent one of the biggest enterprise problems:

outdated answers

And that is incredibly important for production AI.
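
One possible way to combine the scores is a weighted sum with a freshness term that decays over time. The weights, the one-year half-life, and the `hybrid_score` name are all illustrative assumptions, not recommended values. They would need tuning on real data.

```python
def hybrid_score(
    similarity: float,
    age_days: float,
    trust: float,
    weights: tuple[float, float, float] = (0.6, 0.25, 0.15),
    half_life_days: float = 365.0,
) -> float:
    """Blend similarity, freshness, and trust into one ranking score."""
    # Freshness halves every `half_life_days`: 1.0 for a brand-new document.
    freshness = 0.5 ** (age_days / half_life_days)
    w_sim, w_fresh, w_trust = weights
    return w_sim * similarity + w_fresh * freshness + w_trust * trust

# Slightly less similar but one month old
recent = hybrid_score(similarity=0.80, age_days=30, trust=0.9)
# Slightly more similar but four years old
outdated = hybrid_score(similarity=0.85, age_days=1460, trust=0.9)
```

With these example weights, the recent chunk outranks the outdated one even though its raw similarity is a little lower, which is the behavior described above.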

Hybrid Retrieval & Reranking

Now let’s talk about how modern RAG systems improve retrieval quality even further.

Because in production…

relying on just one retrieval method is often not enough.

And this is exactly where:

  • Hybrid Retrieval
  • Reranking

come in.

Honestly…

this is where average RAG systems become exceptional.

Because retrieval quality decides answer quality.

And these two techniques make that retrieval dramatically stronger.

Part 1 — Hybrid Retrieval

The idea is beautifully simple:

Instead of using only one search strategy… use the strengths of multiple ones

Because no single retrieval method is perfect.

And production systems know that.

Sparse Retrieval

The first half of hybrid retrieval is:

Sparse Retrieval

This is the classic keyword-based approach.

The most common example is:

BM25

This method is excellent at:

exact word matching

That makes it extremely useful for things like:

  • product IDs
  • invoice numbers
  • error codes
  • exact dates
  • legal references
  • precise names

For example, if the user searches:

error code 503

keyword search can often find the exact document instantly.

That is its superpower.
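
For intuition, here is a compact BM25 scorer in plain Python. It is a teaching sketch, assuming whitespace tokenization and a non-empty corpus. The `k1` and `b` values are commonly used defaults, and production systems would rely on a search engine's built-in implementation instead.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only exact term matches contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "error code 503 service unavailable",
    "error code 404 page not found",
    "how to reset your password",
]
scores = bm25_scores("error code 503", docs)
```

The rare term "503" carries the most weight because it appears in only one document, which is exactly why keyword search is so good at IDs and codes.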

Dense Retrieval

The second half is embedding-based retrieval.

Instead of exact words…

it uses:

vector similarity

This is also called:

Semantic Search

Because it focuses on meaning.

Not exact phrasing.

Why We Combine Both

Now here’s the problem.

Keyword Search Can Miss Meaning

It struggles with:

  • synonyms
  • paraphrases
  • natural language variation

Vector Search Can Miss Exact Terms

It sometimes struggles with:

  • IDs
  • dates
  • exact product names
  • specific codes

So instead of choosing one…

we combine both.

That is:

Hybrid Retrieval

How Hybrid Retrieval Works

The system runs:

  • BM25 search
  • Vector search

at the same time.

In parallel.

Then it merges the results.

This gives us both:

exact matching + semantic matching

That combination is incredibly powerful.

Because now retrieval becomes much more complete.

Reciprocal Rank Fusion (RRF)

Now comes the next important question:

How do we merge results from two different retrieval systems?

This is where:

Reciprocal Rank Fusion (RRF)

comes in.

And this is one of the smartest ranking tricks in production retrieval.

How RRF Works

Instead of relying on raw similarity scores, which are not directly comparable between BM25 and vector search…

RRF focuses on:

rank positions

If a chunk ranks highly in:

  • BM25
  • vector search
  • reranking systems

its fused score increases, because every list adds 1 / (k + rank) for that chunk.

Because multiple systems agree it is valuable.

That makes ranking stronger.

And more reliable.

This improves:

  • diversity
  • relevance
  • retrieval quality

without needing a perfect single retriever.

That is why RRF is widely used in high-quality search systems.
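
RRF is short enough to write out in full. In this sketch each ranked list contributes 1 / (k + rank) to a chunk's fused score, where rank starts at 1. The constant k = 60 is a commonly used default, and the chunk IDs are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Here `d2` comes out on top because it ranks near the top of both lists, while `d3` and `d4` each appear in only one. No raw scores are used, only positions.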

Part 2 — Reranking

Now after retrieval…

we move to the next powerful step:

Reranking

This is where precision becomes even stronger.

What Happens First?

Let’s say the initial retriever fetches:

  • top 50 chunks
  • top 100 chunks

These are good candidates.

But not all of them are equally useful.

Some are much better than others.

Now we need a smarter filter.

That is the job of the:

Reranker

How Reranking Works

The reranker evaluates each pair:

query + chunk

and scores:

how truly relevant this chunk is

This is much more accurate than simple vector similarity.

Because now the model looks at deeper contextual meaning.

Not just embedding distance.

Cross-Encoder Rerankers

A very common reranker is a:

Cross-Encoder

This is one of the most powerful retrieval upgrades in RAG.

Common Models Used

Many rerankers use models based on:

BERT-style encoders such as RoBERTa

And in modern systems…

even small LLMs can act as rerankers.

That flexibility is very powerful.

Why Reranking Matters So Much

This final filter step dramatically improves:

the quality of context sent to the LLM

And that directly improves:

final answer quality

Because in RAG:

the model can only answer based on what gets retrieved

Better retrieval → Better prompts → Better answers

Almost always.

Advanced Retrieval Techniques

So far, we’ve explored:

  • vector search
  • hybrid retrieval
  • reranking
  • multi-hop retrieval
  • self-querying

And all of these are powerful.

But now we move into something even more exciting.

This is where retrieval goes beyond simple search…

and starts becoming:

Intelligent Reasoning

These are the techniques used in cutting-edge RAG systems.

The kinds of systems powering:

  • enterprise copilots
  • autonomous AI agents
  • research assistants
  • complex decision-support platforms

Let’s explore them.

1. Graph RAG

Let’s start with one of the most powerful approaches:

Graph RAG

This is especially useful for:

multi-hop reasoning

and complex fact chaining.

Because sometimes the answer is not inside one chunk.

It exists across connected relationships.

The Core Idea

Traditional RAG treats documents like isolated chunks.

Each chunk is retrieved mostly through similarity.

But Graph RAG works differently.

It transforms information into:

a graph of entities and relationships

Instead of isolated text…

we now have connected knowledge.

Think of It Like a Network

For example:

  • RAG → uses → embeddings
  • embeddings → stored in → vector database
  • vector database → uses → ANN search

This creates a connected web of knowledge.

Now the system can do more than search.

It can:

traverse relationships

Almost like following links inside a knowledge network.

That is incredibly powerful.

Why This Matters

Let’s take a question like:

“Who is the CEO of the company that acquired GitHub?”

This requires multiple connected facts.

The system must first know:

  • who acquired GitHub

Then:

  • who leads that company

This is hard for simple vector retrieval.

But Graph RAG handles it beautifully.

Because it retrieves through:

entity connections

Not just semantic similarity.

That is a major upgrade.
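
The GitHub question can be sketched as a walk over a tiny graph. This is a toy illustration: the dictionary layout, the relation names `acquired_by` and `ceo`, and the `follow` helper are assumptions made for the example. A real Graph RAG system would extract entities and relations from documents and store them in a graph database.

```python
from typing import Optional

# (entity, relation) -> connected entity
graph = {
    ("GitHub", "acquired_by"): "Microsoft",
    ("Microsoft", "ceo"): "Satya Nadella",
}

def follow(start: str, relations: list[str], graph: dict[tuple[str, str], str]) -> Optional[str]:
    """Walk from the start entity along each relation in order."""
    node: Optional[str] = start
    for relation in relations:
        node = graph.get((node, relation))
        if node is None:
            return None  # the chain is broken: a fact is missing
    return node

# "Who is the CEO of the company that acquired GitHub?"
answer = follow("GitHub", ["acquired_by", "ceo"], graph)
```

Neither fact alone answers the question. The answer only appears once the two edges are followed in sequence.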

2. Agentic Retrieval

Now let’s talk about one of the most futuristic concepts in RAG:

Agentic Retrieval

This is where retrieval starts feeling like real autonomous research.

The Big Shift

Instead of only querying a vector database…

the system can decide:

Which tool should I use next?

That changes everything.

Because now retrieval becomes:

planning + decision-making

not just search.

Multiple Specialized Agents

For example:

  • one agent performs web search
  • one queries an internal database
  • one checks a knowledge graph
  • one retrieves from company documents

Each tool has a different strength.

The AI decides which one is best for the current problem.

That makes the system feel much more like a researcher.

Not just a chatbot.

This is becoming extremely important in:

  • enterprise copilots
  • autonomous workflows
  • advanced AI assistants

because real-world questions rarely live in one place.
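
A bare-bones sketch of tool selection might look like this. All names are hypothetical. In a real agentic system the `choose_tool` step would be an LLM decision, and here a keyword rule stands in for it so the example runs on its own.

```python
from typing import Callable

Tool = Callable[[str], list[str]]

def route(
    question: str,
    tools: dict[str, Tool],
    choose_tool: Callable[[str, list[str]], str],
) -> list[str]:
    """Ask the chooser which tool fits, then run that tool on the question."""
    name = choose_tool(question, list(tools))
    if name not in tools:
        raise ValueError(f"Unknown tool selected: {name}")
    return tools[name](question)

# Placeholder tools that only echo what they would search for.
tools: dict[str, Tool] = {
    "web_search": lambda q: [f"web result for: {q}"],
    "company_docs": lambda q: [f"internal document for: {q}"],
}

def keyword_chooser(question: str, tool_names: list[str]) -> str:
    # Stand-in for an LLM decision: internal policy questions go to company docs.
    return "company_docs" if "policy" in question.lower() else "web_search"

results = route("What is our leave policy?", tools, keyword_chooser)
```

The validation step is worth keeping even with an LLM chooser, since a model can return a tool name that does not exist.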

3. Recursive Retrieval

Next comes:

Recursive Retrieval

This is one of the smartest ideas in advanced RAG.

Because here…

the LLM uses its own output to improve the next retrieval step.

How It Works

Let’s say the system gives an initial answer.

While generating that answer, it realizes:

“I still need more information about this specific part.”

Instead of stopping…

it creates a new search query based on its own response.

That new query is sent back to the retriever.

More context is fetched.

And the answer improves.

This process can repeat multiple times.

Like the model is refining its own research.

That is recursive retrieval.

Why It Feels Powerful

It feels almost like the model is thinking:

“I’m not done yet. I need one more piece of evidence.”

That creates much deeper and more complete answers.

This is closely related to:

  • multi-hop retrieval
  • self-querying
  • agentic RAG

And it is extremely powerful for complex reasoning tasks.

4. Knowledge Graph Integration

Now let’s look at one of the strongest hybrid architectures in modern RAG.

Some advanced systems combine:

  • Vector Databases
  • Graph Databases

This gives the best of both worlds.

How This Architecture Works

Step 1

Vector search retrieves the most relevant entities or chunks.

For example:

retrieve entity → RAG

Step 2

A graph database then explores connected facts.

For example:

  • uses → embeddings
  • depends on → vector DB
  • improves → LLM accuracy

This creates:

semantic retrieval + relational reasoning

And that is incredibly powerful.

Because now the system understands both:

  • similarity
  • relationships

Not just one.

Advanced Querying

Some systems even use query languages like:

SPARQL

This allows highly structured graph queries.

Especially useful for:

  • enterprise knowledge systems
  • compliance platforms
  • healthcare retrieval
  • legal research systems

where relationships matter deeply.