Data Preparation

Course: Everything about Retrieval Augmented Generation (RAG)

Before we even think about:

  • embeddings
  • vector databases
  • retrieval pipelines
  • semantic search

there is one step that many people skip.

And surprisingly…

it can completely make or break your RAG system.

That step is:

Text Cleaning

It may not sound exciting.

It’s not flashy like vector search.

It doesn’t feel as advanced as embeddings.

But in production systems, this step is absolutely critical.

Think of It Like This

Imagine building a beautiful house.

You can have the best design…

the strongest walls…

the smartest architecture…

But if your foundation is weak, everything becomes unstable.

Text cleaning works the same way.

If you feed messy, broken, or noisy text into your pipeline…

your embeddings will also be messy.

And messy embeddings lead to:

  • poor retrieval
  • irrelevant search results
  • weak semantic matching
  • bad final answers

Which means your RAG system becomes unreliable.

So before retrieval starts…

clean data must come first.

Why Cleaning Matters

Let’s say you extract text from:

  • a PDF
  • a website
  • an HTML page
  • internal documentation
  • scanned reports

The raw output is often terrible.

It rarely looks clean.

Instead, you may see things like:

  • HTML tags
  • page headers and footers
  • repeated boilerplate text
  • random line breaks
  • strange spacing
  • broken words caused by PDF formatting

And all of this creates noise.

Noise is dangerous for embeddings.

The First Goal

So the very first goal is simple:

Remove unnecessary noise.

Keep only meaningful content.

That’s the heart of text cleaning.

We want the embedding model to encode useful information…

not formatting mistakes.

What Usually Gets Removed

Typical cleaning includes removing things like:

  • HTML tags
  • navigation menus
  • ads and banners
  • copyright footers
  • repeated page numbers
  • unnecessary whitespace
  • boilerplate text
  • broken line formatting

These things usually add zero value to retrieval.

But they can heavily damage embedding quality.
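
Repeated headers and footers are one of the easier cases to automate. Here is a minimal sketch, assuming your extractor gives you one string per page: any line that shows up on most pages is treated as boilerplate and dropped. The function name `remove_repeated_lines` and the `min_fraction` threshold are hypothetical choices for illustration, not a standard API.

```python
from collections import Counter


def remove_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that repeat on at least `min_fraction` of the pages.

    `pages` is a list of strings, one per extracted page (an assumption
    about how the extractor returns text).
    """
    # Count on how many pages each distinct line appears
    line_page_counts = Counter()
    for page in pages:
        unique_lines = {line.strip() for line in page.splitlines() if line.strip()}
        line_page_counts.update(unique_lines)

    # A line must appear on at least two pages to count as repeated
    threshold = max(2, min_fraction * len(pages))
    boilerplate = {line for line, n in line_page_counts.items() if n >= threshold}

    cleaned_pages = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned_pages.append("\n".join(kept).strip())
    return cleaned_pages


pages = [
    "ACME Handbook\nVacation policy: 20 days per year.\nConfidential",
    "ACME Handbook\nSick leave requires a doctor's note.\nConfidential",
    "ACME Handbook\nRemote work needs manager approval.\nConfidential",
]
print(remove_repeated_lines(pages))
```

Lines that change on every page, such as "Page 4", are not caught by this check and need a separate rule.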

A Website Example

Imagine you extract text from a webpage.

You might get something like:

Home | About | Contact | Privacy Policy

Now ask yourself:

Will this help answer user questions?

Usually, no.

This content is irrelevant for retrieval.

We don’t want the embedding model encoding website menus.

We want it encoding the actual article or knowledge content.

That is what matters.

Sample Text Cleaning code:

import re

def clean_text_for_retrieval(text):
    """
    Basic text cleaning for RAG retrieval:
    - Convert to lowercase
    - Remove special characters
    - Remove extra spaces
    """

    # Convert to lowercase
    text = text.lower()

    # Replace special characters and punctuation with a space,
    # so hyphenated terms like "retrieval-augmented" stay two words
    text = re.sub(r"[^a-z0-9\s]", " ", text)

    # Remove multiple spaces
    text = re.sub(r"\s+", " ", text).strip()

    return text


# Example usage
raw_text = """
RAG (Retrieval-Augmented Generation) helps LLMs 
retrieve relevant information before generating answers!
"""

cleaned_text = clean_text_for_retrieval(raw_text)

print("Original Text:")
print(raw_text)

print("\nCleaned Text:")
print(cleaned_text)
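
The sample above handles casing and punctuation, but not the HTML tags and navigation menus from the website example. Here is a minimal sketch of that step using only Python's standard `html.parser` module. The class name `VisibleTextExtractor` and the list of skipped tags are assumptions; real pages often need a site-specific list.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    # Tags whose content is usually boilerplate (an assumed list)
    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = (
    "<html><body>"
    "<nav>Home | About | Contact | Privacy Policy</nav>"
    "<p>RAG retrieves <b>relevant</b> context before answering.</p>"
    "<footer>Copyright 2026</footer>"
    "</body></html>"
)
print(html_to_text(sample_html))
```

The menu and the footer are dropped, and only the paragraph text survives.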

Text Normalization

Once the obvious noise is removed…

the next important step is:

Text Normalization

This step is all about one simple goal:

Make the text clean, consistent, and reliable

Because even after cleaning, extracted text can still be messy in subtle ways.

And those small issues can quietly damage your retrieval quality.

That’s why normalization matters so much.

What Does Normalization Mean?

Normalization means making the text follow a consistent format.

We want the data to look stable and predictable before it reaches the embedding model.

This usually includes things like:

  • converting everything into proper UTF-8 encoding
  • fixing broken characters
  • normalizing quotes and symbols
  • standardizing whitespace
  • optionally standardizing letter casing

It may sound small…

but in production systems, these details matter a lot.

Why UTF-8 Matters

Sometimes extracted text contains strange symbols or random encoding artifacts caused by PDFs, OCR systems, or bad file conversions.

These broken characters confuse embedding models.

Because the model tries to interpret them as actual text.

And that creates noisy embeddings.

That’s why proper UTF-8 encoding is important.

We want clean, readable text.

Not corrupted symbols.

Because remember:

Embeddings are only as good as the input text
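
Most of these broken characters appear when raw bytes are decoded with the wrong codec. Here is a minimal sketch of a safer decoding step. The fallback codec `cp1252` is an assumption (it is common for older Windows documents), and both helper names are hypothetical.

```python
def decode_bytes(raw_bytes, fallback_encoding="cp1252"):
    """Decode raw bytes as UTF-8, falling back to a legacy codec.

    The fallback codec is an assumption; pick the one that matches
    where your files actually come from.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" marks undecodable bytes with U+FFFD instead of failing
        return raw_bytes.decode(fallback_encoding, errors="replace")


def count_replacement_chars(text):
    """Count U+FFFD characters, a common sign of earlier decoding damage."""
    return text.count("\ufffd")


utf8_bytes = "café".encode("utf-8")
legacy_bytes = "café".encode("cp1252")

print(decode_bytes(utf8_bytes))    # café
print(decode_bytes(legacy_bytes))  # café
```

A high count from `count_replacement_chars` is a useful signal that a document should be re-extracted rather than embedded as it is.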

Fixing Symbols and Formatting

Normalization also means fixing things like:

  • smart quotes vs regular quotes
  • inconsistent dashes
  • broken punctuation
  • extra spaces
  • strange line spacing

For example:

AI   systems     are   powerful

should become:

AI systems are powerful

Simple.

Clean.

Consistent.

That consistency improves embedding quality.
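
The fixes in the list above can be sketched with a small translation table plus two regular expressions. The mapping below is an assumed starting set, not a complete one, and `normalize_text` is a hypothetical helper name.

```python
import re
import unicodedata

# Map typographic punctuation to plain ASCII equivalents
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00a0": " ",   # non-breaking space
})


def normalize_text(text):
    # NFC composes characters (e.g. "e" + combining accent) into one code point
    text = unicodedata.normalize("NFC", text)
    text = text.translate(PUNCTUATION_MAP)
    # Collapse runs of spaces and tabs, but keep line breaks
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(normalize_text("AI   systems     are   powerful"))  # AI systems are powerful
```

Line breaks are kept on purpose here, because paragraph boundaries are useful later when the text is split into chunks.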

Should We Standardize Letter Case?

Sometimes yes.

Sometimes no.

For example:

RAG
rag
Rag

A case-sensitive tokenizer treats these as three different strings, while an uncased model maps all three to the same tokens.

Some systems convert everything to lowercase.

Some preserve the original casing for domain-specific meaning.

This depends on your use case.

The goal is not “always lowercase.”

The goal is:

consistent representation

Preserve Metadata

The Step Many Beginners Miss

Now here’s something incredibly important…

and many beginners completely overlook it.

Along with text, you should also store metadata.

Not just content.

Metadata makes your RAG system much smarter.

What Kind of Metadata?

Useful metadata includes things like:

  • author name
  • timestamps
  • source URL
  • document title
  • section heading
  • file name
  • department name
  • update date

This information becomes incredibly powerful during retrieval.

Why Metadata Matters

Without metadata, your system only retrieves text.

With metadata, your system can also explain:

Where that information came from

For example:

Retrieved from company policy document

Updated March 2026

That changes everything.

Because now users trust the answer more.

It improves:

  • trust
  • explainability
  • traceability
  • compliance

And in enterprise AI, that is extremely important.

Sample code

import re
import unicodedata

def preprocess_for_rag(text, metadata=None):
    """
    Preprocessing for Retriever in RAG:
    1. UTF-8 safe handling
    2. Fix symbols and formatting
    3. Standardize letter case
    4. Preserve metadata
    """

    # -------------------------------
    # 1. UTF-8 Safe Handling
    # -------------------------------
    # Drop code points that cannot be encoded as UTF-8 (e.g. lone surrogates).
    # This does not repair text that was decoded with the wrong codec earlier.
    text = text.encode("utf-8", errors="ignore").decode("utf-8")

    # -------------------------------
    # 2. Fix Symbols and Formatting
    # -------------------------------
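    # Normalize Unicode compatibility forms (full-width letters, ligatures,
    # non-breaking spaces). NFKC leaves smart quotes as they are; the symbol
    # filter below strips them.
    text = unicodedata.normalize("NFKC", text)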

    # Replace multiple spaces/newlines with single space
    text = re.sub(r"\s+", " ", text).strip()

    # Remove unwanted special symbols (keep useful punctuation if needed)
    text = re.sub(r"[^\w\s.,!?-]", "", text)

    # -------------------------------
    # 3. Standardize Letter Case
    # -------------------------------
    text = text.lower()

    # -------------------------------
    # 4. Preserve Metadata
    # -------------------------------
    processed_document = {
        "cleaned_text": text,
        "metadata": metadata if metadata else {}
    }

    return processed_document


# Example usage
raw_text = """
“RAG” (Retrieval-Augmented Generation) improves LLMs 🚀

It helps models retrieve relevant knowledge before answering.
"""

document_metadata = {
    "source": "AI Course Notes",
    "author": "Rohan",
    "topic": "RAG Preprocessing"
}

result = preprocess_for_rag(raw_text, document_metadata)

print("Processed Document:\n")
print(result)
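
The sample above stores metadata once per document. In practice the document is later split into smaller pieces, and each piece needs its own copy so a retrieved passage can still say where it came from. Here is a minimal sketch of that idea. `ChunkRecord`, `attach_metadata`, and `format_citation` are hypothetical names, and the metadata keys `title` and `updated` are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(chunks, document_metadata):
    """Give every chunk a copy of the document metadata plus its position."""
    return [
        ChunkRecord(text=chunk, metadata={**document_metadata, "chunk_index": i})
        for i, chunk in enumerate(chunks)
    ]


def format_citation(record):
    """Build a short source line from whatever metadata is available."""
    title = record.metadata.get("title", "unknown source")
    updated = record.metadata.get("updated")
    return f"Retrieved from {title}" + (f", updated {updated}" if updated else "")


records = attach_metadata(
    ["Employees receive 20 vacation days.", "Unused days expire in March."],
    {"title": "company policy document", "updated": "March 2026"},
)
print(format_citation(records[1]))
```

Storing `chunk_index` also makes it possible to fetch the neighbouring pieces of a retrieved passage when more context is needed.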

Why Chunking Matters in RAG

One of the Most Important Building Blocks

Now let’s talk about one of the most important building blocks of any RAG pipeline:

Chunking

If embeddings are the brain of retrieval…

chunking is the foundation.

And honestly, many people underestimate how important it is.

But in real-world RAG systems, chunking can completely change the quality of your final answers.

That’s how powerful this step is.

What Is Chunking?

At its core, chunking is actually very simple.

It is the process of taking a large document…

and splitting it into smaller pieces.

These smaller pieces are called:

chunks

That’s it.

Simple idea.

Huge impact.

A Simple Example

Imagine you have a 50-page PDF.

Now ask yourself:

Should we create one giant embedding for the entire document?

Usually…

absolutely not.

That would be:

  • too large
  • too expensive
  • difficult to retrieve from
  • bad for semantic search

Instead, we break that document into smaller, manageable sections.

Each section becomes its own chunk.

This makes the system dramatically more efficient.

What Happens to Each Chunk?

Once the document is split, every chunk can be:

  • individually indexed
  • converted into embeddings
  • stored in a vector database
  • retrieved independently

This is exactly what makes RAG scalable.

Because now the system does not need to search the entire document every time.

It only needs to retrieve the most relevant chunks.

That is a huge improvement.
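
To make the index, embed, store, retrieve cycle concrete, here is a toy sketch that runs without any external service. The important assumption: `toy_embed` is only a bag-of-words stand-in for a real embedding model, and `InMemoryIndex` is a hypothetical stand-in for a vector database.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    def __init__(self):
        self.entries = []  # list of (chunk_text, vector) pairs

    def add(self, chunk):
        # Each chunk is embedded and stored independently
        self.entries.append((chunk, toy_embed(chunk)))

    def search(self, query, top_k=1):
        query_vector = toy_embed(query)
        scored = [(cosine(query_vector, vec), chunk) for chunk, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:top_k]]


index = InMemoryIndex()
for chunk in [
    "salary policy: salaries are reviewed every april",
    "vacation policy: employees receive 20 days per year",
    "security policy: laptops must use disk encryption",
]:
    index.add(chunk)

print(index.search("when are salaries reviewed"))
```

Only the salary chunk comes back. The other two are never passed to the LLM.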

The Real Goal of Chunking

A great way to think about it is this:

The goal of chunking is to break complex information into smaller, meaningful, retrievable units.

That improves:

  • retrieval accuracy
  • speed
  • scalability
  • cost efficiency

And reduces unnecessary computation.

That is the true purpose of chunking.

Reason 1 — Context Window Constraints

The first major reason chunking is necessary is something called:

Context Window Constraints

This is one of the most practical limitations in modern AI systems.

Models Cannot Read Everything at Once

Both:

  • embedding models
  • large language models (LLMs)

have strict limits on how much text they can process at one time.

This is called the:

context window

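You cannot simply take a 100-page document and feed it directly into an embedding model.

It won’t fit: many embedding models accept only a few hundred to a few thousand tokens, and text beyond that limit is typically truncated or rejected.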

The model has limits.

And those limits are real engineering constraints.

How Chunking Solves This

Well-sized chunks solve this problem beautifully.

Instead of trying to process the full document, we split it into smaller sections that fit safely inside the model’s limits.

This ensures:

  • embeddings can be generated properly
  • retrieval remains fast
  • prompts stay manageable
  • LLM responses remain grounded

Without chunking, the pipeline simply breaks.

With chunking, it becomes practical.

That’s why this step is not optional.

It is foundational.

Reason 2 — Better Retrieval Efficiency

The second reason is:

Faster and More Accurate Retrieval

This is where chunking starts showing its real power.

Smaller, well-defined chunks make it much easier for the vector database to find the right information quickly.

Think about it like this.

A Simple Analogy

Imagine I ask you:

“Find one sentence about salary policy”

Now you have two choices.

Option 1

Search inside an entire 500-page book.

Option 2

Search inside one specific paragraph.

Obviously…

the second option is much faster.

And much more accurate.

That is exactly how chunking helps RAG.

Instead of searching massive documents, the system searches focused chunks.

That improves both:

  • search speed
  • recall quality

And that leads to better retrieval results.

Reason 3 — Computational Optimization

The third reason is something every production system cares about:

Cost and Performance

This is where engineering decisions become very real.

Because bigger chunks often mean bigger costs.

Why Larger Chunks Become Expensive

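Larger chunks usually create:

  • longer prompts, because every retrieved chunk carries more tokens
  • higher LLM cost and latency per query
  • diluted embeddings that mix several topics
  • more irrelevant text around the useful passage

The embedding vector itself does not grow, because its dimension is fixed by the model.

Very small chunks have the opposite problem.

More chunks mean more vectors, and more vectors mean more database load.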

That affects:

  • storage cost
  • latency
  • embedding API cost
  • retrieval time

Especially at scale, this becomes a serious problem.

Imagine storing millions of chunks.

Even small inefficiencies become expensive very quickly.
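
A rough back-of-the-envelope sketch shows how quickly this adds up. The figures below (1,000,000 chunks, 384 dimensions, 4-byte floats) are assumptions for illustration only, and the result covers raw vectors without any index overhead or metadata.

```python
import math


def index_size_bytes(num_chunks, embedding_dim, bytes_per_value=4):
    """Raw vector storage: one float32 (4 bytes) per dimension per chunk."""
    return num_chunks * embedding_dim * bytes_per_value


def chunks_needed(total_tokens, chunk_size, overlap=0):
    """How many chunks a corpus produces for a given chunk size and overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    return max(1, math.ceil((total_tokens - overlap) / stride))


# Hypothetical corpus: 1,000,000 chunks embedded into 384 dimensions
size = index_size_bytes(num_chunks=1_000_000, embedding_dim=384)
print(f"{size / 1e9:.2f} GB")  # 1.54 GB
```

Halving the chunk size roughly doubles the number of vectors, so it roughly doubles this figure too.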

Smart Chunk Sizes Save Money

This is why chunk size matters so much.

A well-designed chunking strategy reduces unnecessary processing.

Not too small.

Not too large.

Just enough to preserve meaning without wasting resources.

That balance is what makes production RAG systems efficient.

Because great architecture is not just about accuracy.

It is also about sustainability.

Reason 4 — Better Semantic Relevance

And finally…

probably the most important reason of all:

Better Relevance

This is where chunking directly affects answer quality.

Retrieval Depends on Meaning

If chunks are split intelligently, each chunk preserves one meaningful idea.

That means when retrieval happens, the system can find:

the exact right context

instead of fragmented or incomplete information.

For example:

A full explanation about embeddings should stay together.

Not split across three unrelated chunks.

Because if the meaning gets broken…

the answer quality drops.

This is why semantic chunking is so powerful.

The Golden Rule of RAG

Here is one of the most important truths in RAG:

Better retrieval = Better generation

The LLM can only answer based on what gets retrieved.

It cannot magically fix missing context.

It cannot invent the right answer reliably.

If the wrong chunk is retrieved…

the final answer becomes weak.

That means:

chunk quality directly impacts response quality

Always.

Filter Bad Chunks

Not Every Chunk Deserves an Embedding

Here’s another important lesson:

Just because text exists does not mean it should become an embedding

Some extracted chunks are simply useless.

And storing them only increases cost while reducing retrieval quality.

Examples of Bad Chunks

These are usually not helpful:

  • standalone numbers
  • random symbols
  • page numbers
  • isolated headings
  • navigation fragments
  • extremely short text pieces

For example:

Page 4

This adds no real semantic value.

It should not become an embedding.

Why Filtering Matters

If you store useless chunks, two bad things happen:

Higher Storage Cost

More vectors = more database size

Which means higher cost.

Worse Retrieval Quality

The system may accidentally retrieve meaningless chunks.

That leads to:

bad context → weak prompts → poor answers

And that hurts the entire RAG pipeline.

So the rule is simple:

Only keep semantically meaningful text

Everything else should be removed.
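
A filter like this can be sketched with two cheap heuristics. The thresholds (`min_words=5`, at least half of the characters being letters) are assumptions to tune on your own data, and `is_meaningful_chunk` is a hypothetical helper name.

```python
import re


def is_meaningful_chunk(chunk, min_words=5, min_letter_ratio=0.5):
    """Heuristic filter; the thresholds are assumptions to tune on your data."""
    text = chunk.strip()

    # Count alphabetic words only, so "|" separators and numbers do not count
    words = re.findall(r"[^\W\d_]+", text)
    if len(words) < min_words:
        return False

    # Reject text that is mostly digits or symbols
    letters = sum(ch.isalpha() for ch in text)
    if letters / len(text) < min_letter_ratio:
        return False

    return True


chunks = [
    "Page 4",
    "Home | About | Contact",
    "12345",
    "*** --- ***",
    "Employees may carry over up to five unused vacation days.",
]
kept = [chunk for chunk in chunks if is_meaningful_chunk(chunk)]
print(kept)
```

Only the last chunk survives. Running the filter before embedding means the useless pieces never cost an embedding call or a database row.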

Text Chunking Strategies

Strategy 1 — Fixed Size Chunking

Let’s start with the most common approach:

Fixed Size Chunking

This is the simplest and most widely used method.

And for many projects, it’s the perfect starting point.

How It Works

Here, we simply decide:

How many tokens should go inside each chunk?

For example:

  • 256 tokens
  • 512 tokens
  • 1024 tokens

The document is then split into chunks of roughly that size.

That’s it.

Simple.

Fast.

Extremely practical.

This is why fixed-size chunking is often the default choice in most RAG systems.

Why It Works So Well

The biggest advantage is simplicity.

You don’t need:

  • complex parsing
  • advanced NLP libraries
  • document structure analysis

Just split by size and move forward.

That makes it computationally cheap and easy to implement.

Especially for:

  • PDFs
  • support documents
  • internal reports
  • knowledge base articles

For many real-world applications, this is often the best place to start.

The Hidden Problem

But there’s one challenge.

Sometimes an important sentence starts at the end of one chunk…

and finishes in the next.

If we split too aggressively, we lose meaning.

That hurts retrieval quality.

And this is where something very important comes in.

Overlap Between Chunks

We usually keep some overlap between chunks.

This is often called a:

Sliding Window

For example:

Chunk 1 = tokens 1 to 500
Chunk 2 = tokens 451 to 950

Notice that some tokens appear in both chunks.

That overlap helps preserve context.

Because meaning often lives across sentence boundaries.

Without overlap, important context may get lost.

With overlap, retrieval becomes much stronger.

Sample code

def fixed_size_chunking(text, chunk_size=100):
    """
    Fixed-size chunking for RAG:
    Splits text into chunks of equal size based on word count.

    Parameters:
    - text: input document
    - chunk_size: number of words per chunk

    Returns:
    - List of text chunks
    """

    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks


# Example usage
sample_text = """
Retrieval-Augmented Generation (RAG) improves large language models
by allowing them to retrieve relevant information before generating answers.
This helps reduce hallucinations and improves response accuracy.
Chunking is one of the most important preprocessing steps in RAG systems.
It helps divide large documents into smaller searchable pieces.
"""

result = fixed_size_chunking(sample_text, chunk_size=10)

print("Fixed Size Chunks:\n")

for idx, chunk in enumerate(result, 1):
    print(f"Chunk {idx}: {chunk}\n")
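
The sample above splits without overlap. Here is a minimal sketch of the sliding-window variant described earlier. Like the sample, it counts words rather than model tokens, which is a simplifying assumption; `sliding_window_chunking` and its `overlap` parameter are hypothetical names.

```python
def sliding_window_chunking(text, chunk_size=100, overlap=20):
    """Split text into word-based chunks where neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    stride = chunk_size - overlap  # how far the window moves each step
    chunks = []

    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break

    return chunks


demo_text = " ".join(f"w{i}" for i in range(1, 13))  # w1 ... w12
for chunk in sliding_window_chunking(demo_text, chunk_size=5, overlap=2):
    print(chunk)
```

Each chunk starts with the last two words of the previous one, so a sentence that crosses a boundary appears whole in at least one chunk as long as it is shorter than the overlap.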

Strategy 2 — Recursive Chunking

Now let’s move to something smarter.

Recursive Chunking

This is one of the most popular chunking strategies in production-grade RAG systems.

And for good reason.

It balances both:

  • size control
  • semantic continuity

Very effectively.

How Recursive Chunking Works

Instead of splitting blindly by token count…

recursive chunking uses a hierarchy of separators.

Something like this:

  1. First split by headings
  2. Then by paragraphs
  3. Then by sentences
  4. Then by words

The idea is simple:

Try the most natural split first.

If the chunk is still too large, split again using the next level.

And continue recursively until the chunk reaches the target size.

That’s why it is called:

Recursive Text Splitting

Why This Is So Powerful

The chunks may not be exactly identical in size…

but they preserve meaning much better.

Because instead of cutting randomly, the system tries to respect the natural structure of the document.

That gives us the best of both worlds:

size control + semantic continuity

And that is incredibly valuable for retrieval quality.

Sample code

import re


def recursive_chunking(text, max_chunk_size=100):
    """
    Recursive chunking for RAG:
    Splits text hierarchically using paragraphs -> sentences -> words.

    Parameters:
    - text: input document
    - max_chunk_size: maximum words allowed per chunk

    Returns:
    - List of chunks
    """

    def split_by_words(content):
        words = content.split()
        return [
            " ".join(words[i:i + max_chunk_size])
            for i in range(0, len(words), max_chunk_size)
        ]

    def split_recursively(content):
        content = content.strip()

        # Base case: already small enough
        if len(content.split()) <= max_chunk_size:
            return [content] if content else []

        # Step 1: Try splitting by paragraph (blank line)
        parts = [p for p in re.split(r"\n\s*\n", content) if p.strip()]

        # Step 2: Try splitting by sentence, keeping the end punctuation
        if len(parts) < 2:
            parts = [p for p in re.split(r"(?<=[.!?])\s+", content) if p.strip()]

        # Step 3: Fallback: no separator made progress, so split by words
        # (this also prevents infinite recursion on one very long sentence)
        if len(parts) < 2:
            return split_by_words(content)

        chunks = []
        for part in parts:
            chunks.extend(split_recursively(part))

        return chunks

    return split_recursively(text)


# Example usage
sample_text = """
RAG systems work better when documents are properly chunked.

Chunking helps break large text into smaller pieces.
These smaller chunks improve retrieval accuracy.

Recursive chunking first tries paragraphs,
then sentences, and finally words if needed.
This preserves meaning better than fixed-size chunking.
"""

result = recursive_chunking(sample_text, max_chunk_size=20)

print("Recursive Chunks:\n")

for idx, chunk in enumerate(result, 1):
    print(f"Chunk {idx}: {chunk}\n")

Strategy 3 — Document-Specific Chunking

Let the Document Structure Guide the Split

This method takes the actual structure of the document into account.

Instead of splitting by arbitrary token counts…

it asks:

How was this document originally organized?

That question makes a huge difference.

Why This Matters

Documents are rarely just random blocks of text.

They usually have structure.

For example:

  • paragraphs
  • headings
  • subsections
  • bullet lists
  • tables

That structure exists for a reason.

It reflects how the author organized meaning.

Document-specific chunking tries to preserve that structure.

Instead of breaking text blindly, it splits based on the logical flow of the content.

This makes retrieval far more meaningful.

A Simple Example

Imagine retrieving:

  • half of a paragraph
  • and half of a heading

combined into one chunk.

That would feel confusing.

The meaning would be broken.

The answer quality would drop.

Document-specific chunking avoids that.

It keeps related content together in the way it was originally written.

That improves coherence significantly.

Where This Works Best

This strategy is especially powerful for structured formats like:

  • Markdown
  • HTML
  • technical documentation
  • product manuals
  • policy documents
  • internal knowledge bases

Because these formats already contain strong structural boundaries.

And using those boundaries improves retrieval quality dramatically.

Sample code

def document_specific_chunking(document, doc_type="general"):
    """
    Document-specific chunking for RAG:
    Splits content differently based on document type.

    Supported types:
    - article → split by paragraphs
    - faq → split by questions
    - code → split by functions/classes
    - general → fixed paragraph split

    Parameters:
    - document: input text
    - doc_type: type of document

    Returns:
    - List of chunks
    """

    chunks = []

    if doc_type == "article":
        # Split by paragraphs
        chunks = [part.strip() for part in document.split("\n\n") if part.strip()]

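    elif doc_type == "faq":
        # Start a new chunk at every line that ends with a question mark,
        # so each chunk holds one question together with its answer
        # (assumes each question sits on its own line)
        current_chunk = []

        for line in document.split("\n"):
            if line.strip().endswith("?") and current_chunk:
                chunks.append("\n".join(current_chunk).strip())
                current_chunk = []
            current_chunk.append(line)

        if current_chunk:
            chunks.append("\n".join(current_chunk).strip())

        # Drop empty chunks (e.g. from leading blank lines)
        chunks = [chunk for chunk in chunks if chunk]

    elif doc_type == "code":
        # Start a new chunk at every top-level function/class definition
        lines = document.split("\n")
        current_chunk = []

        for line in lines:
            if line.startswith("def ") or line.startswith("class "):
                if current_chunk:
                    chunks.append("\n".join(current_chunk).strip())
                    current_chunk = []
            current_chunk.append(line)

        if current_chunk:
            chunks.append("\n".join(current_chunk).strip())

        # Drop empty chunks (e.g. from leading blank lines)
        chunks = [chunk for chunk in chunks if chunk]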

    else:
        # Default split by paragraphs
        chunks = [part.strip() for part in document.split("\n\n") if part.strip()]

    return chunks


# Example usage
sample_code = """
class RAGPipeline:
    def __init__(self):
        self.name = "Retrieval-Augmented Generation"

    def retrieve(self, query):
        return f"Retrieving relevant documents for: {query}"


def clean_text(text):
    text = text.lower()
    return text


class EmbeddingModel:
    def generate_embedding(self, text):
        return f"Vector representation for: {text}"
"""

result = document_specific_chunking(sample_code, doc_type="code")

print("Document-Specific Chunks:\n")

for idx, chunk in enumerate(result, 1):
    print(f"Chunk {idx}: {chunk}\n")
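
Markdown is the first format in the list above, so here is a minimal sketch of heading-based splitting for it. It assumes ATX-style headings (`# Title`) and does not handle `#` lines inside fenced code. `markdown_heading_chunking` is a hypothetical helper, and keeping the heading as metadata is a design choice, not a requirement.

```python
import re

HEADING_PATTERN = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_heading_chunking(markdown_text):
    """Split Markdown into one chunk per heading, keeping the heading as metadata."""
    chunks = []
    current_heading = None
    current_lines = []

    def flush():
        # Store the section collected so far, skipping empty sections
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({"heading": current_heading, "text": body})

    for line in markdown_text.splitlines():
        match = HEADING_PATTERN.match(line)
        if match:
            flush()
            current_heading = match.group(2).strip()
            current_lines = []
        else:
            current_lines.append(line)

    flush()
    return chunks


sample_markdown = """# Leave Policy
Intro text.

## Vacation
Employees receive 20 days.

## Sick Leave
A doctor's note is required.
"""

for section in markdown_heading_chunking(sample_markdown):
    print(section["heading"], "->", section["text"])
```

Each section stays intact, and the heading travels with the chunk so it can be shown as a source or added to the text before embedding.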

Strategy 4 — Semantic Chunking

Split by Meaning, Not Just Format

And now we come to the most advanced strategy:

Semantic Chunking

This is where chunking becomes truly intelligent.

Instead of focusing on:

  • token size
  • formatting
  • paragraph boundaries

it focuses on something deeper:

meaning

This method asks:

Where does one idea end and the next one begin?

That is the heart of semantic chunking.

What Is Semantic Chunking?

In simple words:

Each chunk should represent one complete idea

Not half an idea.

Not two unrelated ideas combined.

One complete semantic unit.

That’s the goal.

This is often called:

Semantic Segmentation (in NLP, also known as text segmentation or topic segmentation)

Because we are segmenting the document based on meaning.

A Real Example

Let’s say a paragraph is explaining:

embeddings in RAG

Even if the paragraph is long, semantic chunking tries to keep the full concept together.

Because splitting the idea in the middle would hurt retrieval.

The chunk should feel complete.

When it gets retrieved later, it should make sense on its own.

That dramatically improves answer quality.

This is where semantic chunking becomes incredibly powerful.

The Trade-Off

Of course, there is a cost.

Semantic chunking is slower.

Much slower.

Because now the system must analyze relationships inside the text.

It must understand:

  • topic boundaries
  • concept transitions
  • semantic continuity

That makes it computationally more expensive than simpler chunking strategies.

So it is not always the default choice.

But for high-accuracy systems…

it can be absolutely worth it.

Sample code

import re

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def semantic_chunking(text, similarity_threshold=0.7):
    """
    Semantic chunking for RAG:
    Groups sentences together based on semantic similarity.

    Parameters:
    - text: input document
    - similarity_threshold: minimum similarity score required
      to keep sentences in the same chunk

    Returns:
    - List of semantic chunks
    """

    # Step 1: Split text into sentences, keeping the end punctuation
    # and collapsing line breaks inside a sentence
    sentences = [
        " ".join(sentence.split())
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip())
        if sentence.strip()
    ]

    # Nothing to chunk
    if not sentences:
        return []

    # Step 2: Load embedding model (in real use, load it once and reuse it)
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Step 3: Generate embeddings
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    # Step 4: Compare sentence similarity
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(
            [embeddings[i - 1]],
            [embeddings[i]]
        )[0][0]

        if similarity >= similarity_threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    # Add final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks


# Example usage
sample_text = """
RAG systems use embeddings to convert text into vectors.
These vectors help measure semantic similarity between documents.

Vector databases store these embeddings efficiently.
They allow fast retrieval of the most relevant chunks.

Chunking improves retrieval quality by breaking large documents
into smaller meaningful sections before embedding.

Semantic chunking groups related ideas together
instead of splitting text only by fixed size.
"""

result = semantic_chunking(sample_text, similarity_threshold=0.6)

print("Semantic Chunks:\n")

for idx, chunk in enumerate(result, 1):
    print(f"Chunk {idx}: {chunk}\n")