Embeddings

Course: Everything about Retrieval Augmented Generation (RAG)

What Are Embeddings?

How AI Converts Meaning into Math

Embeddings are one of the most important concepts in modern AI, machine learning, and especially systems like RAG.

In fact, once you truly understand embeddings, a lot of advanced AI systems start making perfect sense.

So let’s understand this in the simplest way possible.

The Core Idea

At a simple level, embeddings are numerical representations of data.

This data can be:

  • text
  • images
  • audio
  • documents
  • even user queries

But here’s the key idea:

Embeddings do not just store the exact words.

Instead, they capture the meaning behind the data.

That is what makes them so powerful.

Why Does AI Need Embeddings?

Humans understand words naturally.

When we hear the word dog, we instantly imagine what it means.

But AI does not understand language the way humans do.

It only understands numbers.

So before a machine can work with text, that text must first be converted into numbers.

That conversion is exactly what embeddings do.

They transform words, sentences, paragraphs, or even entire documents into a list of numbers.

This list of numbers is called a vector.

\vec{e}_{\text{text}} = [0.23, -1.17, 0.84, \dots]

This is what we call a vector embedding.

The Really Interesting Part

Now here’s where it gets exciting.

These numbers are not random.

They are arranged in such a way that similar meanings stay close together in mathematical space.

Imagine a huge invisible 3D map.

On this map, words and sentences with similar meanings are placed near each other.

For example:

  • dog and puppy would be very close
  • king and queen would also be close
  • dog and airplane would be far apart

This is how AI understands similarity.

Not through exact word matching…

but through distance in vector space.

This Is Called Semantic Similarity

This ability to understand meaning-based closeness is called semantic similarity.

In simple words:

If two pieces of text mean similar things, their embeddings will be close.

That’s the secret.
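
To make this concrete, here is a tiny sketch of how that closeness is measured. The vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions, but the math is the same:

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, invented for illustration only.
dog = [0.9, 0.8, 0.1]
puppy = [0.85, 0.75, 0.15]
airplane = [0.1, 0.2, 0.95]

print(cosine_similarity(dog, puppy))     # high: similar meaning
print(cosine_similarity(dog, airplane))  # low: unrelated meaning
```

Cosine similarity ranges from -1 to 1; the closer to 1, the closer the meanings sit in vector space.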

One Simple Line to Remember

If you remember just one thing, remember this:

Embeddings are how AI converts meaning into math

Without embeddings, AI would mostly rely on keyword matching.

With embeddings, it understands:

  • context
  • similarity
  • meaning
  • user intent

And that is what makes modern AI systems feel truly intelligent.

Traditional Word Representation

Before embeddings and modern AI models changed everything, there was a much older and simpler way of representing words.

And understanding this older method is incredibly important.

Because once you see its limitations, you’ll immediately understand why embeddings became such a breakthrough.

So let’s start from the beginning.

The Earliest Way Machines Understood Words

Imagine you have a small vocabulary that contains just a few words:

  • cat
  • dog
  • car
  • house

In the traditional approach, every unique word is assigned a unique integer ID.

Something like this:

cat   = 1
dog   = 2
car   = 3
house = 4

Now whenever a sentence comes in, each word is simply replaced by its corresponding number.

For example:

“dog car” → [2, 3]

This was one of the earliest ways machines processed text.
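
A minimal sketch of this integer-ID scheme in Python:

```python
# Each unique word gets a unique integer ID.
vocab = {"cat": 1, "dog": 2, "car": 3, "house": 4}

def encode(sentence):
    # Replace every word with its assigned ID.
    return [vocab[word] for word in sentence.split()]

print(encode("dog car"))  # [2, 3]
```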

At first glance, it seems simple and efficient.

And honestly, for small examples, it works.

But the real problem starts when the vocabulary becomes large.

The First Big Problem

Language is huge.

Real-world vocabularies can contain:

  • tens of thousands of words
  • hundreds of thousands of terms
  • sometimes even millions

As the vocabulary grows, the number of possible word representations explodes.

And that leads us to one of the most classic techniques in NLP:

One-Hot Encoding

One-hot encoding is one of the simplest ways to represent words.

Let’s say our vocabulary has 4 words.

That means every word will now be represented as a vector of length 4.

For example, the word dog might look like this:

[0, 1, 0, 0]

And the word car becomes:

[0, 0, 1, 0]

Notice what’s happening.

Every position is 0, except one position, which is 1.

That single 1 tells the model which word it is.

Think of it like a switchboard.

Only one switch is turned on at a time.

That’s exactly why it is called one-hot encoding.

def build_one_hot(text):
    # Collect the unique words to form the vocabulary.
    tokens = text.split()
    unique_words = list(set(tokens))

    # Assign each word a fixed position in the vector.
    index_map = {w: idx for idx, w in enumerate(unique_words)}
    encoded_vectors = []

    # Each token becomes an all-zero vector with a single 1
    # at that word's position.
    for token in tokens:
        vector = [0] * len(unique_words)
        vector[index_map[token]] = 1
        encoded_vectors.append(vector)

    return encoded_vectors, index_map, unique_words


sample_text = "cat in the hat dog on the mat bird in the tree"

vectors, mapping, vocab = build_one_hot(sample_text)

print("Vocabulary:", vocab)
print("Word Index Map:", mapping)
print("One-Hot Encodings:")

for token, vec in zip(sample_text.split(), vectors):
    print(f"{token}: {vec}")

Sounds Simple… But Here’s the Problem

While this method is easy to understand, it comes with major limitations.

And these limitations are exactly why modern NLP moved away from it.

Problem 1 — High Dimensionality

Let’s imagine your vocabulary has 50,000 words.

That means every single word becomes a vector of length 50,000.

Only one value is 1.

The remaining 49,999 values are 0.

That means:

  • huge memory usage
  • slow computations
  • inefficient storage

The system becomes expensive very quickly.

Especially at scale.
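
A quick back-of-the-envelope calculation shows why. Assuming 4-byte floats and a 50,000-word vocabulary:

```python
vocab_size = 50_000
bytes_per_float = 4  # 32-bit float

# One one-hot vector per word in the vocabulary
one_vector = vocab_size * bytes_per_float  # bytes for a single word
full_table = vocab_size * one_vector       # bytes for every word

print(f"One vector:  {one_vector / 1024:.0f} KB")
print(f"Full table:  {full_table / 1024**3:.1f} GB")
```

Roughly 10 GB just to store a vector for every word, almost all of it zeros.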

Problem 2 — No Semantic Meaning

This is the much bigger problem.

One-hot encoding does not understand meaning.

For example:

  • king
  • queen

These are clearly related words.

Humans instantly understand the connection.

But in one-hot encoding, their vectors are completely different.

And mathematically, they are just as distant as any unrelated words.

So for the model:

  • king ↔ queen is no closer than
  • king ↔ airplane

That’s a huge limitation.

The machine sees every word as an isolated ID.

There is no understanding of:

  • meaning
  • context
  • relationships
  • similarity
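
We can verify this mathematically. The dot product of any two distinct one-hot vectors is always zero, so every pair of words looks equally unrelated:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One-hot vectors for a 4-word vocabulary
king  = [1, 0, 0, 0]
queen = [0, 1, 0, 0]
plane = [0, 0, 1, 0]

# Every pair of distinct one-hot vectors is orthogonal:
print(dot(king, queen))  # 0
print(dot(king, plane))  # 0
```

To the model, king ↔ queen and king ↔ plane are equally dissimilar.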

Problem 3 — Out-of-Vocabulary Words

There is one final major issue.

What happens if a completely new word appears later?

For example, suppose the system was never trained on the word:

spaceship

It simply cannot represent it.

Because that word has no predefined slot in the vocabulary.

This is called the out-of-vocabulary problem.

And this was one of the biggest challenges in traditional NLP systems.

Bag of Words

Before modern AI systems started understanding meaning through embeddings, one of the most widely used techniques for representing text was something called Bag of Words, or simply BoW.

The Core Idea

Imagine you have a sentence:

“AI is changing the world”

Now instead of trying to understand:

  • sentence structure
  • grammar
  • meaning
  • word order

the Bag of Words model does something much simpler.

It breaks the sentence into individual words and counts how many times each word appears.

That’s it.

Think of the sentence as a bag filled with words.

The order of the words doesn’t matter.

Only two things matter:

  • whether a word is present
  • how many times it appears

That’s why it is called Bag of Words.

Here is how Bag of Words looks in code, using scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(texts)

terms = bow_vectorizer.get_feature_names_out()

print("BoW Representation (Matrix):")
print(bow_matrix.toarray())

print("\nExtracted Vocabulary:")
print(terms)

A Simple Example

Let’s take two sentences:

“AI is powerful”

and

“powerful AI is”

Now as humans, we read them in sequence.

But Bag of Words does not care about the order.

To BoW, both sentences are exactly the same.

Why?

Because both contain the same words with the same frequency.

So the representation becomes something like:

AI        = 1
powerful  = 1
is        = 1

This creates a numerical representation that machine learning models can work with.

Simple.

Fast.

And surprisingly effective for many traditional NLP tasks.

At the time, this was revolutionary.

For tasks like:

  • text classification
  • spam detection
  • sentiment analysis
  • document clustering

Bag of Words worked really well.

Because it converted language into numbers in a very straightforward way.

But There’s a Catch

Actually…

there are two major limitations.

And these limitations are exactly why modern NLP moved beyond it.

Problem 1 — It Ignores Word Order

This is the biggest issue.

Bag of Words completely ignores sequence.

That means it loses all contextual meaning.

Let’s look at two sentences:

“dog bites man”

and

“man bites dog”

In real life, these mean completely different things.

One is normal.

The other is shocking.

But in Bag of Words, they look almost identical.

Because the same words appear with the same frequency.

The sequence is ignored.

That’s a huge loss of meaning.

This is one of the biggest reasons traditional NLP systems struggled with context.
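
You can see this with Python's built-in Counter, which behaves exactly like a bag of words:

```python
from collections import Counter

# Two sentences with opposite meanings...
bow_1 = Counter("dog bites man".split())
bow_2 = Counter("man bites dog".split())

print(bow_1)
print(bow_1 == bow_2)  # True: identical bags, opposite meanings
```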

Problem 2 — Sparse Representation

Now let’s imagine a much larger vocabulary.

Suppose your dataset contains 10,000 unique words.

Most documents will only contain a tiny fraction of them.

That means the resulting vector looks something like this:

[0, 0, 1, 0, 0, 0, 1, 0, …]

Notice how most values are zero.

This is called a sparse matrix.

A sparse matrix means:

  • lots of zeros
  • very few meaningful values

And this becomes highly memory-intensive.

It consumes large amounts of storage and slows down computation.

Especially for large datasets.

TF-IDF

After Bag of Words, one of the biggest breakthroughs in traditional NLP was something called TF-IDF.

And honestly, this technique was a game-changer for early information retrieval systems.

Before embeddings and semantic search, TF-IDF was one of the smartest ways to figure out:

Which words actually matter in a document?

Let’s understand it with a simple story.

The Core Problem

Imagine you have hundreds of documents.

Now you want the machine to figure out which words are truly important in each one.

But there’s a problem.

Some words appear almost everywhere.

Words like:

  • the
  • is
  • and
  • of

These words are frequent…

but they don’t really tell us what the document is about.

They are common, but not informative.

Now compare that with words like:

  • retrieval
  • embedding
  • vector database

These may appear less often across the full collection…

but when they do appear, they carry much stronger meaning.

This is exactly what TF-IDF helps us measure.

What Does TF-IDF Stand For?

TF-IDF stands for:

Term Frequency – Inverse Document Frequency

At first, the name sounds technical.

But once we break it into two parts, it becomes very intuitive.

Part 1 — Term Frequency (TF)

The first part is Term Frequency, or TF.

This simply measures:

How often a word appears inside a specific document

For example, if the word RAG appears 10 times in a document, its term frequency is high.

In simple words:

The more a word appears in a document, the more important it might be for that document.

Part 2 — Inverse Document Frequency (IDF)

Now comes the smarter part.

This is Inverse Document Frequency, or IDF.

This measures:

How rare or unique a word is across the entire collection of documents

Here’s the key idea.

If a word appears in almost every document, its importance should go down.

Because it’s too common.

But if a word appears in only a few documents, its importance should increase.

That makes it much more informative.

This is how TF-IDF filters out common words and highlights meaningful ones.

The Formula

The final score is calculated as:

\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}

That means:

importance inside the document × uniqueness across all documents

So the higher the TF-IDF score, the more important that word is for that specific document.
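
The two pieces are commonly defined as follows (the literature has several variants; this is one standard textbook form):

```latex
\mathrm{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}
\qquad
\mathrm{IDF}(t) = \log\!\left(\frac{N}{\text{number of documents containing } t}\right)
```

where N is the total number of documents in the collection.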

A Real-World Example

Let’s imagine you have 1,000 documents.

Now suppose:

  • the word AI appears in 900 documents
  • the word RAG appears in only 20 documents

Clearly, AI is very common.

But RAG is much rarer.

Now if your current document contains the word RAG multiple times, its TF-IDF score becomes high.

That tells the system:

This document is strongly related to RAG
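
Using the common logarithmic IDF variant (one of several possible formulas), the numbers from this example work out like this:

```python
import math

N = 1000      # total documents
df_ai = 900   # documents containing "AI"
df_rag = 20   # documents containing "RAG"

# IDF(t) = log(N / number of documents containing t)
idf_ai = math.log(N / df_ai)
idf_rag = math.log(N / df_rag)

print(f"IDF(AI):  {idf_ai:.3f}")   # near 0: too common to be informative
print(f"IDF(RAG): {idf_rag:.3f}")  # much higher: rare, so informative
```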

And that is exactly why TF-IDF became so useful in:

  • search engines
  • document ranking
  • keyword extraction
  • text clustering
  • text mining

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A journey of a thousand miles begins with a single step."
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)

terms = tfidf.get_feature_names_out()
results = {}

for idx in range(matrix.shape[0]):
    row = matrix[idx]
    non_zero_indices = row.nonzero()[1]

    word_scores = {
        terms[i]: row[0, i] for i in non_zero_indices
    }

    results[idx] = word_scores

for idx, word_dict in results.items():
    print(f"Text {idx + 1}:")
    for word, score in word_dict.items():
        print(f"{word}: {score}")
    print()

Why It Was So Powerful

For early information retrieval systems, this was a huge step forward.

Instead of treating every word equally, TF-IDF helped machines figure out:

Which words truly define the topic of a document

That was revolutionary at the time.

But It Still Has Limitations

Just like previous traditional methods, TF-IDF also has its weaknesses.

Problem 1 — No Context Understanding

This is the biggest limitation.

TF-IDF still does not understand meaning.

For example, take the word:

bank

This can mean:

  • a financial institution
  • the side of a river

Humans easily understand the difference from context.

But TF-IDF treats both exactly the same.

It only looks at frequency statistics.

It does not understand semantic meaning.

Problem 2 — Document Length Bias

There is another issue.

Longer documents naturally contain more words.

That means some words may repeat more often simply because the document is longer.

This can sometimes give artificially higher scores.

So document length can influence the result.

Advanced Word Representation

So far, we’ve explored traditional text representation methods like:

  • one-hot encoding
  • Bag of Words
  • TF-IDF

These methods were useful.

They helped machines process text.

But they all had one major limitation.

They could count words

but they could not truly understand meaning.

And this is exactly where NLP took a massive leap forward with Word2Vec.

Word2Vec

Word2Vec was one of the most revolutionary ideas in NLP.

For the first time, machines could start learning the semantic relationships between words.

Instead of representing words as isolated IDs or sparse vectors filled with zeros…

Word2Vec represents words as dense continuous vectors.

In simple words:

Every word gets its own learned numerical representation.

And these numbers are not random.

They are learned in a way that captures meaning.

Imagine Words as Points in Space

A beautiful way to understand Word2Vec is to imagine every word as a point in a huge high-dimensional space.

Words with similar meanings are placed closer together.

For example:

  • king
  • queen
  • prince
  • ruler

would all be close in this space.

But unrelated words like:

  • airplane
  • pizza

would be much farther away.

That is the entire goal of Word2Vec:

Capture meaning through vector relationships

In one simple line:

similar words should have similar vector representations

That idea changed NLP forever.

How Does It Learn?

At the beginning, word vectors may start randomly.

But during training, the neural network keeps adjusting them.

Over time, meaningful patterns begin to emerge.

And this is where the magic happens.

One of the most famous examples is:

\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}

This shows that the model is not just memorizing words.

It is learning relationships inside language.

That was revolutionary.
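
You can get a feel for this with toy vectors. The numbers below are invented for illustration (real Word2Vec vectors are learned and have 100+ dimensions), but the arithmetic is the same:

```python
import math

# Hand-made toy vectors (dims roughly: "royalty", "male", "female").
# These are invented for illustration; real Word2Vec vectors are learned.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# king - man + woman, element by element
result = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Which word is the result closest to?
best = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(best)  # queen
```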

CBOW

Word2Vec mainly has two learning methods:

  • CBOW
  • Skip-gram

In this course, let’s focus on CBOW.

CBOW stands for Continuous Bag of Words.

The idea is simple but incredibly smart.

The model tries to predict a missing word using the surrounding words.

Think of it like filling in the blank.

Let’s take this sentence:

“RAG improves language models significantly”

Now imagine we hide the center word:

“RAG ____ language models significantly”

The surrounding words become the context.

The model’s job is to predict the missing word.

In this case:

improves

That is exactly how CBOW learns.
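
A small sketch of how those (context, target) pairs are extracted, assuming a window of 2 words on each side (truncated at sentence edges):

```python
# Build (context, target) training pairs for CBOW with a window of 2.
sentence = "RAG improves language models significantly"
tokens = sentence.split()
window = 2

pairs = []
for i, target in enumerate(tokens):
    # Words before and after the target, clipped at the boundaries.
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```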

Technically, CBOW uses a feedforward neural network with a single hidden layer.

Let’s simplify what each layer does.

Input Layer

This contains the surrounding context words.

For example:

  • RAG
  • language
  • models
  • significantly

These are fed into the network.

Hidden Layer

This is the most important layer.

This layer stores the learned continuous vector representations, also known as embeddings.

This is where the meaning of words gets stored.

The weights in this layer eventually become the word vectors.

Output Layer

This layer predicts the target word.

For example:

improves

So the full flow becomes:

context → hidden representation → predicted word

Why Hidden Layer Size Matters

The size of the hidden layer determines the embedding dimension.

For example:

  • 100 dimensions
  • 300 dimensions
  • 768 dimensions

The larger the dimension, the richer the representation.

More dimensions usually allow the model to capture more nuanced meaning.

import torch
import torch.nn as nn
import torch.optim as optim

# CBOW Architecture
class CBOW(nn.Module):
    def __init__(self, vocab_len, embedding_dim):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_len, embedding_dim)
        self.output_layer = nn.Linear(embedding_dim, vocab_len)

    def forward(self, context_words):
        # context_words: 1-D tensor of context word IDs
        embeds = self.embedding_layer(context_words)  # (context_size, embedding_dim)
        combined = embeds.sum(dim=0)                  # combine context into one vector
        logits = self.output_layer(combined)          # scores over the vocabulary
        return logits


# Prepare data
window = 2
sentence = "word embeddings are awesome because embeddings capture word meaning"  # long enough for a window of 2
tokens = sentence.split()

vocab_list = list(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab_list)}

training_samples = []
for i in range(window, len(tokens) - window):
    context_ids = [
        word_to_id[w] for w in tokens[i - window:i] + tokens[i + 1:i + window + 1]
    ]
    target_id = word_to_id[tokens[i]]
    training_samples.append(
        (torch.tensor(context_ids), torch.tensor(target_id))
    )

# Hyperparameters
vocab_len = len(vocab_list)
embedding_dim = 10
lr = 0.01
num_epochs = 100

# Model setup
model = CBOW(vocab_len, embedding_dim)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=lr)

# Training
for ep in range(num_epochs):
    epoch_loss = 0.0
    for ctx, tgt in training_samples:
        optimizer.zero_grad()
        preds = model(ctx)
        loss = loss_fn(preds.unsqueeze(0), tgt.unsqueeze(0))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    print(f"Epoch {ep + 1}: Loss = {epoch_loss:.4f}")

# Retrieve embedding
query_word = "embeddings"
query_id = word_to_id[query_word]
vector = model.embedding_layer(torch.tensor([query_id]))

print(f"Vector for '{query_word}':\n", vector.detach().numpy())

Why This Changed Everything

This was a huge breakthrough.

Because now machines could move beyond counting words.

They could start learning:

  • meaning
  • similarity
  • context
  • relationships

And this idea directly laid the foundation for modern embeddings used in:

  • RAG systems
  • semantic search
  • vector databases
  • LLMs

Without Word2Vec, modern semantic search would not exist.

Code Example — Using OpenAI Embeddings

Now that we understand what embeddings are and why they are so important in RAG, let’s move from theory to actual code.

Because this is where things start getting really exciting.

So far, we’ve talked about concepts like:

  • meaning
  • semantic similarity
  • vector space
  • retrieval

Now it’s time to actually generate an embedding in Python.

And once you see how simple this is, the entire RAG pipeline will start feeling much more real.

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG helps large language models retrieve relevant information"
)

embedding_vector = response.data[0].embedding

print(embedding_vector[:10])   # print first 10 values
print(len(embedding_vector))   # print vector size

At first glance, the code looks simple.

And honestly, that’s one of the best things about modern APIs.

But let’s understand what each part is doing.

Step 1 — Create the Client

First, we create the OpenAI client.

client = OpenAI()

Think of this as opening a connection to the API.

Now our Python application can start interacting with OpenAI models.

Step 2 — Generate the Embedding

Next, we call the embeddings endpoint.

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG helps large language models retrieve relevant information"
)

Here we pass two things:

  • the embedding model
  • the input text

The recommended embedding models commonly used for retrieval tasks include:

  • text-embedding-3-small
  • text-embedding-3-large

These are widely used for:

  • semantic search
  • document retrieval
  • vector databases
  • RAG systems

Step 3 — Extract the Vector

Now the API returns a response.

Inside that response is the embedding vector.

embedding_vector = response.data[0].embedding

This is the actual numerical representation of the sentence.

It looks something like this:

[0.021, -0.184, 0.762, ...]

These numbers are floating-point values.

And together, they mathematically capture the meaning of the sentence.

\vec{e}_{\text{sentence}} = [0.021, -0.184, 0.762, \dots]

That is the embedding.

How This Connects to RAG

This is exactly how RAG systems retrieve information.

Instead of matching exact keywords, the system compares vector similarity.

So even if the words are different, similar meanings still match.

That is how the system retrieves the most relevant chunks from a vector database.

This is the heart of semantic retrieval.

And this is what makes modern AI systems feel intelligent.
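
Here is a minimal sketch of that retrieval step. The vectors below are made up and only 4-dimensional for readability; a real system would store full-size embeddings in a vector database:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up 4-d vectors standing in for real embeddings (illustration only).
chunks = {
    "RAG retrieves relevant documents":     [0.9, 0.7, 0.1, 0.2],
    "Pizza recipes from southern Italy":    [0.1, 0.2, 0.9, 0.8],
    "Vector search powers semantic lookup": [0.7, 0.9, 0.3, 0.1],
}
query_vector = [0.85, 0.75, 0.15, 0.15]  # pretend: embedding of the user query

# Rank every chunk by similarity to the query, highest first.
ranked = sorted(chunks, key=lambda c: cosine(query_vector, chunks[c]), reverse=True)
for chunk in ranked:
    print(f"{cosine(query_vector, chunks[chunk]):.3f}  {chunk}")
```

The top-ranked chunks are what gets passed to the language model as context.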

Which Embedding Model Should You Use?

Now that we understand what embeddings are, the next big question naturally becomes:

Which embedding model should you actually use?

And this is an important question.

Because today, there are many commercial and open-source embedding models available.

The model you choose can directly impact:

  • retrieval quality
  • latency
  • infrastructure cost
  • scalability
  • multilingual performance

In short:

the success of your RAG system

So let’s break this down in a simple, practical way.

Option 1 — Proprietary / Plug-and-Play Models

The easiest way to get started is with proprietary API-based models.

Popular choices include models from:

  • OpenAI
  • Cohere

For example, OpenAI provides models like:

  • text-embedding-3-small
  • text-embedding-3-large

These are essentially plug-and-play.

You send text through an API…

and instantly get embeddings back.

It’s incredibly simple.

No infrastructure.

No model hosting.

No maintenance.

This makes development extremely fast.

Perfect for quickly building production-ready RAG systems.

Why API Models Are Great

The biggest advantage is convenience.

You can focus entirely on your application logic instead of worrying about deployment.

This is ideal for:

  • fast prototyping
  • MVPs
  • startup products
  • enterprise pilots

In many real-world projects, this dramatically reduces development time.

But There’s a Trade-Off

Of course, convenience comes at a cost.

The more tokens you embed, the more you pay.

So API models are excellent for:

speed and ease

But budget becomes an important consideration at scale.

Especially when embedding millions of document chunks.

Option 2 — Open-Source Models

On the other side, we have open-source embedding models.

These are extremely popular in large-scale production RAG systems.

Some common examples include:

  • all-MiniLM-L6-v2
  • instructor-base
  • BGE-M3
  • E5

These models can be self-hosted.

That means much lower long-term cost.

Especially at scale.

FAISS and self-hosted vector pipelines often pair well with these models.

The Real Trade-Off

This becomes a classic engineering trade-off:

convenience vs control

With open-source models, you gain:

  • lower long-term cost
  • full control
  • custom deployment options
  • on-prem hosting

But now you must manage:

  • infrastructure
  • deployment
  • scaling
  • monitoring
  • latency optimization

So while cost improves, operational complexity increases.

Model Size vs Speed

This is one of the most important practical trade-offs.

Larger embedding models often produce better semantic representations.

Which usually means better retrieval quality.

But they are also:

  • slower
  • more expensive
  • heavier to store

For example:

  • 1536-dimensional vectors → richer meaning
  • 768-dimensional vectors → faster and cheaper

In many production systems, reducing vector size can cut storage cost by nearly 50% while maintaining similar retrieval quality.

That’s huge when working with millions of chunks.

Because vector size directly affects:

  • database size
  • RAM usage
  • query speed
  • infrastructure cost
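
One common way shorter vectors are produced (used, for example, by Matryoshka-style embedding models) is to truncate the vector and re-normalize it to unit length. A minimal sketch:

```python
import math

def truncate_and_renormalize(vector, target_dim):
    # Keep only the first target_dim values, then rescale to unit length.
    shortened = vector[:target_dim]
    norm = math.sqrt(sum(x * x for x in shortened))
    return [x / norm for x in shortened]

# Pretend 8-d embedding standing in for a 1536-d one (illustration only).
full = [0.4, -0.2, 0.1, 0.3, -0.1, 0.05, 0.02, -0.03]
half = truncate_and_renormalize(full, 4)

print(len(half))                            # 4
print(math.sqrt(sum(x * x for x in half)))  # ~1.0 (unit length again)
```

Halving the dimension halves storage and RAM per vector, which is where the cost savings come from.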

Multilingual Support

Another major factor is language support.

If your application works across multiple languages, this becomes critical.

Some models are specifically designed for multilingual use cases.

For example:

  • multilingual sentence transformers
  • multilingual OpenAI embeddings

These models are trained so that semantically similar sentences remain close even across different languages.

That means something like:

English query → Hindi document

can still match correctly.

For global products, this is extremely powerful.

Fine-Tuning for Domain Knowledge

Now here’s where things get even more interesting.

Embedding models can also be fine-tuned on your own domain data.

This is incredibly useful for specialized industries such as:

  • legal
  • healthcare
  • finance
  • enterprise internal knowledge

For example, if you’re building a RAG system for legal contracts, a general-purpose embedding model may not fully understand domain terminology.

Fine-tuning solves this.

This is a form of transfer learning.

The Final Decision Framework

So how do you choose?

In the real world, the decision usually comes down to four things:

  • accuracy
  • latency
  • budget
  • domain specificity

Here’s the simple framework:

If you want speed and zero maintenance

Use API models

If you want scale and cost optimization

Use open-source models

If domain relevance is critical

fine-tune your embeddings

That is the real-world decision framework used in modern RAG systems.
