Inference-Time Hyperparameters in LLMs
By Rohan Verma (@m3verma)
Imagine asking a large language model the exact same question twice — and getting two very different responses. That’s not a bug. That’s how large language models are designed to work.
Models like GPT-4 or Llama don’t generate text word by word randomly. Instead, at every step, they calculate a probability distribution over possible next tokens. What you finally see depends on how we choose from that distribution during inference.
The Hidden Knobs Behind Text Generation
Modern frameworks — such as OpenAI’s API or Hugging Face’s Transformers — expose a set of controls that influence this selection process. These controls are usually passed into a generate function or an API call.
They’re called inference-time hyperparameters, and they quietly shape the personality of the output.
Why These Settings Matter
Tweaking these parameters can completely change how a model behaves:
- Whether the answer is short or long
- Whether it sounds safe and repetitive or creative and diverse
- Whether it stays tight and coherent or explores multiple possibilities
Because of this, tuning inference-time hyperparameters isn’t optional — it’s essential if you care about the quality of generated text.
Sampling vs. Greedy Decoding: Two Ways to Pick the Next Word
When a language model is generating text, it has to make one key decision over and over again: Which token should come next?
There are two common strategies for making that choice — greedy decoding and sampling — and they behave very differently.
Greedy Decoding: Always Pick the Most Likely Token
Greedy decoding does exactly what the name suggests.
At every step, the model looks at all possible next tokens and chooses the single one with the highest probability.
- This process is deterministic
- The same prompt will always produce the same output
- The result is usually coherent and predictable
However, because it always follows the most common path, greedy decoding can sometimes:
- Become repetitive
- Get stuck in very safe or obvious continuations
Sampling: Let Probability Decide
Sampling (also called multinomial decoding) takes a different approach.
Instead of always picking the top token, it randomly selects the next token, with probabilities weighted by how likely each token is according to the model.
- This introduces randomness
- The same prompt can produce different outputs each time
- The results are often more diverse and creative
Because randomness is involved, sampling usually needs extra controls — such as temperature or top-k — to keep the output sensible.
How This Works in Practice
In Hugging Face’s Transformers library, this behavior is controlled by a simple flag:
- do_sample = False → Greedy decoding
- do_sample = True → Sampling
If num_beams = 1 (meaning no beam search):
- do_sample = False gives you pure greedy decoding
- do_sample = True switches to stochastic sampling
If beam search is enabled (num_beams > 1) and sampling is turned on, the model can even sample among beams, combining both strategies.
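To make this concrete, here is a minimal sketch using Hugging Face’s generate API; “gpt2” is only a stand-in model and the prompt is arbitrary.

```python
# A minimal sketch with Hugging Face Transformers; "gpt2" is only a stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The meaning of life is", return_tensors="pt")

# Greedy decoding: deterministic, always the single most likely next token.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=30)

# Multinomial sampling: stochastic, so repeated runs can differ.
sampled = model.generate(**inputs, do_sample=True, max_new_tokens=30)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

Running the sampling call several times should give different continuations, while the greedy call always produces the same text.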
When to Use Which?
In practice, the choice depends on the task:
- Low-complexity or factual tasks (like filling in a well-known fact)
→ Greedy decoding works well because precision matters.
- Creative tasks (story generation, brainstorming, open-ended writing)
→ Sampling is preferred because it allows exploration and variety.
Both methods are useful — they just serve different goals in text generation.
Temperature: The Creativity Dial of Text Generation
When a language model is deciding what to say next, it doesn’t just need to know which tokens are possible — it also needs to know how adventurous it’s allowed to be. That’s exactly what temperature controls.
What Temperature Actually Does
The temperature parameter (usually a number greater than or equal to 0) reshapes the model’s output probabilities.
Behind the scenes, the model:
- Computes logits for all possible next tokens
- Divides the logits by the temperature value
- Applies softmax to turn the scaled logits into probabilities
A few key points:
- Temperature = 1.0 → uses the original probability distribution
- Temperature < 1.0 → sharpens the distribution, favoring top tokens
- Temperature > 1.0 → flattens the distribution, giving rare tokens more chances
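To see that reshaping in numbers, here is a minimal NumPy sketch over a toy four-token vocabulary (the logit values are made up for illustration):

```python
# Toy illustration of temperature scaling: divide logits by T, then softmax.
import numpy as np

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])   # made-up scores for four candidate tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(temperature, np.round(probs, 3))
# T < 1 sharpens the distribution toward the top token; T > 1 flattens it.
```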
How Temperature Changes the Output
Changing temperature directly changes how random the output feels:
- Lower temperature (< 1.0) → Randomness goes down.
The most likely tokens dominate, making outputs more focused and predictable.
- Higher temperature (> 1.0) → Randomness goes up.
Even low-probability tokens can be selected, increasing surprise and variation.
At the extremes:
- Temperature = 0 → the model always picks the single most likely token (essentially greedy decoding)
- Very high temperature (2.0–5.0) → outputs can become chaotic and hard to make sense of
Common Temperature Ranges in Practice
Different tasks call for different settings:
- 0.2–0.5 → Focused, factual, and precise replies
- 0.7–1.0 → A balanced mix of coherence and creativity
- Above 1.0 → Highly creative exploration, but with more randomness
A Simple Way to Think About It
Temperature is best understood as a creativity dial:
- At 0, the model behaves like a strict teacher, insisting on the most obvious answer.
- As you turn it up, it starts acting more like a playful author, willing to experiment and free-style.
As described in an OpenAI's guide:
“A lower temperature … makes those tokens with the highest probability more likely to be selected; a higher temperature increases a model’s likelihood of selecting less probable tokens.”
Turn the dial carefully — it decides whether your model sounds precise, balanced, or wildly imaginative.
Top-k Sampling: Putting a Shortlist on Possibilities
When a language model is choosing the next token, it technically considers every word in its entire vocabulary. That’s a huge space — and most of those options are very unlikely.
Top-k sampling is a way to narrow things down.
What Top-k Sampling Does
With top-k sampling, the model:
- Looks at the probability distribution over all possible next tokens
- Keeps only the top k most probable tokens
- Sets the probabilities of all other tokens to zero
- Renormalizes the remaining probabilities
- Randomly samples the next token from this smaller set
In short, the model only chooses from a shortlist of the k most likely options.
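Here is a minimal NumPy sketch of that filtering step on made-up logits (real implementations do the same thing on tensors over the full vocabulary):

```python
# Toy top-k sampling: mask everything outside the k highest logits, renormalize, sample.
import numpy as np

def top_k_sample(logits, k, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float)
    cutoff = np.sort(logits)[-k]                       # k-th largest logit
    masked = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(masked - masked.max())              # softmax over the shortlist
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

next_token = top_k_sample([4.0, 2.0, 1.0, 0.5, -1.0], k=2)  # only the top 2 can be picked
```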
How the Value of k Changes the Output
The size of k directly controls how restricted or flexible the model is:
- Small k (e.g. 1 or 5)
Only the most likely tokens can ever be chosen.
This makes outputs more focused and conservative.
- Large k (e.g. 50 or 100)
More tokens are allowed into the pool.
This increases diversity and creativity, but also the risk of strange or nonsensical text if k is too large.
An important edge case:
- k = 1 → top-k sampling collapses into greedy decoding, since there’s only one possible choice.
Many frameworks use k = 50 by default, as it strikes a balance between coherence and variety.
When to Use Different k Values
- Lower k (around 10–20)
Useful when you want safer, high-probability continuations.
- Higher k (50 or more)
Helpful for creative tasks, or when the probability distribution is already sharp and you just want a bit of variety.
Why This Helps
Top-k sampling keeps randomness under control.
As described in an OpenAI's example:
“top_k … limits the number of potential tokens from which the model selects. … While random sampling helps produce more varied outputs, this parameter helps maintain quality by excluding the more unlikely tokens.”
Think of top-k as telling the model:
“Be creative — but only among the most reasonable options.”
Top-p (Nucleus) Sampling: Let Probability Decide the Shortlist
Top-p sampling, also called nucleus sampling, is a more flexible cousin of top-k sampling.
Instead of asking the model to consider a fixed number of tokens, it asks a different question:
“How many tokens do we need to cover most of the probability?”
What Top-p Sampling Does
With top-p sampling, you choose a probability threshold p, where 0 < p ≤ 1.
Here’s how the model uses it:
- It sorts all possible next tokens by probability
- It takes the smallest set of tokens whose cumulative probability is at least p
- Only these tokens are kept
- The model samples the next token from this reduced set
For example:
- p = 0.9 means
“Sample from the smallest group of tokens that together account for 90% of the probability mass.”
Everything outside that group is ignored.
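Here is a minimal NumPy sketch of that selection step, using made-up token probabilities:

```python
# Toy nucleus (top-p) sampling: keep the smallest set of tokens covering probability p.
import numpy as np

def top_p_sample(probs, p, rng=np.random.default_rng()):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # smallest prefix reaching p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalize
    return rng.choice(nucleus, p=nucleus_probs)

next_token = top_p_sample([0.5, 0.3, 0.1, 0.05, 0.05], p=0.9)  # samples among tokens 0, 1, 2
```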
How Top-p Changes the Output
Top-p automatically adapts to how confident the model is:
- Sharp distribution (one token is clearly best)
→ Only one or a few tokens are kept
- Flat distribution (many tokens are similarly likely)
→ Many tokens are included
This makes top-p more adaptive than top-k. It naturally limits very unlikely tokens without forcing an arbitrary cutoff.
As you change p:
- Lower p (e.g. 0.8) → More deterministic, focused generation
- p close to 1.0 → Almost all tokens are included, allowing maximum creativity
Common Top-p Settings
- 0.9–0.95 → Common default values
- p = 1.0 → No filtering at all (full distribution)
General behavior:
- Small p (0.5–0.8) → Very safe, focused output
- Large p (~1.0) → Very creative, but higher risk of hallucination
When and Why to Use Top-p
Nucleus sampling is often recommended in research papers as a safe default for creative text generation.
It avoids keeping tokens that contribute very little probability mass — the long tail of unlikely words.
As described in an OpenAI's explanation:
“allows the model to consider tokens whose cumulative probability is greater than a specified probability. ... the model only selects a group of tokens whose total probability is more than, for instance, 95%. While random sampling enables more dynamic output, top-p ensures some coherence”
So when you set top_p = 0.9, you’re effectively saying: “Let the model be creative — but ignore the least likely 10% of options.”
Repetition Penalty: Teaching the Model Not to Repeat Itself
Sometimes, when a language model is generating text, it gets a bit stuck.
You might see the same word, phrase, or pattern show up again and again —
“the the the…” or long, looping responses.
That’s where the repetition penalty comes in.
What the Repetition Penalty Does
The repetition penalty (often called repetition_penalty in libraries) is designed to reduce the model’s tendency to reuse tokens it has already generated.
During inference:
- Each time a new token is produced
- Tokens that have appeared before can have their logits re-weighted
- This makes previously used tokens less attractive choices going forward
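As a rough illustration, here is a sketch of the CTRL-style re-weighting used by libraries such as Hugging Face Transformers for repetition_penalty: positive logits of already-seen tokens are divided by the penalty, and negative ones are multiplied by it.

```python
# Toy repetition penalty: make every previously generated token look less attractive.
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    logits = np.asarray(logits, dtype=float).copy()
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # shrink a positive score
        else:
            logits[token_id] *= penalty   # push a negative score further down
    return logits

new_logits = apply_repetition_penalty([2.0, -0.5, 1.0, 0.3], generated_ids=[0, 2], penalty=1.3)
```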
How the Value Changes Behavior
The effect depends on the value you choose:
- 1.0 → No penalty at all
The model can repeat tokens freely.
- Greater than 1.0 → Repetition is discouraged
For example, if the word “dog” has already appeared, its score is reduced the next time it’s considered.
As the value increases:
- 1.1–1.5 → Mild suppression of repeats
- Very high values → Strongly discourage any repetition, helping prevent loops or overly verbose outputs
When to Use a Repetition Penalty
This setting is especially useful when:
- The model starts looping
- You see obvious repetition like “the the the…”
- You want more variety in story generation or code generation
It’s a simple way to nudge the model toward saying something new instead of repeating itself.
A Helpful Way to Think About It
Think of the repetition penalty as gently telling the model:
“Don’t say the same thing over and over.”
Unlike strict rules that completely ban repeated phrases, the repetition penalty is a soft, proportional discouragement — it reduces the chances of repeats without forbidding them entirely.
As the Hugging Face documentation notes, a value of 1.0 means no penalty.
A small increase is often all it takes to make the output feel more natural.
Frequency Penalty vs. Presence Penalty: Two Ways to Reduce Repetition
Many chat and completion APIs — such as those from OpenAI — provide two closely related controls to manage repetition: presence penalty and frequency penalty.
They sound similar, but they influence repetition in slightly different ways.
Presence Penalty: Encouraging New Topics
The presence penalty looks at a simple question: Has this token appeared before or not?
- Once a word has appeared even once, its logit receives a flat reduction every time it’s considered again.
- This makes the model less likely to bring up the same concept repeatedly.
- The goal is to encourage the model to introduce new ideas or topics.
With a positive presence penalty:
- The model avoids repeating concepts or words already mentioned
- The output tends to branch into new directions
Negative values would do the opposite and encourage repetition, but this is rarely used.
As described in the documentation, a positive presence penalty
“increases the model’s likelihood to talk about new topics.”
Frequency Penalty: Controlling How Often Words Repeat
The frequency penalty is more fine-grained.
Instead of just asking whether a token appeared before, it asks: How many times has this token already appeared?
- A word used three times is penalized more than a word used once
- This helps prevent overusing common words or names
- It encourages a richer and more varied vocabulary
As one explanation puts it:
“frequency_penalty is bigger if the token has appeared multiple times.”
Higher frequency penalties lead to outputs that naturally vary their wording and phrasing.
How the Values Work
Both penalties usually range from –2.0 to 2.0:
Positive values → discourage repetition
- Presence penalty → pushes toward new topics
- Frequency penalty → pushes toward varied wording
Negative values → encourage repetition (rarely useful)
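OpenAI’s parameter documentation describes the adjustment roughly as subtracting count × frequency_penalty, plus a flat presence_penalty for any token that has appeared at all. A toy Python sketch of that idea (the logits and token IDs are made up):

```python
# Toy version of the presence/frequency adjustment applied to next-token logits.
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.5, presence_penalty=0.5):
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty   # grows with each repeat
        adjusted[token_id] -= presence_penalty            # flat hit for appearing at all
    return adjusted

new_logits = apply_penalties([2.0, 1.0, 0.5, 0.1], generated_ids=[0, 0, 2])
```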
Typical Practical Settings
In real usage:
- presence_penalty ≈ 0.5–1.0
→ helps avoid repeating the same idea again and again
- frequency_penalty ≈ 0.5
→ helps diversify word choice
For very technical outputs, where repeating key terms is important, both penalties are often left at 0.
A Simple Way to Remember the Difference
- Presence penalty says: “You’ve already mentioned this — maybe talk about something new.”
- Frequency penalty says: “You’re using this word a lot — try different wording.”
Together, they give you subtle control over what gets repeated and how often.
Max Tokens / Max Length: Putting a Hard Stop on Output Size
When a language model starts generating text, it doesn’t automatically know when to stop.
That’s why we need a clear rule that says: “This is long enough.”
That rule is set using max tokens (also called max length).
What This Parameter Does
The max tokens / max length setting defines an upper limit on how many tokens the model is allowed to generate after the prompt.
Different frameworks name it slightly differently:
- In Hugging Face Transformers, you’ll usually see max_new_tokens (recommended) or max_length
- In the OpenAI API, it’s max_tokens or max_completion_tokens
Regardless of the name, the idea is the same:
once the model reaches this token limit, generation stops.
How It Affects the Output
This parameter acts purely as a stop condition.
- It does not control creativity, randomness, or coherence
- It simply limits how long the response can be
That said, the chosen value still matters:
- Too small → the answer may be cut off mid-sentence
- Very large → responses can become overly verbose and more expensive to generate
Common Use Cases
You typically set max tokens when you want to control:
- Output length
- API cost
- Context window usage
Examples:
- ~100 tokens → short, direct answers
- 500+ tokens → longer explanations or stories
If you also need to enforce a minimum length, some frameworks provide:
min_length and min_new_tokens
An Important Hugging Face Detail
In Hugging Face Transformers, there’s a subtle but important difference:
- max_length → counts prompt + generated tokens
- max_new_tokens → counts only the tokens generated after the prompt
Because of this, max_new_tokens is often recommended — it makes the limit clearer and avoids accidentally shortening the output because of a long prompt.
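A minimal sketch of the difference, again using “gpt2” only as a stand-in model:

```python
# max_length counts prompt + new tokens; max_new_tokens counts only new tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Explain inference-time hyperparameters:", return_tensors="pt")

out_total = model.generate(**inputs, max_length=60)     # a long prompt eats into this budget
out_new = model.generate(**inputs, max_new_tokens=60)   # prompt length no longer matters
```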
The Big Picture
Think of max tokens as a hard ceiling, not a style control.
It doesn’t change how the model writes —
it just decides when the model must stop writing.
Beam Search and Beam Width (num_beams): Exploring Multiple Paths at Once
When a language model generates text, it doesn’t have to commit to just one possible sentence as it goes along.
Beam search is a decoding strategy that lets the model keep several promising options alive at the same time.
What Beam Search Does
Instead of following only the single best token at each step (greedy decoding), beam search keeps track of the top k partial sequences, called beams.
The value of num_beams sets this number k, also known as the beam width.
Here’s how it works step by step:
- Start with k partial sequences
- At each generation step, extend each beam with all possible next tokens
- Score all these extended sequences (usually using log-probabilities)
- Keep only the top k sequences and discard the rest
- Repeat until the sequences are complete
If:
- num_beams = 1 → beam search is effectively off (pure greedy decoding)
- num_beams > 1 → the model actively explores multiple paths
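A minimal sketch of turning beam search on with Hugging Face’s generate (“gpt2” is a stand-in model; the prompt and values are arbitrary):

```python
# Beam search: keep the num_beams best partial sequences at every step.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The capital of France is", return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,              # beam width k
    do_sample=False,          # deterministic beam search
    num_return_sequences=3,   # return the 3 best completed beams (must be <= num_beams)
    max_new_tokens=20,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```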
How Beam Width Affects Output
Beam search usually produces high-probability, coherent sequences, because it compares multiple possibilities before committing.
Key effects:
- More polished and sensible outputs
- Less randomness than sampling
- Deterministic behavior when do_sample = False
- Slower generation as the number of beams increases
Using more beams generally improves — or at least doesn’t hurt — the quality of the best sequence, but it comes with k-fold computational cost. Too many beams can also lead to generic or repetitive answers if not balanced with other penalties.
Choosing the Beam Width
- Narrow beams (2–4)
Faster, but may miss the best overall sequence
- Wider beams (5–10)
More thorough search, better chances of finding the optimal path, but slower
Returning Multiple Sequences
With num_return_sequences > 1, beam search can return the top N completed beams (where N ≤ k).
This gives you:
- Multiple candidate outputs
- All sorted by their final scores
Early Stopping
In practice, beam search is often combined with early_stopping = True.
This allows generation to stop once all beams have reached an end-of-sequence (EOS) token, instead of forcing shorter completed sequences to keep growing.
When Beam Search Works Best
Beam search is commonly used in tasks where the most likely complete sentence matters, such as:
- Machine translation
- Summarization
It’s less common for open-ended creative tasks, where randomness is usually preferred. If beams are used for open-ended generation, they’re often combined with:
- A length penalty
- do_sample = True (multinomial beam search)
This helps preserve diversity while still benefiting from structured search.
A Clear Summary
As described in the Machine Learning Mastery guide:
“It keeps only k best sequences at each step. Each step will expand this set temporarily and prune it back to k best sequences…”
Beam search is essentially about looking ahead down multiple paths — and choosing the best one with confidence.
Early Stopping: Knowing When to Stop Generating
When a model is using beam search, it’s exploring multiple possible sentences at the same time.
But at some point, those sentences are good enough — and continuing to generate more tokens doesn’t really help.
That’s where early stopping comes in.
What Early Stopping Does
Early stopping controls when text generation should terminate once complete sequences are found.
In many libraries, setting:
early_stopping = True means
“Stop as soon as all num_beams beams have produced an end-of-sequence token.”
If early stopping is turned off:
- Some beams may keep generating tokens
- Generation continues until a length limit or another stopping condition is reached
Some frameworks even support:
early_stopping = "never"
which forces a more exhaustive, canonical beam search.
How It Affects Generation
The biggest impact of early stopping is on speed:
Early stopping ON (True)
- Generation stops as soon as every beam has a complete sentence
- Faster and more efficient
Early stopping OFF (False)
- Beam search may continue if there’s still a chance to improve scores
- Slower, with potentially longer sequences
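In Hugging Face Transformers, all three modes (True, False, and "never") can be set in a GenerationConfig; a minimal sketch:

```python
# The early_stopping modes of beam-based generation in Hugging Face Transformers.
from transformers import GenerationConfig

config = GenerationConfig(
    num_beams=4,
    max_new_tokens=50,
    early_stopping=True,      # stop as soon as all 4 beams have finished
    # early_stopping=False,   # default: stop when better beams are unlikely (a heuristic)
    # early_stopping="never", # canonical beam search: only stop when no better beam is possible
)
# Then pass it along: model.generate(**inputs, generation_config=config)
```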
For deterministic greedy decoding (where there are no beams), this setting has no effect.
As noted in the Hugging Face documentation:
“early_stopping=True stops as soon as there are num_beams complete candidates.”
When to Use Early Stopping
In most cases:
- Set early_stopping = True
You get faster generation and don’t over-generate unnecessary tokens.
Only consider turning it off — or setting it to "never" — if you have a specific need to analyze incomplete sequences, which is rare.
The Simple Intuition
Early stopping is like telling the model:
“Once every path has reached a full sentence, we’re done.”
It keeps beam search efficient without changing what the model says — only when it stops.
Length Penalty: Balancing Short and Long Answers in Beam Search
When beam search is scoring different candidate sentences, there’s a subtle bias at play.
Shorter sequences often look better simply because they have fewer tokens — and fewer tokens means fewer chances to lower the overall score.
The length penalty exists to correct for this.
What the Length Penalty Does
During beam search, each candidate sequence is scored using its log-likelihood.
The length penalty modifies this score by adjusting it based on the sequence length.
Technically:
- The sequence length is raised to a certain power
- This adjusted length is used when scoring the beam
- The parameter controlling this is usually called length_penalty
How the Value Changes Output Length
The value you choose directly affects how long the generated text tends to be:
- length_penalty = 1.0
No adjustment. This is the default in many libraries.
- Greater than 1.0
Encourages longer sequences.
For example, a value of 1.2 boosts longer candidates by raising their length to the power of 1.2 before dividing the log-probability.
- Less than 1.0
Encourages shorter sequences.
A value like 0.8 penalizes longer outputs.
As described in the Hugging Face documentation:
“length_penalty > 1.0 promotes longer sequences, < 1.0 encourages shorter.”
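A toy sketch of how that scoring plays out, assuming the common formulation where a beam’s summed log-probability is divided by its length raised to the power length_penalty:

```python
# Toy beam scoring: log-probabilities are negative, so a larger denominator
# makes long sequences look relatively better.
def beam_score(sum_log_prob, length, length_penalty=1.0):
    return sum_log_prob / (length ** length_penalty)

print(beam_score(-6.0, length=5))                          # -1.2
print(beam_score(-12.0, length=12))                        # -1.0 (longer beam already wins)
print(beam_score(-12.0, length=12, length_penalty=1.2))    # about -0.61 (wins by more)
```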
When to Use a Length Penalty
Length penalty is useful when:
- Beam search outputs end too quickly
- Answers feel cut short or incomplete
By slightly increasing the penalty, you can encourage beams to continue generating more complete responses.
It’s also helpful in tasks like translation, where sentences are expected to have a reasonable length and shouldn’t collapse into trivial, overly short outputs.
The Intuition
Think of length penalty as a way of telling the model:
“Don’t favor short answers just because they’re short.”
It helps beam search strike a better balance between probability and completeness.
Logit Bias / Token Bias: Nudging the Model’s Word Choices
Sometimes, you don’t just want to guide a model — you want to directly influence whether specific words appear or not.
That’s exactly what logit bias (also called token bias) is for.
What Logit Bias Does
Logit bias lets you manually adjust the probability of specific tokens before the model samples the next word.
In the OpenAI API, this is done using a parameter called logit_bias:
- It’s a dictionary that maps token IDs to a bias value
- Bias values typically range from –100 to 100
How those values work:
- Positive bias → increases the token’s logit, making it more likely
- Negative bias → decreases the token’s logit, making it less likely
How Strong the Bias Can Be
The size of the bias determines how forceful the control is:
- –100 → effectively bans the token (its probability becomes nearly zero)
- +100 → effectively forces the token (it becomes almost guaranteed to appear)
- Small values (–1 to +1) → gentle nudges that slightly decrease or increase how often a token is chosen
For example:
- To discourage the word “apple”, you could apply a negative bias to its token
- To ensure the model mentions “dog”, you could apply a positive bias to that token
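A hedged sketch with the OpenAI Python client; the model name is only an example, and the token IDs below are hypothetical placeholders (real IDs come from the model’s tokenizer, e.g. via tiktoken):

```python
# Sketch: ban one token and nudge another using logit_bias in a chat completion.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                  # example model name
    messages=[{"role": "user", "content": "Name a pet in one word."}],
    logit_bias={
        "11111": -100,   # hypothetical token ID for "apple": effectively banned
        "22222": 5,      # hypothetical token ID for "dog": nudged upward
    },
)
print(response.choices[0].message.content)
```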
As the documentation explains, logit bias allows you to
“modify the likelihood of specified tokens,”
with values near ±100 acting as hard bans or requirements.
When Logit Bias Is Used
Common use cases include:
- Preventing profanity or unwanted terms by applying negative biases
- Reducing hallucinations by discouraging certain tokens
- Ensuring specific terminology appears when it’s required
A Word of Caution
Logit bias is a powerful, low-level control.
Because it directly alters token probabilities, it can easily distort the model’s natural behavior if overused.
Think of it as a precision tool — best used sparingly and intentionally.
The Intuition
Logit bias is like telling the model:
- “Avoid this word.”
- “Try really hard to say this word.”
You’re not changing how the model thinks —
you’re just tilting the odds for specific tokens.
Conclusion: From Guessing to Control
So now you know the truth.
Large Language Models aren’t unpredictable black boxes.
They don’t hallucinate randomly.
They respond exactly to the controls you give them.
These aren’t just hyperparameters.
They’re levers of control.
And once you understand them…
You stop fighting the model.
You stop blaming the prompt.
And you start engineering behavior.
This is the difference between:
- Using an LLM
- And controlling an LLM
Between hoping for a good answer…
and designing one.