Inference-Time Hyperparameters in LLMs
By Rohan Verma (@m3verma)
Imagine asking a large language model the exact same question twice — and getting two very different responses. That’s not a bug. That’s how large language models are designed to work.
Models like GPT-4 or Llama don’t generate text word by word randomly. Instead, at every step, they calculate a probability distribution over possible next tokens. What you finally see depends on how we choose from that distribution during inference.
The Hidden Knobs Behind Text Generation
Modern frameworks — such as OpenAI’s API or Hugging Face’s Transformers — expose a set of controls that influence this selection process. These controls are usually passed into a generate function or an API call.
They’re called inference-time hyperparameters, and they quietly shape the personality of the output.
Why These Settings Matter
Tweaking these parameters can completely change how a model behaves:
- Whether the answer is short or long
- Whether it sounds safe and repetitive or creative and diverse
- Whether it stays tight and coherent or explores multiple possibilities
Because of this, tuning inference-time hyperparameters isn’t optional — it’s essential if you care about the quality of generated text.
Sampling vs. Greedy Decoding: Two Ways to Pick the Next Word
When a language model is generating text, it has to make one key decision over and over again: Which token should come next?
There are two common strategies for making that choice — greedy decoding and sampling — and they behave very differently.
Greedy Decoding: Always Pick the Most Likely Token
Greedy decoding does exactly what the name suggests.
At every step, the model looks at all possible next tokens and chooses the single one with the highest probability.
- This process is deterministic
- The same prompt will always produce the same output
- The result is usually coherent and predictable
However, because it always follows the most common path, greedy decoding can sometimes:
- Become repetitive
- Get stuck in very safe or obvious continuations
Sampling: Let Probability Decide
Sampling (also called multinomial decoding) takes a different approach.
Instead of always picking the top token, it randomly selects the next token, with probabilities weighted by how likely each token is according to the model.
- This introduces randomness
- The same prompt can produce different outputs each time
- The results are often more diverse and creative
Because randomness is involved, sampling usually needs extra controls — such as temperature or top-k — to keep the output sensible.
How This Works in Practice
In Hugging Face’s Transformers library, this behavior is controlled by a simple flag:
- do_sample = False → Greedy decoding
- do_sample = True → Sampling
If num_beams = 1 (meaning no beam search):
- do_sample = False gives you pure greedy decoding
- do_sample = True switches to stochastic sampling
If beam search is enabled (num_beams > 1) and sampling is turned on, the model can even sample among beams, combining both strategies.
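To make this concrete, here is a minimal sketch using Hugging Face’s generate API; “gpt2” is only a stand-in model and the prompt is arbitrary.

```python
# A minimal sketch with Hugging Face Transformers; "gpt2" is only a stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The meaning of life is", return_tensors="pt")

# Greedy decoding: deterministic, always the single most likely next token.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=30)

# Multinomial sampling: stochastic, so repeated runs can differ.
sampled = model.generate(**inputs, do_sample=True, max_new_tokens=30)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```

Running the sampling call several times should give different continuations, while the greedy call always produces the same text.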
When to Use Which?
In practice, the choice depends on the task:
- Low-complexity or factual tasks (like filling in a well-known fact)
→ Greedy decoding works well because precision matters.
- Creative tasks (story generation, brainstorming, open-ended writing)
→ Sampling is preferred because it allows exploration and variety.
Both methods are useful — they just serve different goals in text generation.
Temperature: The Creativity Dial of Text Generation
When a language model is deciding what to say next, it doesn’t just need to know which tokens are possible — it also needs to know how adventurous it’s allowed to be. That’s exactly what temperature controls.
What Temperature Actually Does
The temperature parameter (usually a number greater than or equal to 0) reshapes the model’s output probabilities.
Behind the scenes, the model:
- Computes logits for all possible next tokens
- Divides the logits by the temperature value
- Applies softmax to turn the scaled logits into probabilities
A few key points:
- Temperature = 1.0 → uses the original probability distribution
- Temperature < 1.0 → sharpens the distribution, favoring top tokens
- Temperature > 1.0 → flattens the distribution, giving rare tokens more chances
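To see that reshaping in numbers, here is a minimal NumPy sketch over a toy four-token vocabulary (the logit values are made up for illustration):

```python
# Toy illustration of temperature scaling: divide logits by T, then softmax.
import numpy as np

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])   # made-up scores for four candidate tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(temperature, np.round(probs, 3))
# T < 1 sharpens the distribution toward the top token; T > 1 flattens it.
```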
How Temperature Changes the Output
Changing temperature directly changes how random the output feels:
- Lower temperature (< 1.0) → Randomness goes down.
The most likely tokens dominate, making outputs more focused and predictable.
- Higher temperature (> 1.0) → Randomness goes up.
Even low-probability tokens can be selected, increasing surprise and variation.
At the extremes:
- Temperature = 0 → the model always picks the single most likely token (essentially greedy decoding)
- Very high temperature (2.0–5.0) → outputs can become chaotic and hard to make sense of
Common Temperature Ranges in Practice
Different tasks call for different settings:
- 0.2–0.5 → Focused, factual, and precise replies
- 0.7–1.0 → A balanced mix of coherence and creativity
- Above 1.0 → Highly creative exploration, but with more randomness
A Simple Way to Think About It
Temperature is best understood as a creativity dial:
- At 0, the model behaves like a strict teacher, insisting on the most obvious answer.
- As you turn it up, it starts acting more like a playful author, willing to experiment and free-style.
As described in an OpenAI's guide:
“A lower temperature … makes those tokens with the highest probability more likely to be selected; a higher temperature increases a model’s likelihood of selecting less probable tokens.”
Turn the dial carefully — it decides whether your model sounds precise, balanced, or wildly imaginative.
Top-k Sampling: Putting a Shortlist on Possibilities
When a language model is choosing the next token, it technically considers every word in its entire vocabulary. That’s a huge space — and most of those options are very unlikely.
Top-k sampling is a way to narrow things down.
What Top-k Sampling Does
With top-k sampling, the model:
- Looks at the probability distribution over all possible next tokens
- Keeps only the top k most probable tokens
- Sets the probabilities of all other tokens to zero
- Renormalizes the remaining probabilities
- Randomly samples the next token from this smaller set
In short, the model only chooses from a shortlist of the k most likely options.
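Here is a minimal NumPy sketch of that filtering step on made-up logits (real implementations do the same thing on tensors over the full vocabulary):

```python
# Toy top-k sampling: mask everything outside the k highest logits, renormalize, sample.
import numpy as np

def top_k_sample(logits, k, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float)
    cutoff = np.sort(logits)[-k]                       # k-th largest logit
    masked = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(masked - masked.max())              # softmax over the shortlist
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

next_token = top_k_sample([4.0, 2.0, 1.0, 0.5, -1.0], k=2)  # only the top 2 can be picked
```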
How the Value of k Changes the Output
The size of k directly controls how restricted or flexible the model is:
- Small k (e.g. 1 or 5)
Only the most likely tokens can ever be chosen.
This makes outputs more focused and conservative.
- Large k (e.g. 50 or 100)
More tokens are allowed into the pool.
This increases diversity and creativity, but also the risk of strange or nonsensical text if k is too large.
An important edge case:
- k = 1 → top-k sampling collapses into greedy decoding, since there’s only one possible choice.
Many frameworks use k = 50 by default, as it strikes a balance between coherence and variety.
When to Use Different k Values
- Lower k (around 10–20)
Useful when you want safer, high-probability continuations.
- Higher k (50 or more)
Helpful for creative tasks, or when the probability distribution is already sharp and you just want a bit of variety.
Why This Helps
Top-k sampling keeps randomness under control.
As described in an OpenAI's example:
“top_k … limits the number of potential tokens from which the model selects. … While random sampling helps produce more varied outputs, this parameter helps maintain quality by excluding the more unlikely tokens.”
Think of top-k as telling the model:
“Be creative — but only among the most reasonable options.”
Top-p (Nucleus) Sampling: Let Probability Decide the Shortlist
Top-p sampling, also called nucleus sampling, is a more flexible cousin of top-k sampling.
Instead of asking the model to consider a fixed number of tokens, it asks a different question:
“How many tokens do we need to cover most of the probability?”
What Top-p Sampling Does
With top-p sampling, you choose a probability threshold p, where 0 < p ≤ 1.
Here’s how the model uses it:
- It sorts all possible next tokens by probability
- It takes the smallest set of tokens whose cumulative probability is at least p
- Only these tokens are kept
- The model samples the next token from this reduced set
For example:
- p = 0.9 means
“Sample from the smallest group of tokens that together account for 90% of the probability mass.”
Everything outside that group is ignored.
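Here is a minimal NumPy sketch of that selection step, using made-up token probabilities:

```python
# Toy nucleus (top-p) sampling: keep the smallest set of tokens covering probability p.
import numpy as np

def top_p_sample(probs, p, rng=np.random.default_rng()):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # smallest prefix reaching p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalize
    return rng.choice(nucleus, p=nucleus_probs)

next_token = top_p_sample([0.5, 0.3, 0.1, 0.05, 0.05], p=0.9)  # samples among tokens 0, 1, 2
```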
How Top-p Changes the Output
Top-p automatically adapts to how confident the model is:
- Sharp distribution (one token is clearly best)
→ Only one or a few tokens are kept
- Flat distribution (many tokens are similarly likely)
→ Many tokens are included
This makes top-p more adaptive than top-k. It naturally limits very unlikely tokens without forcing an arbitrary cutoff.
As you change p:
- Lower p (e.g. 0.8) → More deterministic, focused generation
- p close to 1.0 → Almost all tokens are included, allowing maximum creativity
Common Top-p Settings
- 0.9–0.95 → Common default values
- p = 1.0 → No filtering at all (full distribution)
General behavior:
- Small p (0.5–0.8) → Very safe, focused output
- Large p (~1.0) → Very creative, but higher risk of hallucination
When and Why to Use Top-p
Nucleus sampling is often recommended in research papers as a safe default for creative text generation.
It avoids keeping tokens that contribute very little probability mass — the long tail of unlikely words.
As described in an OpenAI's explanation:
“allows the model to consider tokens whose cumulative probability is greater than a specified probability. ... the model only selects a group of tokens whose total probability is more than, for instance, 95%. While random sampling enables more dynamic output, top-p ensures some coherence”
So when you set top_p = 0.9, you’re effectively saying: “Let the model be creative — but ignore the least likely 10% of options.”
Repetition Penalty: Teaching the Model Not to Repeat Itself
Sometimes, when a language model is generating text, it gets a bit stuck.
You might see the same word, phrase, or pattern show up again and again —
“the the the…” or long, looping responses.
That’s where the repetition penalty comes in.
What the Repetition Penalty Does
The repetition penalty (often called repetition_penalty in libraries) is designed to reduce the model’s tendency to reuse tokens it has already generated.
During inference:
- Each time a new token is produced
- Tokens that have appeared before can have their logits re-weighted
- This makes previously used tokens less attractive choices going forward
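As a rough illustration, here is a sketch of the CTRL-style re-weighting used by libraries such as Hugging Face Transformers for repetition_penalty: positive logits of already-seen tokens are divided by the penalty, and negative ones are multiplied by it.

```python
# Toy repetition penalty: make every previously generated token look less attractive.
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    logits = np.asarray(logits, dtype=float).copy()
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # shrink a positive score
        else:
            logits[token_id] *= penalty   # push a negative score further down
    return logits

new_logits = apply_repetition_penalty([2.0, -0.5, 1.0, 0.3], generated_ids=[0, 2], penalty=1.3)
```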
How the Value Changes Behavior
The effect depends on the value you choose:
- 1.0 → No penalty at all
The model can repeat tokens freely.
- Greater than 1.0 → Repetition is discouraged
For example, if the word “dog” has already appeared, its score is reduced the next time it’s considered.
As the value increases:
- 1.1–1.5 → Mild suppression of repeats
- Very high values → Strongly discourage any repetition, helping prevent loops or overly verbose outputs
When to Use a Repetition Penalty
This setting is especially useful when:
- The model starts looping
- You see obvious repetition like “the the the…”
- You want more variety in story generation or code generation
It’s a simple way to nudge the model toward saying something new instead of repeating itself.
A Helpful Way to Think About It
Think of the repetition penalty as gently telling the model:
“Don’t say the same thing over and over.”
Unlike strict rules that completely ban repeated phrases, the repetition penalty is a soft, proportional discouragement — it reduces the chances of repeats without forbidding them entirely.
As the Hugging Face documentation notes, a value of 1.0 means no penalty.
A small increase is often all it takes to make the output feel more natural.
Frequency Penalty vs. Presence Penalty: Two Ways to Reduce Repetition
Many chat and completion APIs — such as those from OpenAI — provide two closely related controls to manage repetition: presence penalty and frequency penalty.
They sound similar, but they influence repetition in slightly different ways.
Presence Penalty: Encouraging New Topics
The presence penalty looks at a simple question: Has this token appeared before or not?
- Once a word has appeared even once, its logit receives a flat reduction every time it’s considered again.
- This makes the model less likely to bring up the same concept repeatedly.
- The goal is to encourage the model to introduce new ideas or topics.
With a positive presence penalty:
- The model avoids repeating concepts or words already mentioned
- The output tends to branch into new directions
Negative values would do the opposite and encourage repetition, but this is rarely used.
As described in the documentation, a positive presence penalty
“increases the model’s likelihood to talk about new topics.”
Frequency Penalty: Controlling How Often Words Repeat
The frequency penalty is more fine-grained.
Instead of just asking whether a token appeared before, it asks: How many times has this token already appeared?
- A word used three times is penalized more than a word used once
- This helps prevent overusing common words or names
- It encourages a richer and more varied vocabulary
As one explanation puts it:
“frequency_penalty is bigger if the token has appeared multiple times.”
Higher frequency penalties lead to outputs that naturally vary their wording and phrasing.
How the Values Work
Both penalties usually range from –2.0 to 2.0:
Positive values → discourage repetition
- Presence penalty → pushes toward new topics
- Frequency penalty → pushes toward varied wording
Negative values → encourage repetition (rarely useful)
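OpenAI’s parameter documentation describes the adjustment roughly as subtracting count × frequency_penalty, plus a flat presence_penalty for any token that has appeared at all. A toy Python sketch of that idea (the logits and token IDs are made up):

```python
# Toy version of the presence/frequency adjustment applied to next-token logits.
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.5, presence_penalty=0.5):
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty   # grows with each repeat
        adjusted[token_id] -= presence_penalty            # flat hit for appearing at all
    return adjusted

new_logits = apply_penalties([2.0, 1.0, 0.5, 0.1], generated_ids=[0, 0, 2])
```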
Typical Practical Settings
In real usage:
- presence_penalty ≈ 0.5–1.0
→ helps avoid repeating the same idea again and again
- frequency_penalty ≈ 0.5
→ helps diversify word choice
For very technical outputs, where repeating key terms is important, both penalties are often left at 0.
A Simple Way to Remember the Difference
- Presence penalty says: “You’ve already mentioned this — maybe talk about something new.”
- Frequency penalty says: “You’re using this word a lot — try different wording.”
Together, they give you subtle control over what gets repeated and how often.
Max Tokens / Max Length: Putting a Hard Stop on Output Size
When a language model starts generating text, it doesn’t automatically know when to stop.
That’s why we need a clear rule that says: “This is long enough.”
That rule is set using max tokens (also called max length).
What This Parameter Does
The max tokens / max length setting defines an upper limit on how many tokens the model is allowed to generate after the prompt.
Different frameworks name it slightly differently:
- In Hugging Face Transformers, you’ll usually see max_new_tokens (recommended) or max_length
- In the OpenAI API, it’s max_tokens or max_completion_tokens
Regardless of the name, the idea is the same:
once the model reaches this token limit, generation stops.
How It Affects the Output
This parameter acts purely as a stop condition.
- It does not control creativity, randomness, or coherence
- It simply limits how long the response can be
That said, the chosen value still matters:
- Too small → the answer may be cut off mid-sentence
- Very large → responses can become overly verbose and more expensive to generate
Common Use Cases
You typically set max tokens when you want to control:
- Output length
- API cost
- Context window usage
Examples:
- ~100 tokens → short, direct answers
- 500+ tokens → longer explanations or stories
If you also need to enforce a minimum length, some frameworks provide:
min_length and min_new_tokens
An Important Hugging Face Detail
In Hugging Face Transformers, there’s a subtle but important difference:
- max_length → counts prompt + generated tokens
- max_new_tokens → counts only the tokens generated after the prompt
Because of this, max_new_tokens is often recommended — it makes the limit clearer and avoids accidentally shortening the output because of a long prompt.
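A minimal sketch of the difference, again using “gpt2” only as a stand-in model:

```python
# max_length counts prompt + new tokens; max_new_tokens counts only new tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Explain inference-time hyperparameters:", return_tensors="pt")

out_total = model.generate(**inputs, max_length=60)     # a long prompt eats into this budget
out_new = model.generate(**inputs, max_new_tokens=60)   # prompt length no longer matters
```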
The Big Picture
Think of max tokens as a hard ceiling, not a style control.
It doesn’t change how the model writes —
it just decides when the model must stop writing.
Beam Search and Beam Width (num_beams): Exploring Multiple Paths at Once
When a language model generates text, it doesn’t have to commit to just one possible sentence as it goes along.
Beam search is a decoding strategy that lets the model keep several promising options alive at the same time.
What Beam Search Does
Instead of following only the single best token at each step (greedy decoding), beam search keeps track of the top k partial sequences, called beams.
The value of num_beams sets this number k, also known as the beam width.
Here’s how it works step by step:
- Start with k partial sequences
- At each generation step, extend each beam with all possible next tokens
- Score all these extended sequences (usually using log-probabilities)
- Keep only the top k sequences and discard the rest
- Repeat until the sequences are complete
If:
- num_beams = 1 → beam search is effectively off (pure greedy decoding)
- num_beams > 1 → the model actively explores multiple paths
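A minimal sketch of turning beam search on with Hugging Face’s generate (“gpt2” is a stand-in model; the prompt and values are arbitrary):

```python
# Beam search: keep the num_beams best partial sequences at every step.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The capital of France is", return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,              # beam width k
    do_sample=False,          # deterministic beam search
    num_return_sequences=3,   # return the 3 best completed beams (must be <= num_beams)
    max_new_tokens=20,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```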
How Beam Width Affects Output
Beam search usually produces high-probability, coherent sequences, because it compares multiple possibilities before committing.
Key effects:
- More polished and sensible outputs
- Less randomness than sampling
- Deterministic behavior when do_sample = False
- Slower generation as the number of beams increases
Using more beams generally improves — or at least doesn’t hurt — the quality of the best sequence, but it comes with k-fold computational cost. Too many beams can also lead to generic or repetitive answers if not balanced with other penalties.
Choosing the Beam Width
- Narrow beams (2–4)
Faster, but may miss the best overall sequence
- Wider beams (5–10)
More thorough search, better chances of finding the optimal path, but slower
Returning Multiple Sequences
With num_return_sequences > 1, beam search can return the top N completed beams (where N ≤ k).
This gives you:
- Multiple candidate outputs
- All sorted by their final scores
Early Stopping
In practice, beam search is often combined with early_stopping = True.
This allows generation to stop once all beams have reached an end-of-sequence (EOS) token, instead of forcing shorter completed sequences to keep growing.
When Beam Search Works Best
Beam search is commonly used in tasks where the most likely complete sentence matters, such as:
- Machine translation
- Summarization
It’s less common for open-ended creative tasks, where randomness is usually preferred. If beams are used for open-ended generation, they’re often combined with:
- A length penalty
- do_sample = True (multinomial beam search)
This helps preserve diversity while still benefiting from structured search.
A Clear Summary
As described in the Machine Learning Mastery guide:
“It keeps only k best sequences at each step. Each step will expand this set temporarily and prune it back to k best sequences…”
Beam search is essentially about looking ahead down multiple paths — and choosing the best one with confidence.
Early Stopping: Knowing When to Stop Generating
When a model is using beam search, it’s exploring multiple possible sentences at the same time.
But at some point, those sentences are good enough — and continuing to generate more tokens doesn’t really help.
That’s where early stopping comes in.
What Early Stopping Does
Early stopping controls when text generation should terminate once complete sequences are found.
In many libraries, setting:
early_stopping = True means
“Stop as soon as all num_beams beams have produced an end-of-sequence token.”
If early stopping is turned off:
- Some beams may keep generating tokens
- Generation continues until a length limit or another stopping condition is reached
Some frameworks even support:
early_stopping = "never"
which forces a more exhaustive, canonical beam search.
How It Affects Generation
The biggest impact of early stopping is on speed:
Early stopping ON (True)
- Generation stops as soon as every beam has a complete sentence
- Faster and more efficient
Early stopping OFF (False)
- Beam search may continue if there’s still a chance to improve scores
- Slower, with potentially longer sequences
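In Hugging Face Transformers, all three modes (True, False, and "never") can be set in a GenerationConfig; a minimal sketch:

```python
# The early_stopping modes of beam-based generation in Hugging Face Transformers.
from transformers import GenerationConfig

config = GenerationConfig(
    num_beams=4,
    max_new_tokens=50,
    early_stopping=True,      # stop as soon as all 4 beams have finished
    # early_stopping=False,   # default: stop when better beams are unlikely (a heuristic)
    # early_stopping="never", # canonical beam search: only stop when no better beam is possible
)
# Then pass it along: model.generate(**inputs, generation_config=config)
```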
For deterministic greedy decoding (where there are no beams), this setting has no effect.
As noted in the Hugging Face documentation:
“early_stopping=True stops as soon as there are num_beams complete candidates.”
When to Use Early Stopping
In most cases:
- Set early_stopping = True
You get faster generation and don’t over-generate unnecessary tokens.
Only consider turning it off — or setting it to "never" — if you have a specific need to analyze incomplete sequences, which is rare.
The Simple Intuition
Early stopping is like telling the model:
“Once every path has reached a full sentence, we’re done.”
It keeps beam search efficient without changing what the model says — only when it stops.
Length Penalty: Balancing Short and Long Answers in Beam Search
When beam search is scoring different candidate sentences, there’s a subtle bias at play.
Shorter sequences often look better simply because they have fewer tokens — and fewer tokens means fewer chances to lower the overall score.
The length penalty exists to correct for this.
What the Length Penalty Does
During beam search, each candidate sequence is scored using its log-likelihood.
The length penalty modifies this score by adjusting it based on the sequence length.
Technically:
- The sequence length is raised to a certain power
- This adjusted length is used when scoring the beam
- The parameter controlling this is usually called length_penalty
How the Value Changes Output Length
The value you choose directly affects how long the generated text tends to be:
- length_penalty = 1.0
No adjustment. This is the default in many libraries.
- Greater than 1.0
Encourages longer sequences.
For example, a value of 1.2 boosts longer candidates by raising their length to the power of 1.2 before dividing the log-probability.
- Less than 1.0
Encourages shorter sequences.
A value like 0.8 penalizes longer outputs.
As described in the Hugging Face documentation:
“length_penalty > 1.0 promotes longer sequences, < 1.0 encourages shorter.”
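A toy sketch of how that scoring plays out, assuming the common formulation where a beam’s summed log-probability is divided by its length raised to the power length_penalty:

```python
# Toy beam scoring: log-probabilities are negative, so a larger denominator
# makes long sequences look relatively better.
def beam_score(sum_log_prob, length, length_penalty=1.0):
    return sum_log_prob / (length ** length_penalty)

print(beam_score(-6.0, length=5))                          # -1.2
print(beam_score(-12.0, length=12))                        # -1.0 (longer beam already wins)
print(beam_score(-12.0, length=12, length_penalty=1.2))    # about -0.61 (wins by more)
```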
When to Use a Length Penalty
Length penalty is useful when:
- Beam search outputs end too quickly
- Answers feel cut short or incomplete
By slightly increasing the penalty, you can encourage beams to continue generating more complete responses.
It’s also helpful in tasks like translation, where sentences are expected to have a reasonable length and shouldn’t collapse into trivial, overly short outputs.
The Intuition
Think of length penalty as a way of telling the model:
“Don’t favor short answers just because they’re short.”
It helps beam search strike a better balance between probability and completeness.
Logit Bias / Token Bias: Nudging the Model’s Word Choices
Sometimes, you don’t just want to guide a model — you want to directly influence whether specific words appear or not.
That’s exactly what logit bias (also called token bias) is for.
What Logit Bias Does
Logit bias lets you manually adjust the probability of specific tokens before the model samples the next word.
In the OpenAI API, this is done using a parameter called logit_bias:
- It’s a dictionary that maps token IDs to a bias value
- Bias values typically range from –100 to 100
How those values work:
- Positive bias → increases the token’s logit, making it more likely
- Negative bias → decreases the token’s logit, making it less likely
How Strong the Bias Can Be
The size of the bias determines how forceful the control is:
- –100 → effectively bans the token (its probability becomes nearly zero)
- +100 → effectively forces the token (it becomes almost guaranteed to appear)
- Small values (–1 to +1) → gentle nudges that slightly decrease or increase how often a token is chosen
For example:
- To discourage the word “apple”, you could apply a negative bias to its token
- To ensure the model mentions “dog”, you could apply a positive bias to that token
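A hedged sketch with the OpenAI Python client; the model name is only an example, and the token IDs below are hypothetical placeholders (real IDs come from the model’s tokenizer, e.g. via tiktoken):

```python
# Sketch: ban one token and nudge another using logit_bias in a chat completion.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                  # example model name
    messages=[{"role": "user", "content": "Name a pet in one word."}],
    logit_bias={
        "11111": -100,   # hypothetical token ID for "apple": effectively banned
        "22222": 5,      # hypothetical token ID for "dog": nudged upward
    },
)
print(response.choices[0].message.content)
```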
As the documentation explains, logit bias allows you to
“modify the likelihood of specified tokens,”
with values near ±100 acting as hard bans or requirements.
When Logit Bias Is Used
Common use cases include:
- Preventing profanity or unwanted terms by applying negative biases
- Reducing hallucinations by discouraging certain tokens
- Ensuring specific terminology appears when it’s required
A Word of Caution
Logit bias is a powerful, low-level control.
Because it directly alters token probabilities, it can easily distort the model’s natural behavior if overused.
Think of it as a precision tool — best used sparingly and intentionally.
The Intuition
Logit bias is like telling the model:
- “Avoid this word.”
- “Try really hard to say this word.”
You’re not changing how the model thinks —
you’re just tilting the odds for specific tokens.
Conclusion: From Guessing to Control
So now you know the truth.
Large Language Models aren’t unpredictable black boxes.
They don’t hallucinate randomly.
They respond exactly to the controls you give them.
These aren’t just hyperparameters.
They’re levers of control.
And once you understand them…
You stop fighting the model.
You stop blaming the prompt.
And you start engineering behavior.
This is the difference between:
- Using an LLM
- And controlling an LLM
Between hoping for a good answer…
and designing one.