DataTechNotes: How to Use Top-P and Top-K Sampling in LLMs

In this post, we'll briefly learn what Top-K and Top-P sampling are, how they differ from temperature, and how to tune them to control the quality and diversity of LLM output in Python. The tutorial covers:

What are Top-K and Top-P Sampling?
How Top-K Sampling Works
How Top-P Sampling Works
Installation and Setup
Effect of Top-K on Output
Effect of Top-P on Output
Comparing Top-K and Top-P Directly
Combining Temperature, Top-K, and Top-P
Choosing the Right Sampling Parameters
Conclusion

Let's get started.

What are Top-K and Top-P Sampling?

Every time an LLM generates the next token, it computes a probability for every token in its vocabulary, which typically holds tens of thousands to a few hundred thousand entries. Sampling strategies decide which subset of those candidates the model is allowed to choose from. Without any restriction the model could, in theory, pick any token, including extremely unlikely and nonsensical ones.

Top-K sampling addresses this by keeping only the K highest-probability tokens and redistributing all probability mass to that fixed shortlist before sampling. Top-P sampling (also called nucleus sampling) takes a different approach: instead of a fixed count, it keeps the smallest group of tokens whose cumulative probability reaches at least P. Both strategies reduce the chance of emitting garbage tokens, but they handle uncertainty in subtly different ways, and understanding that difference is the key to tuning them well.

How Top-K Sampling Works

At each generation step the model ranks all tokens by probability and discards everything below rank K. The remaining K tokens are re-normalised so their probabilities sum to 1, and the model samples from this smaller pool. Because K is a fixed number, the pool size never changes — regardless of whether one token dominates with 90 % probability or fifty tokens share roughly equal probability.

Top-K Value	Candidate Pool	Output Character	Risk
1	1 token (greedy)	Fully deterministic	Repetitive, robotic
5 – 20	Small, high-quality pool	Focused and coherent	May cut good options
40 – 60	Medium pool	Balanced variety	Reasonable default
100+	Large pool	Creative, unpredictable	May include poor tokens
Vocabulary size	All tokens	Pure temperature sampling	No filtering at all
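The filtering step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how Ollama implements it: the function name top_k_filter and the toy distribution are assumptions made up for this example.

```python
import random

def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalise them to sum to 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank tokens by descending probability and keep the first k.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

# Hypothetical next-token distribution, invented for illustration only.
toy_probs = {
    "dark": 0.40, "vast": 0.25, "silent": 0.15,
    "cold": 0.10, "restless": 0.06, "purple": 0.04,
}

pool = top_k_filter(toy_probs, k=3)
print(pool)  # only "dark", "vast" and "silent" survive

# Sample the next token from the renormalised shortlist.
next_token = random.choices(list(pool), weights=list(pool.values()), k=1)[0]
print(next_token)
```

Whatever the shape of the distribution, the pool always holds exactly K tokens (or the whole vocabulary if K is larger than it).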

How Top-P Sampling Works

Top-P sampling sorts tokens by descending probability and walks down the list, adding tokens until their cumulative probability reaches P. All tokens below the cutoff are discarded and the survivors are re-normalised. The crucial difference from Top-K is that the pool size is adaptive: when the model is confident (one token has 80 % probability), only a handful of tokens are needed to reach P=0.95, so the pool is small. When the model is uncertain (probability is spread across many tokens), many more tokens are needed, so the pool grows automatically.

Top-P Value	Pool Behaviour	Output Character	Best For
0.1 – 0.5	Very small nucleus	Conservative, safe	Factual QA, classification
0.7 – 0.9	Moderate nucleus	Balanced and coherent	Summaries, chat
0.90 – 0.95	Standard nucleus	Creative with guardrails	General-purpose default
0.99 – 1.0	Near-full vocabulary	Highly diverse, risky	Experimental writing
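The walk down the sorted list can be sketched as follows. This is a simplified illustration under assumed names (top_p_filter and both toy distributions are invented here), not the code any particular inference engine runs.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # nucleus is complete, drop everything after it
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical distributions, invented for illustration only.
confident = {"100": 0.80, "212": 0.12, "373": 0.05, "99": 0.03}
uncertain = {f"word{i}": 0.125 for i in range(8)}  # eight equally likely tokens

small_pool = top_p_filter(confident, p=0.95)
large_pool = top_p_filter(uncertain, p=0.95)
print(len(small_pool), len(large_pool))
```

With the same P, the confident distribution needs only three tokens to reach the threshold, while the flat one needs all eight.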

Installation and Setup

All examples in this tutorial use Ollama with the llama3.2 model running locally. Install Ollama from ollama.com, pull the model, and install the Python client. Hosted APIs expose similar controls, but not identically: the Anthropic and Google Gemini APIs accept both Top-K and Top-P, whereas the OpenAI API offers top_p but no top_k parameter, so check each provider's reference for exact names and ranges.


# pip install ollama

import ollama

def generate(
    prompt: str,
    temperature: float = 0.7,
    top_k: int = 40,
    top_p: float = 0.95,
    system: str = ""
) -> str:
    """Send a prompt with explicit sampling parameters and return the reply."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={
            "temperature": temperature,
            "top_k":       top_k,
            "top_p":       top_p,
        }
    )
    return response["message"]["content"]

The helper function generate() accepts all three sampling parameters plus an optional system prompt. We hold two parameters constant in each section and vary only the one under investigation so the effect is clearly isolated.

Effect of Top-K on Output

To isolate the effect of Top-K we fix temperature at 0.8 and Top-P at 1.0 (effectively disabled), then sweep Top-K from very small to very large across the same creative prompt. This shows how shrinking or expanding the fixed candidate pool changes the character of the output.


import ollama

def generate(prompt, temperature=0.8, top_k=40, top_p=1.0, system=""):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "top_k": top_k, "top_p": top_p}
    )
    return response["message"]["content"]

prompt = "Write one sentence describing the ocean at midnight."

top_k_values = [1, 5, 20, 50, 100]

for k in top_k_values:
    reply = generate(prompt, temperature=0.8, top_k=k, top_p=1.0)
    print(f"[top_k={k:>3}]  {reply.strip()}")

Output:


[top_k=  1]  The ocean at midnight is a vast, dark expanse of silence.
[top_k=  5]  The ocean at midnight stretches endlessly into a black and restless dark.
[top_k= 20]  At midnight, the ocean breathes slowly, its surface fractured by cold starlight.
[top_k= 50]  The ocean at midnight holds its breath like a secret too heavy to whisper.
[top_k=100]  Beneath a bruised and lightless sky, the midnight ocean swallows every sound whole.

At top_k=1 the output is safe but flat because the model always picks the single most probable token. As Top-K grows, the model gains access to less likely vocabulary choices and, in this run, produces more vivid and varied imagery. Beyond roughly top_k=100 the returns usually diminish and the risk of incoherence increases, especially at high temperatures.

Effect of Top-P on Output

To isolate Top-P we fix temperature at 0.8 and effectively disable Top-K by setting it to a very large value (10000), then sweep Top-P from a very small nucleus to nearly the full distribution. Unlike Top-K, the pool shrinks and grows with model confidence at every step, so the effect is easiest to see by comparing outputs at the two ends of the range.


import ollama

def generate(prompt, temperature=0.8, top_k=10000, top_p=0.95, system=""):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "top_k": top_k, "top_p": top_p}
    )
    return response["message"]["content"]

prompt = "Write one sentence describing the ocean at midnight."

top_p_values = [0.1, 0.3, 0.6, 0.9, 0.99]

for p in top_p_values:
    reply = generate(prompt, temperature=0.8, top_k=10000, top_p=p)
    print(f"[top_p={p}]  {reply.strip()}")

Output:


[top_p=0.1]   The ocean at midnight is dark and silent.
[top_p=0.3]   The ocean at midnight is a vast and silent expanse of darkness.
[top_p=0.6]   At midnight the ocean becomes a still mirror of starless sky.
[top_p=0.9]   The midnight ocean breathes in long, slow sighs beneath a canopy of cold stars.
[top_p=0.99]  Like a dreaming giant turning in its sleep, the midnight ocean stirs with slow
               and ancient restlessness beneath a sky that holds no moon.

At top_p=0.1 the nucleus is tiny, often a single token per step, and the output is sparse and plain. At top_p=0.99 only the least likely 1 % of probability mass is cut, so the model can reach rarer words and produces rich, elaborate imagery. In this run the sentence length also grows with Top-P, although a single sample per setting is too little to treat that as a general rule.

Comparing Top-K and Top-P Directly

The key practical difference between Top-K and Top-P is how they handle model confidence. On an easy, predictable step (e.g. completing a well-known phrase), Top-K still samples from K tokens even if most are poor choices, while Top-P automatically collapses to just the few high-probability tokens. On an ambiguous step, Top-K may be too restrictive while Top-P expands freely. The example below runs a factual prompt and a creative prompt through both strategies at the same temperature to highlight this asymmetry.


import ollama

def generate(prompt, temperature=0.7, top_k=40, top_p=0.95):
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": temperature, "top_k": top_k, "top_p": top_p}
    )
    return response["message"]["content"]

prompts = {
    "Factual":  "What is the boiling point of water at sea level? One sentence.",
    "Creative": "Describe the colour blue to someone who has never seen it. One sentence.",
}

configs = {
    "Top-K=10  (fixed small pool)": {"top_k": 10,    "top_p": 1.0},
    "Top-P=0.9 (adaptive nucleus)": {"top_k": 10000, "top_p": 0.9},
}

for prompt_label, prompt in prompts.items():
    print(f"── {prompt_label} prompt ──")
    for cfg_label, cfg in configs.items():
        reply = generate(prompt, temperature=0.7, **cfg)
        print(f"  [{cfg_label}]")
        print(f"  {reply.strip()}")
    print()

Output:


── Factual prompt ──
  [Top-K=10  (fixed small pool)]
  Water boils at 100 degrees Celsius (212 degrees Fahrenheit) at sea level.
  [Top-P=0.9 (adaptive nucleus)]
  Water boils at 100 degrees Celsius, or 212 degrees Fahrenheit, at sea level.

── Creative prompt ──
  [Top-K=10  (fixed small pool)]
  Blue is a calm, cool colour — like the feeling of a gentle breeze on a warm day.
  [Top-P=0.9 (adaptive nucleus)]
  Blue is the feeling of standing at the edge of something vast and unhurried,
  the way silence sounds just before rain.

On the factual prompt both strategies produce nearly identical correct answers: the model is highly confident, so either pool contains the right tokens. On the creative prompt, Top-P at 0.9 yields a richer, more metaphorical sentence in this run, consistent with its nucleus widening at steps where many continuations are plausible and Top-K=10 would prune some of them away.
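The asymmetry can also be measured on toy distributions without calling a model. The sketch below is an illustration under assumptions: the 64-token vocabulary, the logit values, and the helper names top_k_mass and nucleus_size are all invented for this example.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mass(probs: list[float], k: int) -> float:
    """Probability mass covered by the k most probable tokens."""
    return sum(sorted(probs, reverse=True)[:k])

def nucleus_size(probs: list[float], p: float) -> int:
    """Number of tokens in the smallest set whose cumulative mass reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

# Toy 64-token vocabularies with invented logits.
peaked = softmax([8.0] + [0.0] * 63)  # one dominant token (a "factual" step)
flat = softmax([0.0] * 64)            # all tokens equally likely (a "creative" step)

for name, probs in [("peaked", peaked), ("flat", flat)]:
    print(f"{name:>6}: top_k=10 covers {top_k_mass(probs, 10):.3f} of the mass, "
          f"top_p=0.9 keeps {nucleus_size(probs, 0.9)} tokens")
```

On the peaked distribution Top-P keeps a single token while Top-K=10 still admits nine unlikely ones; on the flat distribution Top-K=10 covers only a small share of the plausible options while the nucleus expands to most of the vocabulary.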

Combining Temperature, Top-K, and Top-P

In practice all three parameters are applied together. A common ordering is the one described here: temperature rescales the distribution first, then Top-K removes low-ranked tokens, and finally Top-P trims the nucleus by cumulative mass, although the exact order varies between inference engines. Where temperature runs first, the order matters for Top-P: a high temperature flattens the probabilities, so the same Top-P value admits more tokens than it would at a low temperature. Top-K is unaffected in size, because temperature does not change the ranking, but the K survivors are sampled more evenly. Below we test five preset combinations on the same brainstorming prompt.


import ollama

def generate(prompt, temperature=0.7, top_k=40, top_p=0.95, system=""):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "top_k": top_k, "top_p": top_p}
    )
    return response["message"]["content"]

system = "You are a creative product designer. Respond with one idea only."
prompt = "Suggest an unusual feature for a smart alarm clock."

presets = [
    {"label": "Deterministic",  "temperature": 0.0, "top_k": 1,   "top_p": 1.0},
    {"label": "Conservative",   "temperature": 0.3, "top_k": 20,  "top_p": 0.85},
    {"label": "Balanced",       "temperature": 0.7, "top_k": 40,  "top_p": 0.95},
    {"label": "Creative",       "temperature": 1.0, "top_k": 60,  "top_p": 0.97},
    {"label": "Experimental",   "temperature": 1.3, "top_k": 100, "top_p": 1.0},
]

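for preset in presets:
    # Split the label from the sampling parameters without mutating the preset,
    # so the loop can be re-run safely.
    label = preset["label"]
    params = {k: v for k, v in preset.items() if k != "label"}
    reply = generate(prompt, system=system, **params)
    cfg_str = "  ".join(f"{k}={v}" for k, v in params.items())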
    print(f"[{label}]  {cfg_str}")
    print(f"  {reply.strip()}")
    print()

Output:


[Deterministic]  temperature=0.0  top_k=1  top_p=1.0
  A sunrise simulation lamp that gradually brightens over 30 minutes before
  the alarm sounds.

[Conservative]  temperature=0.3  top_k=20  top_p=0.85
  An alarm that releases a chosen scent — fresh coffee or citrus — two
  minutes before it rings.

[Balanced]  temperature=0.7  top_k=40  top_p=0.95
  A clock that reads your sleep stage via a wrist sensor and only triggers
  the alarm during the lightest phase within a 20-minute wake window.

[Creative]  temperature=1.0  top_k=60  top_p=0.97
  An alarm that silently texts your best friend if you snooze more than
  three times, applying gentle social accountability.

[Experimental]  temperature=1.3  top_k=100  top_p=1.0
  A clock that slowly deflates your pillow over fifteen minutes — no sound,
  no light, just the faint conspiracy of gravity and comfort slowly parting ways.

Each preset produces a qualitatively different kind of idea: the Deterministic preset gives the textbook answer, Conservative gives a sensible product feature, Balanced gives a technically interesting concept, Creative gives a playful social-tech idea, and Experimental produces the most original — and slightly absurd — concept of the five.
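To make the combined pipeline concrete, here is a minimal sketch that applies temperature, then Top-K, then Top-P to a toy set of logits. The ordering follows the description in this section and is an assumption, since real engines may order the steps differently; the function name candidate_pool and the logit values are invented for illustration.

```python
import math
import random

def candidate_pool(logits: dict[str, float], temperature: float,
                   top_k: int, top_p: float) -> dict[str, float]:
    """Return the renormalised sampling pool after temperature, Top-K, Top-P."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: only the best token remains.
        return {max(logits, key=logits.get): 1.0}
    # 1. Temperature: rescale logits, then softmax into probabilities.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    # 2. Top-K: keep the k highest-ranked tokens and renormalise.
    ranked = ranked[:max(top_k, 1)]
    total = sum(prob for _, prob in ranked)
    ranked = [(tok, prob / total) for tok, prob in ranked]
    # 3. Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical logits for four candidate tokens, invented for illustration.
toy_logits = {"sunrise": 5.0, "scent": 3.0, "pillow": 1.0, "xylophone": -2.0}

cool = candidate_pool(toy_logits, temperature=0.3, top_k=4, top_p=0.97)
warm = candidate_pool(toy_logits, temperature=1.3, top_k=4, top_p=0.97)
print(len(cool), len(warm))  # the same top_p admits more tokens when warm

rng = random.Random(0)
token = rng.choices(list(warm), weights=list(warm.values()), k=1)[0]
print(token)
```

With identical Top-K and Top-P values, the low-temperature pool collapses to one token while the high-temperature pool keeps three, which is the interaction between temperature and Top-P at work.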

Choosing the Right Sampling Parameters

A common rule of thumb is to treat either Top-K or Top-P as the primary truncation strategy and leave the other loose, because tightening both at once can over-constrain the sampling pool. Some provider documentation gives similar advice for temperature and Top-P, suggesting that you adjust one or the other rather than both. The table below summarises starting points for common use cases that can be tuned from there.

Use Case	Temperature	Top-K	Top-P	Priority Strategy
JSON / data extraction	0.0	1	1.0	Greedy (no sampling)
Code generation	0.1 – 0.2	10 – 20	0.9	Top-P primary
Factual Q&A / RAG	0.1 – 0.3	20	0.85	Top-P primary
Summarisation	0.3 – 0.5	40	0.9	Both moderate
General chat	0.7	40	0.95	Balanced default
Marketing copy	0.9	60	0.95	Top-K primary
Creative writing	1.0 – 1.2	80	0.97	Top-P primary
Brainstorming	1.2 – 1.4	100	1.0	Max diversity

When in doubt, start with the Balanced preset (temperature=0.7, top_k=40, top_p=0.95) and adjust one parameter at a time. If outputs feel repetitive, raise Top-K or Top-P. If outputs feel incoherent, lower temperature first, then tighten Top-P. Always evaluate on a representative sample of real prompts rather than a single example.
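The "one parameter at a time" advice can be turned into a small helper that builds the configurations to try. This is only a sketch: the helper name one_at_a_time and the sweep values are hypothetical, and the baseline is the Balanced preset from this post.

```python
from typing import Iterator

# Balanced preset from this post, used as the fixed baseline.
BALANCED = {"temperature": 0.7, "top_k": 40, "top_p": 0.95}

def one_at_a_time(baseline: dict, sweeps: dict[str, list]) -> Iterator[dict]:
    """Yield configs that differ from the baseline in exactly one parameter."""
    for name, values in sweeps.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline, nothing to compare
            yield {**baseline, name: value}

# Hypothetical sweep values, chosen only to illustrate the idea.
sweeps = {"top_k": [20, 40, 80], "top_p": [0.9, 0.95, 1.0]}

configs = list(one_at_a_time(BALANCED, sweeps))
for cfg in configs:
    print(cfg)
```

Each dictionary can be unpacked into the generate() helper defined earlier, for example generate(prompt, **cfg), and run over a representative set of prompts so that any change in output can be attributed to a single parameter.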

Conclusion

In this post, we briefly learned what Top-K and Top-P sampling are, how they filter the token candidate pool before each sampling step, and how they differ in their adaptive versus fixed behaviour. We isolated the effect of each parameter independently, compared them directly on factual and creative prompts, combined them with temperature into practical presets, and built a reference table mapping common tasks to recommended starting values. Together, temperature, Top-K, and Top-P form a complete toolkit for controlling LLM output quality and diversity from a single function call.
