DataTechNotes: How to Control LLM Output Randomness with Temperature in Python

In this post, we'll briefly learn what temperature is in the context of large language models, how it controls the randomness of generated text, and how to set it correctly for different tasks in Python. The tutorial covers:

What is Temperature?
How Temperature Works
Installation and Setup
Comparing Temperature Values Side by Side
Low Temperature for Factual and Structured Tasks
High Temperature for Creative Tasks
Temperature and Top-p Sampling
Choosing the Right Temperature
Conclusion

Let's get started.

What is Temperature?

Temperature is a single floating-point sampling parameter, typically between 0.0 and 2.0, that controls how randomly an LLM picks its next token during text generation. A low temperature makes the model cautious and repetitive, strongly favouring the most probable next token. A high temperature makes it adventurous and unpredictable, giving unlikely tokens a real chance of being chosen. Setting it correctly is one of the most practical levers available to any developer working with language models.

The name comes from statistical thermodynamics: in physics, higher temperature means particles move more randomly. In LLMs, higher temperature means the probability distribution over the next token is flattened — more tokens become plausible, so the model explores further from its default, most-likely answer.

How Temperature Works

At each generation step, the model computes a raw score — called a logit — for every token in its vocabulary. These logits are converted into probabilities using the softmax function. Temperature is applied by dividing every logit by the temperature value T before the softmax is computed.

When T < 1.0 the logits are made larger in magnitude, which sharpens the softmax distribution: the highest-probability token dominates and others are suppressed. When T = 1.0 the logits are unchanged and the model samples from its original distribution. When T > 1.0 the logits are made smaller, which flattens the distribution: many tokens share similar probabilities and the model samples more freely. At T = 0.0 the division is undefined, so implementations typically treat it as a special case and always pick the single token with the highest logit (greedy decoding).
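The scaling described above can be written as p_i = exp(z_i / T) / sum_j exp(z_j / T), where z_i is the logit of token i. The following is a minimal sketch in plain Python, not the code any particular inference engine uses. The function name softmax_with_temperature and the four logits are made up for illustration, and T = 0 is handled as an explicit greedy special case.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return next-token probabilities for raw logits at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Dividing by zero is undefined, so treat T=0 as greedy decoding:
        # all probability mass goes to the highest-logit token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Running it shows the first token's share shrinking as T grows, while the remaining tokens gain probability.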

Temperature	Distribution Shape	Output Character	Best For
0.0	Greedy — one winner	Fully deterministic	JSON extraction, classification
0.1 – 0.4	Sharp — top tokens dominate	Focused, predictable	Factual QA, code generation
0.5 – 0.8	Balanced	Coherent with mild variety	General chat, summaries
0.9 – 1.2	Flat — many tokens compete	Creative, varied	Story writing, brainstorming
1.3 – 2.0	Very flat — rare tokens likely	Highly random, may be incoherent	Experimental / artistic use
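One way to put numbers on the "Distribution Shape" column is to measure the top-token probability and the Shannon entropy of the distribution at a few temperatures. The sketch below uses invented logits for a five-token vocabulary, so the exact figures are illustrative only. Real models have vocabularies of many thousands of tokens.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax for temperature > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means a flatter distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for a five-token vocabulary.
logits = [3.0, 2.0, 1.0, 0.0, -1.0]

for t in (0.2, 0.7, 1.0, 1.5, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: top-token p={max(probs):.3f}, "
          f"entropy={entropy_bits(probs):.3f} bits")
```

As the temperature rises, the top-token probability falls and the entropy climbs toward its maximum of log2(5) bits, which is the "flat" end of the table.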

Installation and Setup

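All examples in this tutorial use Ollama with the llama3.2 model running locally. Install Ollama from ollama.com, pull the model, and install the Python client. The temperature parameter works the same way conceptually across Ollama, OpenAI, Anthropic, and other providers, but the accepted range and default value differ between them (for example, OpenAI's API accepts 0.0 to 2.0 while Anthropic's accepts 0.0 to 1.0), so check your provider's documentation before reusing a value.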


# Terminal: pull the model once
ollama pull llama3.2

# Install the Python client
pip install ollama


import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    """Send a prompt at a given temperature and return the reply."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

We define a reusable generate() helper that accepts a prompt, a temperature value, and an optional system prompt. All sections below call this helper so we can isolate the effect of temperature cleanly.

Comparing Temperature Values Side by Side

The clearest way to understand temperature is to run the exact same prompt at several different values and compare the outputs. The prompt below asks the model to complete a sentence — a task sensitive enough to show clear differences across the temperature range.


import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

prompt = "Continue this sentence in one line: The future of artificial intelligence is"

temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]

for t in temperatures:
    reply = generate(prompt, temperature=t)
    print(f"[T={t}]  {reply.strip()}")

Output:


[T=0.0]  ...a rapidly evolving field that will transform every industry on Earth.
[T=0.3]  ...a rapidly evolving landscape that will reshape how we work and live.
[T=0.7]  ...both thrilling and uncertain, promising breakthroughs we can barely imagine.
[T=1.0]  ...a canvas painted by human curiosity, machine ingenuity, and ethical courage.
[T=1.5]  ...an unfolding tapestry of chaotic brilliance stitched between wonder and dread.

At T=0.0 and T=0.3 the completions are safe and conventional. By T=1.0 the language becomes more figurative, and at T=1.5 it edges toward the poetic — or occasionally incoherent, depending on the run.

Low Temperature for Factual and Structured Tasks

Tasks that require accuracy, consistency, or machine-parseable output should always use a low temperature — ideally 0.0 to 0.2. This includes JSON extraction, code generation, classification, maths reasoning, and any task where a wrong but creative answer is worse than a right but boring one. Below we run the same code generation prompt five times at T=0.0 and T=1.2 to illustrate the stability difference.


import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

prompt = (
    "Write a Python one-liner that reads a CSV file called data.csv "
    "into a pandas DataFrame. Return only the code, nothing else."
)

print("=== T=0.0 (deterministic) — run 5 times ===")
for i in range(5):
    print(f"  Run {i+1}: {generate(prompt, temperature=0.0).strip()}")

print()
print("=== T=1.2 (high randomness) — run 5 times ===")
for i in range(5):
    print(f"  Run {i+1}: {generate(prompt, temperature=1.2).strip()}")

Output:


=== T=0.0 (deterministic) — run 5 times ===
  Run 1: df = pd.read_csv("data.csv")
  Run 2: df = pd.read_csv("data.csv")
  Run 3: df = pd.read_csv("data.csv")
  Run 4: df = pd.read_csv("data.csv")
  Run 5: df = pd.read_csv("data.csv")

=== T=1.2 (high randomness) — run 5 times ===
  Run 1: df = pd.read_csv("data.csv")
  Run 2: import pandas as pd; df = pd.read_csv('data.csv')
  Run 3: df = pd.read_csv("data.csv", sep=",")
  Run 4: data = pd.read_csv("data.csv"); print(data.head())
  Run 5: df=pd.read_csv('data.csv')

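At T=0.0 every run produces the identical answer here, which is critical for unit-testable or pipeline-integrated code generation. Greedy decoding is repeatable in practice, although small numerical differences across hardware or library versions can occasionally change a result, so it is worth verifying on your own setup. At T=1.2 the answers are all technically correct but vary in quoting style, variable names, and added extras, which introduces unnecessary inconsistency for a task that has one obvious right answer.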
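To check stability on your own prompts rather than by eye, you can count how often each distinct reply appears. This is a minimal sketch: stability_report and is_stable are hypothetical helper names, and fake_generate is a stand-in so the example runs without a model. In practice you would pass the generate() helper from this tutorial instead.

```python
from collections import Counter
from typing import Callable

def stability_report(generate_fn: Callable[[str, float], str],
                     prompt: str, temperature: float, runs: int = 5) -> Counter:
    """Call generate_fn repeatedly and count each distinct reply."""
    replies = [generate_fn(prompt, temperature).strip() for _ in range(runs)]
    return Counter(replies)

def is_stable(report: Counter) -> bool:
    """True when every run produced the same reply."""
    return len(report) == 1

# Stand-in generator (an assumption for illustration) that always
# returns the same string, mimicking the T=0.0 behaviour shown above.
def fake_generate(prompt: str, temperature: float) -> str:
    return 'df = pd.read_csv("data.csv")'

report = stability_report(fake_generate, "read a CSV", temperature=0.0)
print(report.most_common(1), is_stable(report))
```

With the real helper, the call would look like stability_report(generate, prompt, temperature=0.0), and a report with more than one key signals that the setting is not stable enough for a pipeline.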

High Temperature for Creative Tasks

Creative tasks — story writing, brainstorming, marketing copy, poetry, and idea generation — benefit from higher temperature values because variety and surprise are desirable properties. Below we generate five distinct product taglines for the same brief at T=0.2 and T=1.0 to demonstrate the difference in creative diversity.


import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

system = "You are a creative copywriter. Respond with a single tagline only."
prompt = "Write a catchy tagline for a smart water bottle that tracks hydration."

print("=== T=0.2 — low creativity ===")
for i in range(5):
    print(f"  {i+1}. {generate(prompt, temperature=0.2, system=system).strip()}")

print()
print("=== T=1.0 — high creativity ===")
for i in range(5):
    print(f"  {i+1}. {generate(prompt, temperature=1.0, system=system).strip()}")

Output:


=== T=0.2 — low creativity ===
  1. Stay hydrated, stay ahead.
  2. Stay hydrated, stay ahead.
  3. Drink smarter, live better.
  4. Stay hydrated, stay ahead.
  5. Drink smarter, live better.

=== T=1.0 — high creativity ===
  1. Your thirst has met its match.
  2. Sip. Track. Thrive.
  3. Because your body keeps the score.
  4. Every drop, accounted for.
  5. Hydration finally has a brain.

At low temperature the model cycles between two safe formulas. At T=1.0 every run produces a genuinely distinct tagline — each one different in angle, rhythm, and vocabulary. For a brainstorming session where you want ten distinct ideas to evaluate, higher temperature is clearly the better choice.
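When collecting ideas across many runs, it helps to drop repeats before reviewing them. The sketch below is one simple way to do that, using the taglines from the output above as sample data. The helper name unique_ideas is made up for this example, and it only catches duplicates that differ in letter case or whitespace, not paraphrases.

```python
def unique_ideas(candidates):
    """Drop case- and whitespace-insensitive duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for text in candidates:
        # Normalise case and collapse whitespace to build a comparison key.
        key = " ".join(text.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(text.strip())
    return kept

low_temp = [
    "Stay hydrated, stay ahead.",
    "Stay hydrated, stay ahead.",
    "Drink smarter, live better.",
    "Stay hydrated, stay ahead.",
    "Drink smarter, live better.",
]
high_temp = [
    "Your thirst has met its match.",
    "Sip. Track. Thrive.",
    "Because your body keeps the score.",
    "Every drop, accounted for.",
    "Hydration finally has a brain.",
]

print(len(unique_ideas(low_temp)), "distinct of", len(low_temp))    # 2 distinct of 5
print(len(unique_ideas(high_temp)), "distinct of", len(high_temp))  # 5 distinct of 5
```

The ratio of distinct ideas to total runs gives a rough, quick measure of how much variety a temperature setting is buying you.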

Temperature and Top-p Sampling

Temperature is often used alongside top-p (also called nucleus sampling), a complementary randomness control. While temperature rescales the entire probability distribution, top-p sets a cumulative probability threshold and restricts sampling to the smallest set of tokens whose probabilities sum to at least p. With top_p below 1.0, the least likely tokens are excluded even at high temperatures. With top_p set to 1.0, no tokens are filtered out.
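The nucleus selection itself is easy to sketch on a toy distribution. The code below is an illustrative implementation, not the one Ollama uses internally. The function name top_p_filter and the five probabilities are assumptions made for the example.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalise the kept probabilities."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token probabilities for a five-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
for p in (0.5, 0.9, 1.0):
    print(f"top_p={p}: kept tokens {sorted(top_p_filter(probs, p))}")
```

With top_p=0.9 the two rarest tokens are dropped, and with top_p=1.0 all five tokens stay in play.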


import ollama

def generate_with_params(
    prompt: str,
    temperature: float,
    top_p: float,
    top_k: int = 40
) -> str:
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
        options={
            "temperature": temperature,
            "top_p":       top_p,
            "top_k":       top_k
        }
    )
    return response["message"]["content"]

prompt = "Describe the feeling of solving a hard bug in one sentence."

configs = [
    {"label": "Deterministic",      "temperature": 0.0, "top_p": 1.0,  "top_k": 1},
    {"label": "Conservative",       "temperature": 0.3, "top_p": 0.9,  "top_k": 20},
    {"label": "Balanced (default)", "temperature": 0.7, "top_p": 0.95, "top_k": 40},
    {"label": "Creative",           "temperature": 1.0, "top_p": 0.95, "top_k": 60},
    {"label": "Highly Random",      "temperature": 1.4, "top_p": 1.0,  "top_k": 100},
]

for cfg in configs:
    reply = generate_with_params(prompt, cfg["temperature"], cfg["top_p"], cfg["top_k"])
    print(f"[{cfg['label']}]")
    print(f"  T={cfg['temperature']}  top_p={cfg['top_p']}  top_k={cfg['top_k']}")
    print(f"  {reply.strip()}")
    print()

Output:


[Deterministic]
  T=0.0  top_p=1.0  top_k=1
  It feels like lifting a fog that has been clouding your mind for hours.

[Conservative]
  T=0.3  top_p=0.9  top_k=20
  It is a quiet triumph — the relief of order restored after hours of chaos.

[Balanced (default)]
  T=0.7  top_p=0.95  top_k=40
  Like cracking open a locked room and finally seeing sunlight pour through.

[Creative]
  T=1.0  top_p=0.95  top_k=60
  A small, private fireworks display that nobody else in the room can see.

[Highly Random]
  T=1.4  top_p=1.0  top_k=100
  Equal parts euphoria and embarrassment — why did it take you this long?

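The top_k parameter further restricts sampling to the k most probable tokens at each step. A balanced preset of temperature=0.7, top_p=0.95, top_k=40 is a reasonable starting point that you can then adjust depending on the task. Note that "default" in the example label refers to this tutorial's starting preset. The built-in defaults of Ollama or of a specific model may differ, so check the model's parameters if you rely on them.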
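As a rough illustration of top-k filtering, the sketch below keeps only the k most probable tokens, renormalises them, and then samples. The names top_k_filter and sample are hypothetical, the probabilities are invented, and the fixed seed is only there to make the run repeatable.

```python
import random

def top_k_filter(probs, top_k):
    """Keep the top_k most probable tokens and renormalise."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(filtered, rng):
    """Draw one token index according to the filtered probabilities."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token probabilities for a five-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
rng = random.Random(0)  # seeded so repeated runs give the same draws

kept = top_k_filter(probs, top_k=2)
draws = [sample(kept, rng) for _ in range(10)]
print("kept:", kept)
print("draws:", draws)
```

Setting top_k=1 leaves a single candidate, which is why the "Deterministic" configuration above pairs it with a temperature of 0.0.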

Choosing the Right Temperature

Selecting a temperature does not have to be guesswork: a good first value follows from the nature of the task. The decision comes down to one question: is there a correct answer, or is variety the goal? The table below maps common LLM use cases to their recommended temperature ranges as practical starting points.

Use Case	Recommended Temperature	Reason
JSON / data extraction	0.0	Must be deterministic and parseable
Code generation	0.1 – 0.2	Correct syntax matters more than novelty
Factual Q&A / RAG	0.1 – 0.3	Accuracy over creativity
Summarisation	0.3 – 0.5	Faithful to source with mild variety
General chat / assistants	0.5 – 0.8	Natural, engaging, not robotic
Marketing copy / taglines	0.8 – 1.0	Variety and freshness are desirable
Story / creative writing	0.9 – 1.2	Imagination and surprise are the goal
Brainstorming / ideation	1.0 – 1.4	Maximise idea diversity

These are starting points, not fixed rules. Always test a few values on a representative sample of your actual prompts and evaluate the outputs against your quality criteria before settling on a final value. Different models also respond differently to the same temperature — a value that works well for llama3.2 may need tuning when switching to Mistral or GPT-4o.
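To turn the table into a small test sweep, one option is to store the ranges in a dictionary and generate a few evenly spaced candidate values per use case. This is a sketch only: the dictionary keys, the RECOMMENDED_RANGES name, and the candidate_temperatures helper are all invented for this example, while the ranges are copied from the table above.

```python
# Ranges copied from the table above; the key names are illustrative.
RECOMMENDED_RANGES = {
    "json_extraction": (0.0, 0.0),
    "code_generation": (0.1, 0.2),
    "factual_qa": (0.1, 0.3),
    "summarisation": (0.3, 0.5),
    "general_chat": (0.5, 0.8),
    "marketing_copy": (0.8, 1.0),
    "creative_writing": (0.9, 1.2),
    "brainstorming": (1.0, 1.4),
}

def candidate_temperatures(use_case, steps=3):
    """Return evenly spaced temperatures across the recommended range."""
    low, high = RECOMMENDED_RANGES[use_case]
    if steps < 2 or low == high:
        return [low]
    width = (high - low) / (steps - 1)
    # Round to two decimals to hide floating-point noise.
    return [round(low + i * width, 2) for i in range(steps)]

print(candidate_temperatures("general_chat"))      # [0.5, 0.65, 0.8]
print(candidate_temperatures("json_extraction"))   # [0.0]
```

Each candidate value can then be passed to the generate() helper with a representative prompt, and the outputs compared against your own quality criteria.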

Conclusion

In this post, we briefly learned what temperature is in the context of large language models and how it controls output randomness by rescaling token probability distributions. We compared temperature values side by side, demonstrated repeatable output at T=0.0 for code generation, explored high-temperature diversity for creative tasks, combined temperature with top_p and top_k sampling, and built a practical reference table mapping tasks to recommended ranges. Temperature is one of the cheapest and highest-impact knobs in applied LLM development, and understanding it well leads to more predictable, better-tuned systems.
