In this post, we'll briefly learn what temperature is in the context of large language models, how it controls the randomness of generated text, and how to set it correctly for different tasks in Python. The tutorial covers:
- What is Temperature?
- How Temperature Works
- Installation and Setup
- Comparing Temperature Values Side by Side
- Low Temperature for Factual and Structured Tasks
- High Temperature for Creative Tasks
- Temperature and Top-p Sampling
- Choosing the Right Temperature
- Conclusion
Let's get started.
What is Temperature?
Temperature is a single floating-point hyperparameter, typically between 0.0 and 2.0,
that controls how randomly an LLM picks its next token during text generation.
A low temperature makes the model cautious and often repetitive, strongly favouring the
most probable next word. A high temperature makes it adventurous and unpredictable,
giving unlikely words a real chance of being chosen. Setting it correctly is one of
the most practical levers available to any developer working with language models.
The name comes from statistical thermodynamics: in physics, higher temperature means particles move more randomly. In LLMs, higher temperature means the probability distribution over the next token is flattened — more tokens become plausible, so the model explores further from its default, most-likely answer.
How Temperature Works
At each generation step, the model computes a raw score — called a logit — for every token in its vocabulary. These logits are converted into probabilities using the softmax function. Temperature is applied by dividing every logit by the temperature value T before the softmax is computed.
When T < 1.0 the logits are made larger in magnitude, which sharpens the softmax distribution: the highest-probability token dominates and others are suppressed. When T > 1.0 the logits are made smaller, which flattens the distribution: many tokens share similar probabilities and the model samples more freely. T = 1.0 leaves the distribution unchanged. T = 0.0 cannot be used as a divisor, so implementations treat it as a special case and always pick the single token with the highest logit (greedy decoding).
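To make the rescaling concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The four logits are made-up values for illustration, not taken from any real model, and the function name `softmax_with_temperature` is our own.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # T = 0 cannot be used as a divisor, so treat it as greedy decoding:
        # all probability mass goes to the token with the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary (illustrative only).
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Running it should show the top token's share shrinking as T grows, while the rarest token's share rises.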
| Temperature | Distribution Shape | Output Character | Best For |
|---|---|---|---|
| 0.0 | Greedy — one winner | Fully deterministic | JSON extraction, classification |
| 0.1 – 0.4 | Sharp — top tokens dominate | Focused, predictable | Factual QA, code generation |
| 0.5 – 0.8 | Balanced | Coherent with mild variety | General chat, summaries |
| 0.9 – 1.2 | Flat — many tokens compete | Creative, varied | Story writing, brainstorming |
| 1.3 – 2.0 | Very flat — rare tokens likely | Highly random, may be incoherent | Experimental / artistic use |
Installation and Setup
All examples in this tutorial use Ollama with the llama3.2 model running locally.
Install Ollama from ollama.com, pull the model, and install the Python client.
The temperature parameter works the same way across Ollama, OpenAI, Anthropic, and
other providers, although the accepted range and the default value vary by provider,
so check the relevant API documentation. Otherwise only the client setup differs.
# Terminal: pull the model once, then install the Python client
ollama pull llama3.2
pip install ollama
import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    """Send a prompt at a given temperature and return the reply."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]
We define a reusable generate() helper that accepts a prompt, a temperature value, and
an optional system prompt. All sections below call this helper so we can isolate the
effect of temperature cleanly.
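When comparing temperatures it can help to validate the value and pin a random seed so that sampled runs are repeatable. The sketch below builds the options dictionary separately from the network call. The helper name `build_options` is hypothetical, the 0.0 to 2.0 bound is this tutorial's working range rather than a limit enforced by Ollama, and `seed` is passed through as an Ollama option.

```python
from typing import Optional

def build_options(temperature: float, seed: Optional[int] = None) -> dict:
    """Validate sampling settings and return an options dict for ollama.chat()."""
    # Assumption: 0.0-2.0 is the range used in this tutorial, not a hard API limit.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be between 0.0 and 2.0, got {temperature}")
    options = {"temperature": temperature}
    if seed is not None:
        # A fixed seed makes sampled (T > 0) runs easier to reproduce.
        options["seed"] = seed
    return options

print(build_options(0.7, seed=42))
```

The result can be passed as `options=build_options(0.7, seed=42)` in place of the inline dictionary inside generate().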
Comparing Temperature Values Side by Side
The clearest way to understand temperature is to run the exact same prompt at several different values and compare the outputs. The prompt below asks the model to complete a sentence — a task sensitive enough to show clear differences across the temperature range.
import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

prompt = "Continue this sentence in one line: The future of artificial intelligence is"
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]

for t in temperatures:
    reply = generate(prompt, temperature=t)
    print(f"[T={t}] {reply.strip()}")
Output:
[T=0.0] ...a rapidly evolving field that will transform every industry on Earth.
[T=0.3] ...a rapidly evolving landscape that will reshape how we work and live.
[T=0.7] ...both thrilling and uncertain, promising breakthroughs we can barely imagine.
[T=1.0] ...a canvas painted by human curiosity, machine ingenuity, and ethical courage.
[T=1.5] ...an unfolding tapestry of chaotic brilliance stitched between wonder and dread.
At T=0.0 and T=0.3 the completions are safe and conventional. By T=1.0 the language
becomes more figurative, and at T=1.5 it edges toward the poetic, or occasionally the
incoherent, depending on the run.
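One rough way to put a number on this drift is to measure word overlap between each completion and the T=0.0 baseline. The sketch below uses Jaccard similarity on word sets, with the completions copied from the sample output above. It is a crude metric chosen for illustration, and the helper names are our own.

```python
import re

def word_set(text):
    """Lower-case a string and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings (0.0 to 1.0)."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# Completions copied from the sample output above.
completions = {
    0.0: "a rapidly evolving field that will transform every industry on Earth.",
    0.3: "a rapidly evolving landscape that will reshape how we work and live.",
    0.7: "both thrilling and uncertain, promising breakthroughs we can barely imagine.",
    1.0: "a canvas painted by human curiosity, machine ingenuity, and ethical courage.",
    1.5: "an unfolding tapestry of chaotic brilliance stitched between wonder and dread.",
}

baseline = completions[0.0]
for t, text in completions.items():
    print(f"T={t}: overlap with T=0.0 = {jaccard(baseline, text):.2f}")
```

With only one sample per temperature the scores are noisy, so treat them as a sanity check rather than a measurement.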
Low Temperature for Factual and Structured Tasks
Tasks that require accuracy, consistency, or machine-parseable output should generally
use a low temperature, ideally 0.0 to 0.2. This includes JSON extraction, code
generation, classification, maths reasoning, and any task where a wrong but creative
answer is worse than a right but boring one. Below we run the same code generation
prompt five times at T=0.0 and five times at T=1.2 to illustrate the stability
difference.
import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

prompt = (
    "Write a Python one-liner that reads a CSV file called data.csv "
    "into a pandas DataFrame. Return only the code, nothing else."
)

print("=== T=0.0 (deterministic) — run 5 times ===")
for i in range(5):
    print(f" Run {i+1}: {generate(prompt, temperature=0.0).strip()}")
print()
print("=== T=1.2 (high randomness) — run 5 times ===")
for i in range(5):
    print(f" Run {i+1}: {generate(prompt, temperature=1.2).strip()}")
Output:
=== T=0.0 (deterministic) — run 5 times ===
Run 1: df = pd.read_csv("data.csv")
Run 2: df = pd.read_csv("data.csv")
Run 3: df = pd.read_csv("data.csv")
Run 4: df = pd.read_csv("data.csv")
Run 5: df = pd.read_csv("data.csv")
=== T=1.2 (high randomness) — run 5 times ===
Run 1: df = pd.read_csv("data.csv")
Run 2: import pandas as pd; df = pd.read_csv('data.csv')
Run 3: df = pd.read_csv("data.csv", sep=",")
Run 4: data = pd.read_csv("data.csv"); print(data.head())
Run 5: df=pd.read_csv('data.csv')
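At T=0.0 every run here produces the identical answer, which is critical for
unit-testable or pipeline-integrated code generation. Greedy decoding removes the
sampling randomness, though some backends can still show small run-to-run differences,
so verify this on your own setup. At T=1.2 the answers are all technically correct but
vary in quoting style, variable names, and added extras, introducing unnecessary
inconsistency for a task that has one obvious right answer.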
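The stability difference can be summarised with a small counting helper. The sketch below is an illustration with a hypothetical function name, `consistency_report`, applied to the run outputs printed above rather than to live model calls.

```python
from collections import Counter

def consistency_report(outputs):
    """Count how often each distinct output appears across repeated runs."""
    counts = Counter(o.strip() for o in outputs)
    most_common_text, most_common_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "agreement": most_common_count / len(outputs),
        "most_common": most_common_text,
    }

# Outputs copied from the runs printed above.
runs_t0 = ['df = pd.read_csv("data.csv")'] * 5
runs_t12 = [
    'df = pd.read_csv("data.csv")',
    "import pandas as pd; df = pd.read_csv('data.csv')",
    'df = pd.read_csv("data.csv", sep=",")',
    'data = pd.read_csv("data.csv"); print(data.head())',
    "df=pd.read_csv('data.csv')",
]

print("T=0.0:", consistency_report(runs_t0))
print("T=1.2:", consistency_report(runs_t12))
```

A pipeline could use the agreement ratio as a cheap regression check: if it drops below 1.0 at T=0.0, something other than temperature is introducing variation.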
High Temperature for Creative Tasks
Creative tasks such as story writing, brainstorming, marketing copy, poetry, and idea
generation benefit from higher temperature values because variety and surprise are
desirable properties. Below we generate five product taglines for the same brief at
T=0.2 and at T=1.0 to demonstrate the difference in creative diversity.
import ollama

def generate(prompt: str, temperature: float, system: str = "") -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

system = "You are a creative copywriter. Respond with a single tagline only."
prompt = "Write a catchy tagline for a smart water bottle that tracks hydration."

print("=== T=0.2 — low creativity ===")
for i in range(5):
    print(f" {i+1}. {generate(prompt, temperature=0.2, system=system).strip()}")
print()
print("=== T=1.0 — high creativity ===")
for i in range(5):
    print(f" {i+1}. {generate(prompt, temperature=1.0, system=system).strip()}")
Output:
=== T=0.2 — low creativity ===
1. Stay hydrated, stay ahead.
2. Stay hydrated, stay ahead.
3. Drink smarter, live better.
4. Stay hydrated, stay ahead.
5. Drink smarter, live better.
=== T=1.0 — high creativity ===
1. Your thirst has met its match.
2. Sip. Track. Thrive.
3. Because your body keeps the score.
4. Every drop, accounted for.
5. Hydration finally has a brain.
At low temperature the model cycles between two safe formulas. At T=1.0 every run
produces a genuinely distinct tagline, each one different in angle, rhythm, and
vocabulary. For a brainstorming session where you want ten distinct ideas to
evaluate, higher temperature is clearly the better choice.
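For a brainstorming loop, one option is to keep sampling until enough distinct ideas have been collected. The sketch below assumes a hypothetical helper, `collect_unique`, that takes any zero-argument callable. To keep it runnable without a model, the demo feeds it a stand-in generator that cycles through the low-temperature taglines shown above.

```python
import itertools

def collect_unique(generate_fn, target, max_calls=50):
    """Call generate_fn() until `target` distinct replies are collected."""
    seen = set()
    ideas = []
    for _ in range(max_calls):
        reply = generate_fn().strip()
        key = reply.lower()  # treat differences in letter case as duplicates
        if key not in seen:
            seen.add(key)
            ideas.append(reply)
        if len(ideas) >= target:
            break
    return ideas

# Stand-in for a low-temperature model: it keeps repeating two taglines.
canned = itertools.cycle([
    "Stay hydrated, stay ahead.",
    "Stay hydrated, stay ahead.",
    "Drink smarter, live better.",
])
ideas = collect_unique(lambda: next(canned), target=3, max_calls=10)
print(ideas)
```

With the real helper, the callable would be something like `lambda: generate(prompt, temperature=1.0, system=system)`. The `max_calls` cap matters at low temperature, where the target may never be reached.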
Temperature and Top-p Sampling
Temperature is often used alongside top-p (also called nucleus sampling),
a complementary randomness control. While temperature rescales the entire probability
distribution, top-p sets a cumulative probability threshold p and restricts sampling to
the smallest set of tokens whose probabilities sum to at least p. With p below 1.0,
the least likely tokens are excluded even at high temperatures, although when
temperature is applied first, a flatter distribution lets more tokens into that set.
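The nucleus rule is easy to see on a toy distribution. The following sketch filters a made-up list of four probabilities; the function name `top_p_filter` and the numbers are illustrative assumptions, not part of the Ollama API.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose probabilities sum to at least p,
    then renormalise. Returns a dict of token index -> probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

# Hypothetical next-token probabilities (illustrative only).
probs = [0.5, 0.3, 0.15, 0.05]
for p in (0.5, 0.9, 1.0):
    print(f"top_p={p}: kept token indices {sorted(top_p_filter(probs, p))}")
```

At p=0.9 the first three tokens are needed to reach the threshold, so the 0.05 token is dropped; at p=1.0 nothing is filtered.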
import ollama

def generate_with_params(
    prompt: str,
    temperature: float,
    top_p: float,
    top_k: int = 40
) -> str:
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
        options={
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k
        }
    )
    return response["message"]["content"]

prompt = "Describe the feeling of solving a hard bug in one sentence."
configs = [
    {"label": "Deterministic", "temperature": 0.0, "top_p": 1.0, "top_k": 1},
    {"label": "Conservative", "temperature": 0.3, "top_p": 0.9, "top_k": 20},
    {"label": "Balanced (default)", "temperature": 0.7, "top_p": 0.95, "top_k": 40},
    {"label": "Creative", "temperature": 1.0, "top_p": 0.95, "top_k": 60},
    {"label": "Highly Random", "temperature": 1.4, "top_p": 1.0, "top_k": 100},
]

for cfg in configs:
    reply = generate_with_params(prompt, cfg["temperature"], cfg["top_p"], cfg["top_k"])
    print(f"[{cfg['label']}]")
    print(f" T={cfg['temperature']} top_p={cfg['top_p']} top_k={cfg['top_k']}")
    print(f" {reply.strip()}")
    print()
Output:
[Deterministic]
T=0.0 top_p=1.0 top_k=1
It feels like lifting a fog that has been clouding your mind for hours.
[Conservative]
T=0.3 top_p=0.9 top_k=20
It is a quiet triumph — the relief of order restored after hours of chaos.
[Balanced (default)]
T=0.7 top_p=0.95 top_k=40
Like cracking open a locked room and finally seeing sunlight pour through.
[Creative]
T=1.0 top_p=0.95 top_k=60
A small, private fireworks display that nobody else in the room can see.
[Highly Random]
T=1.4 top_p=1.0 top_k=100
Equal parts euphoria and embarrassment — why did it take you this long?
The top_k parameter further restricts sampling to the k most probable tokens at each
step. A balanced preset of temperature=0.7, top_p=0.95, top_k=40 is a reasonable
starting point; adjust from there depending on the task.
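To see how the three controls interact, the sketch below samples a single token index by applying temperature, then top-k, then top-p. The order of the steps and the decision not to renormalise between them are simplifying assumptions for illustration; real inference engines may order or combine these filters differently. The logits are made up.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.95, rng=None):
    """Sample one token index using temperature, then top-k, then top-p."""
    rng = rng or random.Random()
    if temperature <= 0 or top_k == 1:
        # Greedy decoding: return the index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature: rescale the logits, then apply softmax.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    ranked = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # 2. Top-k: keep only the k most probable tokens (skipped if top_k <= 0).
    if top_k > 0:
        ranked = ranked[:top_k]

    # 3. Top-p: keep the smallest prefix whose probabilities reach top_p.
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # random.choices normalises the weights itself, so no division is needed.
    indices = [index for index, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(indices, weights=weights, k=1)[0]

# Hypothetical logits for a four-token vocabulary (illustrative only).
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
draws = [sample_next_token(logits, 0.7, 40, 0.95, rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With these logits and the balanced preset, the least likely token (index 3) falls outside the nucleus and is never drawn, even though top_k=40 alone would have allowed it.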
Choosing the Right Temperature
Selecting the right temperature is not guesswork — it follows directly from the nature of the task. The decision comes down to one question: is there a correct answer, or is variety the goal? The table below maps common LLM use cases to their recommended temperature ranges as practical starting points.
| Use Case | Recommended Temperature | Reason |
|---|---|---|
| JSON / data extraction | 0.0 | Must be deterministic and parseable |
| Code generation | 0.1 – 0.2 | Correct syntax matters more than novelty |
| Factual Q&A / RAG | 0.1 – 0.3 | Accuracy over creativity |
| Summarisation | 0.3 – 0.5 | Faithful to source with mild variety |
| General chat / assistants | 0.5 – 0.8 | Natural, engaging, not robotic |
| Marketing copy / taglines | 0.8 – 1.0 | Variety and freshness are desirable |
| Story / creative writing | 0.9 – 1.2 | Imagination and surprise are the goal |
| Brainstorming / ideation | 1.0 – 1.4 | Maximise idea diversity |
These are starting points, not fixed rules. Always test a few values on a representative
sample of your actual prompts and evaluate the outputs against your quality criteria
before settling on a final value. Different models also respond differently to the same
temperature: a value that works well for llama3.2 may need tuning when switching to
Mistral or GPT-4o.
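The table can also be kept in code as a small lookup, which keeps the chosen defaults in one place. The sketch below copies the ranges from the table and returns the midpoint as a starting value. The dictionary keys and the function name `starting_temperature` are hypothetical, and using the midpoint is simply one reasonable convention.

```python
# Ranges copied from the table above; the keys are illustrative names.
TEMPERATURE_RANGES = {
    "json_extraction": (0.0, 0.0),
    "code_generation": (0.1, 0.2),
    "factual_qa": (0.1, 0.3),
    "summarisation": (0.3, 0.5),
    "general_chat": (0.5, 0.8),
    "marketing_copy": (0.8, 1.0),
    "creative_writing": (0.9, 1.2),
    "brainstorming": (1.0, 1.4),
}

def starting_temperature(use_case: str) -> float:
    """Return the midpoint of the recommended range for a use case."""
    low, high = TEMPERATURE_RANGES[use_case]
    return round((low + high) / 2, 2)

for name in TEMPERATURE_RANGES:
    print(f"{name}: start at T={starting_temperature(name)}")
```

The returned value can be passed straight to generate() as the temperature argument and then tuned against your own prompts.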
Conclusion
In this post, we briefly learned what temperature is in the context of large
language models and how it controls output randomness by rescaling token probability
distributions. We compared temperature values side by side, demonstrated T=0.0
determinism for code generation, explored high-temperature diversity for creative
tasks, combined temperature with top_p and top_k sampling, and built a practical
reference table mapping tasks to recommended ranges. Temperature is one of the cheapest
and highest-impact knobs in applied LLM development, and understanding it well leads to
more predictable, better-calibrated systems.