In this post, we'll briefly learn what max tokens means in the context of large language models, how it controls the length of generated responses, and how to set it effectively for different tasks in Python. The tutorial covers:
- What are Max Tokens?
- How Tokens are Counted
- Installation and Setup
- Setting Max Tokens for Short Responses
- Setting Max Tokens for Long Responses
- Detecting a Truncated Response
- Max Tokens for Structured Output Control
- Estimating Token Count Before Sending
- Choosing the Right Max Tokens Value
- Conclusion
Let's get started.
What are Max Tokens?
Max tokens is a parameter that sets a hard upper limit on the number of tokens an LLM is allowed to produce in a single response. Once the model has generated that many tokens it stops immediately, even if the response is grammatically incomplete. It does not affect what the model wants to say — only how much of it you are willing to receive.
Controlling response length matters for several practical reasons. In production systems it caps latency and API cost, since you pay per output token on most cloud providers. In local deployments it bounds memory usage and inference time. In user interfaces it prevents walls of text from overwhelming a chat window. And in structured pipelines it enforces compact, machine-parseable output that fits cleanly inside a downstream prompt or database field.
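Because output tokens are billed individually, the limit also gives a worst-case output cost per call. The sketch below shows that arithmetic. The price used is a made-up figure for illustration, not any provider's actual rate, and the function name is hypothetical.

```python
def worst_case_output_cost(max_tokens: int, usd_per_million_output_tokens: float) -> float:
    """Upper bound on the output cost of one call: the reply cannot exceed max_tokens."""
    return max_tokens * usd_per_million_output_tokens / 1_000_000


# Hypothetical price, for illustration only; substitute your provider's rate.
PRICE_PER_MILLION = 0.60

for limit in (50, 400, 4000):
    cost = worst_case_output_cost(limit, PRICE_PER_MILLION)
    print(f"max_tokens={limit:>5}  worst-case output cost = ${cost:.6f}")
```

The real cost is usually lower, since many replies stop naturally before reaching the limit.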
How Tokens are Counted
A token is not the same as a word. LLMs use a sub-word tokeniser — most commonly Byte-Pair Encoding (BPE) — that splits text into chunks based on frequency in the training corpus. Common short words like "the" or "is" are typically one token. Longer or rarer words are split into multiple tokens. Punctuation and whitespace also consume tokens. As a rough rule of thumb, one token ≈ 0.75 English words, so 100 tokens ≈ 75 words and 1 000 tokens ≈ 750 words.
| Max Tokens | Approx. Words | Typical Output Length | Good For |
|---|---|---|---|
| 20 – 50 | 15 – 37 | One sentence | Labels, tags, short answers |
| 50 – 150 | 37 – 112 | One short paragraph | Summaries, definitions |
| 150 – 400 | 112 – 300 | Two to four paragraphs | Chat replies, explanations |
| 400 – 1 000 | 300 – 750 | A full page of text | Blog posts, reports, code |
| 1 000 – 4 000 | 750 – 3 000 | Several pages | Long-form content, documents |
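The 0.75 words-per-token rule of thumb behind the table can be written as two small conversion helpers. This is only a rough estimate for English text; the function names are illustrative, and real counts depend on the tokeniser.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English text


def tokens_to_words(tokens: int) -> int:
    """Approximate how many English words fit in a token budget."""
    return int(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Approximate the token budget needed for a target word count."""
    # Round up so the budget errs on the generous side.
    return math.ceil(words / WORDS_PER_TOKEN)


print(tokens_to_words(100))   # 75
print(words_to_tokens(750))   # 1000
```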
Installation and Setup
All examples in this tutorial use Ollama with the llama3.2 model running locally. Install Ollama from ollama.com, pull the model, and install the Python client. We also install tiktoken for token counting in a later section.
# Terminal — pull the model once
# ollama pull llama3.2
# pip install ollama tiktoken
import ollama
def generate(
    prompt: str,
    max_tokens: int,
    temperature: float = 0.7,
    system: str = ""
) -> dict:
    """Send a prompt with a max_tokens limit and return the full response."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={
            "num_predict": max_tokens,  # Ollama's name for max_tokens
            "temperature": temperature,
        }
    )
    return response
In Ollama the parameter is called num_predict rather than max_tokens, but it serves the same purpose. When using the OpenAI Python client pointed at an OpenAI-compatible provider, the parameter is usually named max_tokens, although some newer OpenAI models expect max_completion_tokens instead, so check your provider's documentation. The helper returns the full response object so we can inspect both the text and the stop reason in each example.
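To keep the naming difference in one place, a small helper can translate a single limit into the keyword arguments each client expects. This is a minimal sketch. The backend labels "ollama" and "openai" and the function name are hypothetical, and only the two parameter names mentioned above are covered.

```python
def length_limit_kwargs(max_tokens: int, backend: str) -> dict:
    """Return the keyword arguments that carry the output limit for a backend."""
    if backend == "ollama":
        # Passed as ollama.chat(..., options={"num_predict": N})
        return {"options": {"num_predict": max_tokens}}
    if backend == "openai":
        # Passed as client.chat.completions.create(..., max_tokens=N)
        return {"max_tokens": max_tokens}
    raise ValueError(f"unknown backend: {backend!r}")


print(length_limit_kwargs(80, "ollama"))
print(length_limit_kwargs(80, "openai"))
```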
Setting Max Tokens for Short Responses
Small max-token limits are useful for classification labels, one-word answers, short definitions, and any task where a concise reply is more valuable than a thorough one. Below we send the same question at three progressively tighter limits and observe how the model adapts — and where it gets cut off.
import ollama
def generate(prompt, max_tokens, temperature=0.7, system=""):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"num_predict": max_tokens, "temperature": temperature}
    )
    return response

prompt = "Explain what machine learning is."
for limit in [10, 30, 80]:
    resp = generate(prompt, max_tokens=limit, temperature=0.3)
    text = resp["message"]["content"]
    words = len(text.split())
    print(f"── max_tokens={limit} ──")
    print(f" {text.strip()}")
    print(f" (≈ {words} words)")
    print()
Output:
── max_tokens=10 ──
Machine learning (ML) is a subset of artificial
(≈ 8 words)
── max_tokens=30 ──
Machine learning (ML) is a subset of artificial intelligence (AI) that involves training
algorithms to learn from data, identify patterns, and make predictions or
(≈ 24 words)
── max_tokens=80 ──
Machine learning (ML) is a subset of artificial intelligence (AI) that involves training
algorithms to learn from data and make predictions or decisions without being explicitly
programmed. In traditional programming, a computer program is written to perform a specific
task by following a set of rules and instructions. In contrast, machine learning allows
computers to automatically improve their performance on a task by analyzing large amounts
of data, identifying patterns,
(≈ 70 words)
At max_tokens=10 the response is cut mid-sentence: the model had more to say but was stopped by the hard limit. At 30 tokens the reply is longer but still stops before the first sentence is finished. At 80 tokens the model completes two full sentences and is cut off partway through the third. Choosing the right limit means knowing roughly how many tokens a satisfactory answer requires for your specific task.
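When a reply is cut mid-sentence and a retry is not worth the cost, one option is to drop the dangling fragment before showing the text. The helper below is a simple heuristic sketch, not part of any library. It treats ".", "!" or "?" followed by whitespace as a sentence end, so abbreviations such as "et al." can fool it.

```python
import re


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing partial sentence; return '' if no sentence was completed."""
    # Sentence enders that are followed by whitespace or the end of the text.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return ""
    return text[: matches[-1].end()]


truncated = (
    "Machine learning is a subset of artificial intelligence. "
    "In contrast, machine learning allows computers to improve by identifying patterns,"
)
print(trim_to_last_sentence(truncated))
```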
Setting Max Tokens for Long Responses
For long-form tasks — detailed explanations, multi-step code, or structured reports — you need a generous max-token budget. Setting it too low silently truncates the output, often at the worst possible moment: mid-function, mid-list, or mid-conclusion. Below we ask the model to write a structured technical summary at three different budgets to show how content completeness changes with the limit.
import ollama
def generate(prompt, max_tokens, temperature=0.5, system=""):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"num_predict": max_tokens, "temperature": temperature}
    )
    return response

system = (
    "You are a technical writer. Structure your response with: "
    "Overview, Key Concepts, and Use Cases sections."
)
prompt = "Write a technical summary of transformer neural networks."
for limit in [100, 300, 600]:
    resp = generate(prompt, max_tokens=limit, system=system)
    text = resp["message"]["content"]
    print(f"══ max_tokens={limit} ══════════════════")
    print(text.strip())
    print(f"\n[Total words generated: {len(text.split())}]")
    print()
Output:
══ max_tokens=100 ══════════════════
**Transformer Neural Networks**
**Overview**
------------
Transformers are a type of neural network architecture introduced in 2017 by Vaswani et al.
in the paper "Attention is All You Need." The primary innovation of transformers is their
use of self-attention mechanisms to process sequential data, replacing traditional recurrent
neural networks (RNNs) and convolutional neural networks (CNNs). This allows for
parallelization of computations, improved performance on long-range dependencies, and
reduced memory requirements.
**Key
[Total words generated: 70]
══ max_tokens=300 ══════════════════
**Transformer Neural Networks**
Overview
--------
Transformers are a type of neural network architecture introduced in the paper
"Attention is All You Need" by Vaswani et al. in 2017. They revolutionized the field
of natural language processing (NLP) and have since been widely adopted for various
tasks, including machine translation, text classification, and question answering.
Transformers are designed to handle sequential data efficiently, such as text or
speech, by replacing traditional recurrent neural networks (RNNs) with self-attention
mechanisms. This allows the model to attend to multiple positions in the input
sequence simultaneously, rather than processing one position at a time.
Key Concepts
------------
* **Self-Attention Mechanism**: The core component of transformers, which enables
the model to weigh the importance of different input elements relative to each other.
* **Multi-Head Attention**: A technique that applies multiple attention mechanisms
in parallel, allowing the model to capture different types of relationships between
input elements simultaneously.
* **Positional Encoding**: An added layer of information that preserves the order of
input elements, enabling the model to maintain context during processing.
* **Layer Normalization**: A normalization technique used to stabilize training
and improve model performance.
Use Cases
---------
Transformers have been successfully applied to a wide range of NLP tasks, including:
1. **Machine Translation**: Transformers have achieved state-of-the-art results
in machine translation tasks, such as translating text from one language to another.
2
[Total words generated: 226]
══ max_tokens=600 ══════════════════
**Transformer Neural Networks**
The Transformer is a type of neural network architecture introduced in the paper
"Attention is All You Need" by Vaswani et al. in 2017. It has revolutionized the
field of natural language processing (NLP) and has been widely adopted in various
applications, including machine translation, text summarization, and question
answering.
**Overview**
The Transformer replaces traditional recurrent neural networks (RNNs) with
self-attention mechanisms to model complex interactions between input sequences.
The architecture consists of an encoder and a decoder, which are identical in
the original paper. The encoder takes in a sequence of tokens (e.g., words or
characters) and outputs a sequence of vectors that represent the meaning of
the input text.
**Key Concepts**
1. **Self-Attention Mechanism**: The self-attention mechanism allows the model
to weigh the importance of different input sequences relative to each other.
This is achieved by computing the dot product of query, key, and value vectors
for all pairs of input tokens.
2. **Multi-Head Attention**: The Transformer uses multi-head attention, which
allows the model to jointly attend to information from different representation
subspaces at different positions. This is done by applying multiple attention
mechanisms in parallel and combining their outputs.
3. **Positional Encoding**: Positional encoding is used to preserve the spatial
structure of input sequences during self-attention computations.
4. **Encoder-Decoder Structure**: The Transformer uses an encoder-decoder structure,
where the encoder takes in a sequence of tokens and outputs a sequence of vectors,
which are then passed through the decoder to generate the final output.
**Use Cases**
1. **Machine Translation**: The Transformer has been widely adopted for machine
translation tasks, achieving state-of-the-art results in many benchmarks.
2. **Text Summarization**: The Transformer can be used for text summarization tasks,
such as generating a summary of a long piece of text.
3. **Question Answering**: The Transformer can be used for question answering tasks,
such as identifying the answer to a given question based on a passage of text.
4. **Language Modeling**: The Transformer has been used for language modeling tasks,
such as predicting the next word in a sequence.
**Advantages**
1. **Parallelization**: The Transformer can be parallelized more easily than
traditional RNN architectures, making it faster and more efficient to train.
2. **Scalability**: The Transformer can handle longer input sequences than
traditional RNN architectures, making it suitable for tasks that require
processing long-range dependencies.
**Limitations**
1. **Computational Cost**: The Transformer requires significant computational
resources to train and deploy, especially for large-scale applications.
2. **Lack of Understanding**: Despite its success, the Transformer's architecture
is still not fully understood, and there is ongoing research to improve our
understanding of its inner workings.)
[Total words generated: 432]
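At 100 tokens the output stops inside the Key Concepts heading, so only the Overview is complete and Use Cases never appears. At 300 tokens all three sections appear in concise form, but the Use Cases list is cut off at its second item. At 600 tokens each requested section is fully developed, and the model even adds Advantages and Limitations sections that were not requested. The right budget depends on whether you need a scannable overview or a thorough reference.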
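A quick way to check whether a budget was large enough for a structured reply is to look for the section names the system prompt asked for. The helper below is an illustrative sketch with a hypothetical name. It only tells you that a heading appeared, not that the section under it is complete, so it complements a stop-reason check rather than replacing it.

```python
def missing_sections(text: str, required: list[str]) -> list[str]:
    """Return the required section names that do not appear in the text."""
    lowered = text.lower()
    return [name for name in required if name.lower() not in lowered]


required = ["Overview", "Key Concepts", "Use Cases"]
short_reply = "**Overview**\nTransformers are a type of neural network.\n**Key"
print(missing_sections(short_reply, required))
```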
Detecting a Truncated Response
When a response is cut off by the max-token limit, the model's stop reason changes from stop (natural end of reply) to length (hit the token budget). Checking the stop reason in code lets you detect truncation programmatically and either warn the user, retry with a higher limit, or request a continuation automatically.
import ollama
def generate_and_check(prompt: str, max_tokens: int) -> None:
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
        options={"num_predict": max_tokens, "temperature": 0.3}
    )
    text = response["message"]["content"]
    stop_reason = response.get("done_reason", "unknown")
    print(f"max_tokens : {max_tokens}")
    print(f"Stop reason : {stop_reason}")
    print(f"Output : {text.strip()}")
    if stop_reason == "length":
        print("⚠ Response was truncated — consider increasing max_tokens.")
    else:
        print("✓ Response completed naturally.")
    print()

prompt = "List the planets in our solar system in order from the Sun."
generate_and_check(prompt, max_tokens=20)
generate_and_check(prompt, max_tokens=80)
Output:
max_tokens : 20
Stop reason : length
Output : The planets in order from the Sun are: Mercury, Venus, Earth,
Mars, Jupiter,
⚠ Response was truncated — consider increasing max_tokens.
max_tokens : 80
Stop reason : stop
Output : The planets in our solar system in order from the Sun are:
Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.
✓ Response completed naturally.
In Ollama the stop reason is returned in the done_reason field of the response object. When using the OpenAI client the equivalent field is choices[0].finish_reason, which returns the same "length" or "stop" values. Building this check into your pipeline prevents silently incomplete outputs from propagating into downstream stages.
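The "retry with a higher limit" option can be sketched as a loop that doubles the budget until the stop reason is no longer "length" or a ceiling is reached. To keep the example runnable without a model, it takes the generation call as a parameter. The names generate_fn, start_tokens and ceiling are hypothetical, and fake_generate is a stand-in for a real call.

```python
from typing import Callable


def generate_until_complete(
    generate_fn: Callable[[str, int], dict],
    prompt: str,
    start_tokens: int = 64,
    ceiling: int = 1024,
) -> tuple[dict, int]:
    """Retry with a doubled limit while the stop reason is 'length'."""
    limit = start_tokens
    while True:
        response = generate_fn(prompt, limit)
        # Stop when the reply ended naturally or the ceiling has been tried.
        if response.get("done_reason") != "length" or limit >= ceiling:
            return response, limit
        limit = min(limit * 2, ceiling)


def fake_generate(prompt: str, max_tokens: int) -> dict:
    """Stand-in for a model call: pretends a full answer needs 100 tokens."""
    reason = "stop" if max_tokens >= 100 else "length"
    return {"done_reason": reason, "message": {"content": "..."}}


response, used = generate_until_complete(fake_generate, "List the planets.", start_tokens=20)
print(used, response["done_reason"])
```

With a real backend, generate_fn could be a thin wrapper around ollama.chat that maps the limit to num_predict. The ceiling matters because each retry regenerates the whole reply, so the cost adds up.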
Max Tokens for Structured Output Control
When the model is asked to return structured data (JSON, CSV, or a fixed-schema response), combining a tight max-token limit with a low temperature and an explicit system prompt produces compact, consistent output. The limit does not change what the model tries to write, but it caps how much unrequested prose can follow the structured block; the system prompt and low temperature do the work of keeping the output on schema.
import ollama
import json
system = """
You are a keyword extraction engine.
Return only a valid JSON object with this exact schema:
{"keywords": ["word1", "word2", ...]}
No markdown, no explanation, no extra keys.
"""
def extract_keywords(text: str) -> dict:
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": text}
        ],
        options={
            "num_predict": 80,  # enough tokens for JSON output
            "temperature": 0.0  # deterministic responses
        }
    )
    raw = response["message"]["content"].strip()
    return json.loads(raw)

texts = [
    "Deep learning models such as convolutional networks and transformers "
    "have revolutionised computer vision and natural language processing.",
    "Python is widely used in data science for tasks like data cleaning, "
    "visualisation, statistical modelling, and machine learning.",
    "Docker containers package applications and their dependencies together, "
    "enabling consistent deployment across development and production environments.",
]
for text in texts:
    result = extract_keywords(text)
    print(f"Text : {text[:60]}...")
    print(f"Keywords : {result['keywords']}")
    print()
Output:
Text : Deep learning models such as convolutional networks and tran...
Keywords : ['convolutional networks', 'transformers', 'computer vision',
'natural language processing']
Text : Python is widely used in data science for tasks like data cl...
Keywords : ['data science', 'data cleaning', 'visualisation', 'statistical modelling',
'machine learning', 'python']
Text : Docker containers package applications and their dependencie...
Keywords : ['Docker', 'containers', 'application', 'dependencies', 'deployment',
'environments']
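Setting num_predict=80 is generous enough for the JSON object in these examples but tight enough to cap any commentary the model might append after the closing brace. In the run above every response parses cleanly with json.loads() with no post-processing. Note that json.loads() raises an error on truncated JSON and on trailing text, so production code should still check done_reason and handle parse failures.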
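If the model does add text around the JSON, a more forgiving parser can read just the first object and ignore what follows. The sketch below uses the standard library's json.JSONDecoder.raw_decode for this. The function name is illustrative, and truncated JSON still raises an error, which is the behaviour you want when the budget was too small.

```python
import json


def parse_first_json_object(raw: str) -> dict:
    """Parse the first JSON object in raw, ignoring any text after it."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # raw_decode stops at the end of the first complete JSON value.
    obj, _end = json.JSONDecoder().raw_decode(raw[start:])
    return obj


reply = '{"keywords": ["docker", "containers"]}\nThese keywords summarise the text.'
print(parse_first_json_object(reply))
```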
Estimating Token Count Before Sending
Before sending a prompt you can estimate the number of tokens it will consume using the tiktoken library, which implements the tokenisers used by OpenAI models. Other models, including llama3.2, use their own vocabularies, so treat the count as an approximation rather than an exact figure. Knowing the input token count lets you calculate how many tokens remain in the context window for the model's reply, and adjust max_tokens dynamically.
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in a string using the tiktoken tokeniser."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_budget(
    system: str,
    user: str,
    context_window: int = 8192,
    reserved_output: int = 512
) -> dict:
    """Estimate token usage and safe max_tokens for a given prompt pair."""
    system_tokens = count_tokens(system)
    user_tokens = count_tokens(user)
    prompt_tokens = system_tokens + user_tokens
    safe_output = min(reserved_output, context_window - prompt_tokens)
    return {
        "system_tokens": system_tokens,
        "user_tokens": user_tokens,
        "prompt_tokens": prompt_tokens,
        "context_window": context_window,
        "safe_max_tokens": max(safe_output, 0),
    }

system = "You are a helpful assistant that summarises research papers."
user = """
Attention mechanisms have become an integral part of compelling sequence modelling
and transduction models in various tasks, allowing modelling of dependencies without
regard to their distance in the input or output sequences. In this work we propose
the Transformer, a model architecture eschewing recurrence and instead relying
entirely on an attention mechanism to draw global dependencies between input and
output. The Transformer allows for significantly more parallelisation and can reach
a new state of the art in translation quality.
"""
budget = estimate_budget(system, user, context_window=8192, reserved_output=512)
print(f"System prompt tokens : {budget['system_tokens']}")
print(f"User message tokens : {budget['user_tokens']}")
print(f"Total prompt tokens : {budget['prompt_tokens']}")
print(f"Context window : {budget['context_window']}")
print(f"Safe max_tokens : {budget['safe_max_tokens']}")
Output:
System prompt tokens : 11
User message tokens : 98
Total prompt tokens : 109
Context window : 8192
Safe max_tokens : 512
The estimate shows that the prompt consumes 109 tokens from an 8 192-token context window, leaving 8 083 tokens available for the reply. The safe_max_tokens is capped at the reserved_output value of 512, a sensible ceiling for a summary task. In applications that receive variable-length user messages this kind of dynamic budgeting helps prevent context overflow errors at runtime.
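The estimate_budget function clamps safe_max_tokens to zero when the prompt fills the window, which would send a request that cannot produce a useful reply. A stricter variant, sketched below, works from an already-counted prompt size and raises an error when too little room is left. The min_output threshold and the function name are assumptions for illustration.

```python
def plan_max_tokens(
    prompt_tokens: int,
    context_window: int = 8192,
    desired_output: int = 512,
    min_output: int = 64,
) -> int:
    """Pick max_tokens for a reply, or fail if the prompt leaves too little room."""
    remaining = context_window - prompt_tokens
    if remaining < min_output:
        raise ValueError(
            f"prompt uses {prompt_tokens} of {context_window} tokens; "
            f"only {remaining} left, need at least {min_output}"
        )
    return min(desired_output, remaining)


print(plan_max_tokens(109))    # short prompt: full desired budget
print(plan_max_tokens(8000))   # long prompt: budget shrinks to what is left
```

Failing early lets the caller shorten or summarise the prompt before spending a request on it.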
Choosing the Right Max Tokens Value
Selecting the right max-token limit follows directly from the task. The two guiding questions are: What is the minimum token count for a complete and useful response? and What is the maximum I can afford in terms of latency and cost? The table below maps common use cases to practical starting values.
| Use Case | Recommended Max Tokens | Notes |
|---|---|---|
| Sentiment / classification label | 5 – 15 | Single word or short phrase output |
| JSON / structured extraction | 50 – 150 | Tight budget discourages prose leakage |
| One-sentence definition | 30 – 60 | Allow a full sentence to complete |
| Short summary (one paragraph) | 100 – 200 | Enough for three to five sentences |
| Chat / conversational reply | 150 – 400 | Natural response without being verbose |
| Code generation (function) | 300 – 600 | Must fit a complete function body |
| Detailed explanation / tutorial | 500 – 1 000 | Multiple sections, examples included |
| Long-form document / report | 1 000 – 4 000 | Check context window limits first |
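The table can be carried into code as a small lookup so that each call site names its use case instead of hard-coding a number. The ranges below are copied from the table. The dictionary keys and the helper name are hypothetical labels chosen for this sketch.

```python
# (low, high) token ranges copied from the table above.
RECOMMENDED_MAX_TOKENS = {
    "classification": (5, 15),
    "json_extraction": (50, 150),
    "definition": (30, 60),
    "short_summary": (100, 200),
    "chat": (150, 400),
    "code_function": (300, 600),
    "explanation": (500, 1000),
    "long_form": (1000, 4000),
}


def starting_max_tokens(use_case: str, generous: bool = False) -> int:
    """Return the low end of the range, or the high end if generous is set."""
    low, high = RECOMMENDED_MAX_TOKENS[use_case]
    return high if generous else low


print(starting_max_tokens("chat"))
print(starting_max_tokens("chat", generous=True))
```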
As a general practice, always check the done_reason (or finish_reason) field of the response in production code. If you see "length" more than occasionally for a given task, increase your limit by 25–50 % and re-test. Silently truncated outputs are one of the most common and hardest-to-debug issues in LLM applications.
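This tuning rule can be automated from logged stop reasons. The sketch below computes the share of "length" results and, if it exceeds a tolerated rate, proposes a limit raised by 25 % (or whatever increase you pass in). The 5 % threshold is an assumption standing in for "more than occasionally", and the function name is illustrative.

```python
import math


def suggest_limit(
    done_reasons: list[str],
    current_limit: int,
    tolerated_rate: float = 0.05,
    increase: float = 0.25,
) -> int:
    """Suggest a higher max_tokens if too many logged replies were truncated."""
    if not done_reasons:
        return current_limit
    truncated = sum(1 for reason in done_reasons if reason == "length")
    rate = truncated / len(done_reasons)
    if rate <= tolerated_rate:
        return current_limit
    return math.ceil(current_limit * (1 + increase))


logged = ["stop"] * 8 + ["length"] * 2   # 20 % of replies truncated
print(suggest_limit(logged, current_limit=200))
```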
Conclusion
In this post, we briefly learned what max tokens is and how it acts as a hard ceiling on LLM response length. We explored the difference between tokens and words, compared tight versus generous token budgets for short and long tasks, built a programmatic truncation detector using the done_reason field, applied a compact budget to structured JSON extraction, estimated prompt token usage dynamically with tiktoken, and built a practical reference table mapping use cases to recommended limits. Together with temperature and Top-P / Top-K, max tokens completes the core trio of LLM output-control parameters every developer should understand.