How to Limit LLM Response Length with Max Tokens in Python

In this post, we'll briefly learn what max tokens means in the context of large language models, how it controls the length of generated responses, and how to set it effectively for different tasks in Python. The tutorial covers:

  1. What are Max Tokens?
  2. How Tokens are Counted
  3. Installation and Setup
  4. Setting Max Tokens for Short Responses
  5. Setting Max Tokens for Long Responses
  6. Detecting a Truncated Response
  7. Max Tokens for Structured Output Control
  8. Estimating Token Count Before Sending
  9. Choosing the Right Max Tokens Value
  10. Conclusion

Let's get started.

 

What are Max Tokens?

Max tokens is a parameter that sets a hard upper limit on the number of tokens an LLM is allowed to produce in a single response. Once the model has generated that many tokens it stops immediately, even if the response is grammatically incomplete. It does not affect what the model wants to say — only how much of it you are willing to receive.

Controlling response length matters for several practical reasons. In production systems it caps latency and API cost, since you pay per output token on most cloud providers. In local deployments it bounds memory usage and inference time. In user interfaces it prevents walls of text from overwhelming a chat window. And in structured pipelines it enforces compact, machine-parseable output that fits cleanly inside a downstream prompt or database field. 

 

How Tokens are Counted

A token is not the same as a word. LLMs use a sub-word tokeniser — most commonly Byte-Pair Encoding (BPE) — that splits text into chunks based on frequency in the training corpus. Common short words like "the" or "is" are typically one token. Longer or rarer words are split into multiple tokens. Punctuation and whitespace also consume tokens. As a rough rule of thumb, one token ≈ 0.75 English words, so 100 tokens ≈ 75 words and 1 000 tokens ≈ 750 words.

| Max Tokens | Approx. Words | Typical Output Length | Good For |
|---|---|---|---|
| 20 – 50 | 15 – 37 | One sentence | Labels, tags, short answers |
| 50 – 150 | 37 – 112 | One short paragraph | Summaries, definitions |
| 150 – 400 | 112 – 300 | Two to four paragraphs | Chat replies, explanations |
| 400 – 1 000 | 300 – 750 | A full page of text | Blog posts, reports, code |
| 1 000 – 4 000 | 750 – 3 000 | Several pages | Long-form content, documents |
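
The word estimates in the table follow directly from the 0.75 rule of thumb. As a rough sketch (the helper names below are illustrative and not part of any library, and the ratio only holds approximately for English text), the conversion can be written as:

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English-text ratio from the rule of thumb above


def tokens_to_words(tokens: int) -> int:
    """Approximate how many English words fit in a token budget (rounded down)."""
    return int(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Approximate the token budget needed for a target word count (rounded up)."""
    return math.ceil(words / WORDS_PER_TOKEN)


# Reproduce the boundaries used in the table.
for budget in (20, 50, 150, 400, 1000, 4000):
    print(f"{budget:>5} tokens ≈ {tokens_to_words(budget):>5} words")
```

Going the other way, words_to_tokens(100) suggests a budget of about 134 tokens for a 100-word reply. Treat both numbers as estimates; code, non-English text, and rare words usually need more tokens per word.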

 

Installation and Setup

All examples in this tutorial use Ollama with the llama3.2 model running locally. Install Ollama from ollama.com, pull the model, and install the Python client. We also install tiktoken for token counting in a later section. 


# Terminal: pull the model once
# ollama pull llama3.2

# pip install ollama tiktoken

import ollama

def generate(
    prompt: str,
    max_tokens: int,
    temperature: float = 0.7,
    system: str = ""
) -> dict:
    """Send a prompt with a max_tokens limit and return the full response."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={
            "num_predict": max_tokens,  # Ollama's name for max_tokens
            "temperature": temperature,
        }
    )
    return response


In Ollama the parameter is called num_predict rather than max_tokens, but it serves the same purpose. OpenAI-compatible chat endpoints generally call it max_tokens, although OpenAI's newer models expect max_completion_tokens instead. The helper returns the full response object so we can inspect both the text and the stop reason in each example.
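
To keep the naming difference in one place, a small mapping helper can build the right keyword arguments for each client. This is only a sketch: the function name and the "ollama" / "openai" backend labels are hypothetical, it handles positive limits only, and it does not call either library.

```python
def length_limit_kwargs(backend: str, max_tokens: int) -> dict:
    """Return the keyword arguments that cap output length for a backend.

    `backend` is a hypothetical label used only in this sketch.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    if backend == "ollama":
        # Ollama expects the limit inside the `options` dict as num_predict.
        return {"options": {"num_predict": max_tokens}}
    if backend == "openai":
        # OpenAI-style chat completion calls take max_tokens as a top-level argument.
        return {"max_tokens": max_tokens}
    raise ValueError(f"unknown backend: {backend!r}")


print(length_limit_kwargs("ollama", 80))
print(length_limit_kwargs("openai", 80))
```

The returned dict can then be unpacked into the client call with `**`, so the rest of the code does not need to know which name a given provider uses.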

 

Setting Max Tokens for Short Responses

Small max-token limits are useful for classification labels, one-word answers, short definitions, and any task where a concise reply is more valuable than a thorough one. Below we send the same question at three progressively larger limits and observe where each response gets cut off.


import ollama

def generate(prompt, max_tokens, temperature=0.7, system=""):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"num_predict": max_tokens, "temperature": temperature}
    )
    return response

prompt = "Explain what machine learning is."

for limit in [10, 30, 80]:
    resp = generate(prompt, max_tokens=limit, temperature=0.3)
    text = resp["message"]["content"]
    words = len(text.split())
    print(f"── max_tokens={limit} ──")
    print(f" {text.strip()}")
    print(f" (≈ {words} words)")
    print()

Output:


── max_tokens=10 ──
Machine learning (ML) is a subset of artificial
(≈ 8 words)

── max_tokens=30 ──
Machine learning (ML) is a subset of artificial intelligence (AI) that involves training
algorithms to learn from data, identify patterns, and make predictions or
(≈ 24 words)

── max_tokens=80 ──
Machine learning (ML) is a subset of artificial intelligence (AI) that involves training
algorithms to learn from data and make predictions or decisions without being explicitly
programmed. In traditional programming, a computer program is written to perform a specific
task by following a set of rules and instructions. In contrast, machine learning allows
computers to automatically improve their performance on a task by analyzing large amounts
of data, identifying patterns,
(≈ 70 words)


 

At max_tokens=10 the response is cut mid-sentence: the model had more to say but was stopped by the hard limit. At 30 tokens the reply covers more ground but still stops mid-sentence ("...make predictions or"). At 80 tokens there is room for a fuller explanation, yet the last sentence is again left unfinished. The model does not shorten its answer to fit the budget; it simply stops when the budget runs out. Choosing the right limit means knowing roughly how many tokens a satisfactory answer requires for your specific task.
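
When experimenting with limits like these, a quick heuristic can flag replies that probably stopped mid-sentence. The sketch below is an assumption-laden shortcut (the function name is made up for this post, and it only looks at the final punctuation mark), so it complements the stop-reason check rather than replacing it:

```python
def looks_complete(text: str) -> bool:
    """Heuristic: does the text end with sentence-final punctuation?"""
    # Ignore trailing whitespace and closing quotes, brackets, or markdown marks.
    stripped = text.rstrip().rstrip("\"')]*_`")
    return stripped.endswith((".", "!", "?"))


samples = [
    "Machine learning (ML) is a subset of artificial",
    "of data, identifying patterns,",
    "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.",
]
for sample in samples:
    print(f"{looks_complete(sample)!s:>5}  {sample}")
```

The first two samples are taken from the truncated outputs above and are reported as incomplete; the third ends with a full stop and passes.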

 

Setting Max Tokens for Long Responses

For long-form tasks — detailed explanations, multi-step code, or structured reports — you need a generous max-token budget. Setting it too low silently truncates the output, often at the worst possible moment: mid-function, mid-list, or mid-conclusion. Below we ask the model to write a structured technical summary at three different budgets to show how content completeness changes with the limit. 


import ollama

def generate(prompt, max_tokens, temperature=0.5, system=""):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"num_predict": max_tokens, "temperature": temperature}
    )
    return response

system = (
    "You are a technical writer. Structure your response with: "
    "Overview, Key Concepts, and Use Cases sections."
)
prompt = "Write a technical summary of transformer neural networks."

for limit in [100, 300, 600]:
    resp = generate(prompt, max_tokens=limit, system=system)
    text = resp["message"]["content"]
    print(f"══ max_tokens={limit} ══════════════════")
    print(text.strip())
    print(f"\n[Total words generated: {len(text.split())}]")
    print()

Output:


══ max_tokens=100 ══════════════════
**Transformer Neural Networks**

**Overview**
------------

Transformers are a type of neural network architecture introduced in 2017 by Vaswani et al.
in the paper "Attention is All You Need." The primary innovation of transformers is their
use of self-attention mechanisms to process sequential data, replacing traditional recurrent
neural networks (RNNs) and convolutional neural networks (CNNs). This allows for
parallelization of computations, improved performance on long-range dependencies, and
reduced memory requirements.

**Key

[Total words generated: 70]

══ max_tokens=300 ══════════════════
**Transformer Neural Networks**

Overview
--------

Transformers are a type of neural network architecture introduced in the paper
"Attention is All You Need" by Vaswani et al. in 2017. They revolutionized the field
of natural language processing (NLP) and have since been widely adopted for various
tasks, including machine translation, text classification, and question answering.

Transformers are designed to handle sequential data efficiently, such as text or
speech, by replacing traditional recurrent neural networks (RNNs) with self-attention
mechanisms. This allows the model to attend to multiple positions in the input
sequence simultaneously, rather than processing one position at a time.

Key Concepts
------------

* **Self-Attention Mechanism**: The core component of transformers, which enables
the model to weigh the importance of different input elements relative to each other.
* **Multi-Head Attention**: A technique that applies multiple attention mechanisms
in parallel, allowing the model to capture different types of relationships between
input elements simultaneously.
* **Positional Encoding**: An added layer of information that preserves the order of
input elements, enabling the model to maintain context during processing.
* **Layer Normalization**: A normalization technique used to stabilize training
and improve model performance.

Use Cases
---------

Transformers have been successfully applied to a wide range of NLP tasks, including:

1. **Machine Translation**: Transformers have achieved state-of-the-art results
in machine translation tasks, such as translating text from one language to another.
2

[Total words generated: 226]

══ max_tokens=600 ══════════════════
**Transformer Neural Networks**

The Transformer is a type of neural network architecture introduced in the paper
"Attention is All You Need" by Vaswani et al. in 2017. It has revolutionized the
field of natural language processing (NLP) and has been widely adopted in various
applications, including machine translation, text summarization, and question
answering.

**Overview**

The Transformer replaces traditional recurrent neural networks (RNNs) with
self-attention mechanisms to model complex interactions between input sequences.
The architecture consists of an encoder and a decoder, which are identical in
the original paper. The encoder takes in a sequence of tokens (e.g., words or
characters) and outputs a sequence of vectors that represent the meaning of
the input text.

**Key Concepts**

1. **Self-Attention Mechanism**: The self-attention mechanism allows the model
to weigh the importance of different input sequences relative to each other.
This is achieved by computing the dot product of query, key, and value vectors
for all pairs of input tokens.
2. **Multi-Head Attention**: The Transformer uses multi-head attention, which
allows the model to jointly attend to information from different representation
subspaces at different positions. This is done by applying multiple attention
mechanisms in parallel and combining their outputs.
3. **Positional Encoding**: Positional encoding is used to preserve the spatial
structure of input sequences during self-attention computations.
4. **Encoder-Decoder Structure**: The Transformer uses an encoder-decoder structure,
where the encoder takes in a sequence of tokens and outputs a sequence of vectors,
which are then passed through the decoder to generate the final output.

**Use Cases**

1. **Machine Translation**: The Transformer has been widely adopted for machine
translation tasks, achieving state-of-the-art results in many benchmarks.
2. **Text Summarization**: The Transformer can be used for text summarization tasks,
such as generating a summary of a long piece of text.
3. **Question Answering**: The Transformer can be used for question answering tasks,
such as identifying the answer to a given question based on a passage of text.
4. **Language Modeling**: The Transformer has been used for language modeling tasks,
such as predicting the next word in a sequence.

**Advantages**

1. **Parallelization**: The Transformer can be parallelized more easily than
traditional RNN architectures, making it faster and more efficient to train.
2. **Scalability**: The Transformer can handle longer input sequences than
traditional RNN architectures, making it suitable for tasks that require
processing long-range dependencies.

**Limitations**

1. **Computational Cost**: The Transformer requires significant computational
resources to train and deploy, especially for large-scale applications.
2. **Lack of Understanding**: Despite its success, the Transformer's architecture
is still not fully understood, and there is ongoing research to improve our
understanding of its inner workings.)

[Total words generated: 432]


At 100 tokens the output stops inside the Key Concepts heading, so neither Key Concepts nor Use Cases appears. At 300 tokens Overview and Key Concepts are complete, but Use Cases is cut off after its first item. At 600 tokens all three requested sections are fully developed, and the model even adds Advantages and Limitations sections. The right budget depends on whether you need a scannable overview or a thorough reference.
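
If a truncated reply still has to be shown to a user, one option is to trim the dangling fragment so the text at least ends on a full sentence. The following is a minimal sketch of that idea, not a robust sentence splitter: the function name is hypothetical, and abbreviations such as "et al." will be treated as sentence ends.

```python
import re

# A sentence terminator, optional closing quotes/brackets/markdown marks,
# followed by whitespace or the end of the text.
_SENTENCE_END = re.compile(r"""[.!?]["')\]*]*(?=\s|$)""")


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing fragment so the text ends at the last full sentence."""
    matches = list(_SENTENCE_END.finditer(text))
    for match in reversed(matches):
        line_start = text.rfind("\n", 0, match.start()) + 1
        if text[line_start:match.start()].strip().isdigit():
            continue  # "1." at the start of a line is a list marker, not a sentence end
        return text[:match.end()]
    return ""  # no complete sentence found


truncated = "Transformers use self-attention. This allows for\n\n**Key"
print(trim_to_last_sentence(truncated))
```

Trimming hides the symptom but does not recover the missing content, so it suits display purposes better than pipelines that need the full answer.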

 

Detecting a Truncated Response

When a response is cut off by the max-token limit, the model's stop reason changes from stop (natural end of reply) to length (hit the token budget). Checking the stop reason in code lets you detect truncation programmatically and either warn the user, retry with a higher limit, or request a continuation automatically. 


import ollama

def generate_and_check(prompt: str, max_tokens: int) -> None:
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
        options={"num_predict": max_tokens, "temperature": 0.3}
    )
    text = response["message"]["content"]
    stop_reason = response.get("done_reason", "unknown")

    print(f"max_tokens : {max_tokens}")
    print(f"Stop reason : {stop_reason}")
    print(f"Output : {text.strip()}")

    if stop_reason == "length":
        print("⚠ Response was truncated — consider increasing max_tokens.")
    else:
        print("✓ Response completed naturally.")
    print()

prompt = "List the planets in our solar system in order from the Sun."

generate_and_check(prompt, max_tokens=20)
generate_and_check(prompt, max_tokens=80)


Output:


max_tokens : 20
Stop reason : length
Output : The planets in order from the Sun are: Mercury, Venus, Earth,
Mars, Jupiter,
⚠ Response was truncated — consider increasing max_tokens.

max_tokens : 80
Stop reason : stop
Output : The planets in our solar system in order from the Sun are:
Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.
✓ Response completed naturally.

In Ollama the stop reason is returned in the done_reason field of the response object. When using the OpenAI client the equivalent field is choices[0].finish_reason, which uses the same "length" and "stop" values for these two cases (it can also report other reasons, such as a content filter). Building this check into your pipeline prevents silently incomplete outputs from propagating into downstream stages.
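
The "retry with a higher limit" option mentioned earlier can be sketched as a small loop. To keep the example runnable without a model, the chat call is passed in as a function; generate_with_retry, fake_chat, and the doubling strategy are all assumptions made for this illustration, not part of the Ollama client.

```python
from typing import Callable


def generate_with_retry(
    chat_fn: Callable[[str, int], dict],
    prompt: str,
    start_tokens: int = 64,
    max_limit: int = 1024,
    growth: float = 2.0,
) -> dict:
    """Call chat_fn with a growing token limit until the reply is not truncated."""
    if growth <= 1:
        raise ValueError("growth must be greater than 1")
    limit = start_tokens
    while True:
        response = chat_fn(prompt, limit)
        truncated = response.get("done_reason") == "length"
        # Stop when the reply finished naturally or the ceiling has been reached.
        if not truncated or limit >= max_limit:
            return {
                "text": response["message"]["content"],
                "limit": limit,
                "truncated": truncated,
            }
        limit = min(max(limit + 1, int(limit * growth)), max_limit)


def fake_chat(prompt: str, max_tokens: int) -> dict:
    """Stand-in for a real model call: pretends the full answer needs 100 tokens."""
    full = "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune."
    if max_tokens < 100:
        return {"message": {"content": full[:max_tokens]}, "done_reason": "length"}
    return {"message": {"content": full}, "done_reason": "stop"}


result = generate_with_retry(fake_chat, "List the planets.", start_tokens=20)
print(result["limit"], result["truncated"])
```

With a real model, chat_fn would be a thin wrapper around ollama.chat that forwards the limit as num_predict. Each retry regenerates the whole answer, so the max_limit ceiling matters for both cost and latency.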

 

Max Tokens for Structured Output Control

When the model is asked to return structured data — JSON, CSV, or a fixed-schema response — combining a tight max-token limit with a low temperature and an explicit system prompt produces compact, consistent output. A low max-token budget also discourages the model from adding unrequested prose before or after the structured block. 


import ollama
import json

system = """
You are a keyword extraction engine.
Return only a valid JSON object with this exact schema:
{"keywords": ["word1", "word2", ...]}
No markdown, no explanation, no extra keys.
"""

def extract_keywords(text: str) -> dict:
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": text}
        ],
        options={
            "num_predict": 80,   # enough tokens for JSON output
            "temperature": 0.0   # deterministic responses
        }
    )

    raw = response["message"]["content"].strip()
    return json.loads(raw)

texts = [
    "Deep learning models such as convolutional networks and transformers "
    "have revolutionised computer vision and natural language processing.",

    "Python is widely used in data science for tasks like data cleaning, "
    "visualisation, statistical modelling, and machine learning.",

    "Docker containers package applications and their dependencies together, "
    "enabling consistent deployment across development and production environments.",
]

for text in texts:
    result = extract_keywords(text)

    print(f"Text : {text[:60]}...")
    print(f"Keywords : {result['keywords']}")
    print()

Output:


Text : Deep learning models such as convolutional networks and tran...
Keywords : ['convolutional networks', 'transformers', 'computer vision',
'natural language processing']

Text : Python is widely used in data science for tasks like data cl...
Keywords : ['data science', 'data cleaning', 'visualisation', 'statistical modelling',
'machine learning', 'python']

Text : Docker containers package applications and their dependencie...
Keywords : ['Docker', 'containers', 'application', 'dependencies', 'deployment',
'environments']


 

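Setting num_predict=80 is generous enough for the JSON object but tight enough to prevent the model from appending a paragraph of commentary after the closing brace. In this run every response parsed cleanly with json.loads() with no post-processing required. If a reply ever hits the limit in the middle of the object, json.loads() will raise an error, so production code should handle that case.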
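
A tight budget also means a long keyword list could be cut off before the closing brace, which makes json.loads() raise an exception. A defensive parser along the following lines avoids crashing the pipeline; parse_keywords is a hypothetical helper written for this post, and returning None on failure is just one possible convention.

```python
import json
from typing import Optional


def parse_keywords(raw: str) -> Optional[list]:
    """Return the keyword list, or None if the reply is not the expected JSON."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None  # truncated or malformed JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    keywords = data.get("keywords")
    if not isinstance(keywords, list):
        return None  # missing or wrongly typed "keywords" field
    return keywords


print(parse_keywords('{"keywords": ["Docker", "containers"]}'))
print(parse_keywords('{"keywords": ["Docker", "contai'))
```

A None result is a natural trigger for a retry with a larger num_predict.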

 

Estimating Token Count Before Sending

Before sending a prompt you can estimate the number of tokens it will consume using the tiktoken library, which implements the tokenisers used by OpenAI models. Other models, including llama3.2, use their own tokenisers, so for them the count is only an approximation and it is wise to leave some headroom. Knowing the input token count lets you calculate how many tokens remain in the context window for the model's reply, and adjust max_tokens dynamically.


import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in a string using the tiktoken tokeniser."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_budget(
    system: str,
    user: str,
    context_window: int = 8192,
    reserved_output: int = 512
) -> dict:
    """Estimate token usage and safe max_tokens for a given prompt pair."""
    system_tokens = count_tokens(system)
    user_tokens = count_tokens(user)
    prompt_tokens = system_tokens + user_tokens
    safe_output = min(reserved_output, context_window - prompt_tokens)

    return {
        "system_tokens": system_tokens,
        "user_tokens": user_tokens,
        "prompt_tokens": prompt_tokens,
        "context_window": context_window,
        "safe_max_tokens": max(safe_output, 0),
    }

system = "You are a helpful assistant that summarises research papers."
user = """
Attention mechanisms have become an integral part of compelling sequence modelling
and transduction models in various tasks, allowing modelling of dependencies without
regard to their distance in the input or output sequences. In this work we propose
the Transformer, a model architecture eschewing recurrence and instead relying
entirely on an attention mechanism to draw global dependencies between input and
output. The Transformer allows for significantly more parallelisation and can reach
a new state of the art in translation quality.
"""

budget = estimate_budget(system, user, context_window=8192, reserved_output=512)

print(f"System prompt tokens : {budget['system_tokens']}")
print(f"User message tokens : {budget['user_tokens']}")
print(f"Total prompt tokens : {budget['prompt_tokens']}")
print(f"Context window : {budget['context_window']}")
print(f"Safe max_tokens : {budget['safe_max_tokens']}")

Output:


System prompt tokens : 11
User message tokens : 98
Total prompt tokens : 109
Context window : 8192
Safe max_tokens : 512


 

The estimate shows that the prompt consumes 109 tokens from an 8 192-token context window, leaving 8 083 tokens available for the reply. The safe_max_tokens value is capped at the reserved_output value of 512, a sensible ceiling for a summary task. In applications that receive variable-length user messages this kind of dynamic budgeting prevents context overflow errors at runtime.
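
The budgeting arithmetic becomes more interesting once the prompt grows. The sketch below isolates the same min/max logic in a standalone function (safe_max_tokens here is an illustrative helper that takes a precomputed token count, so it runs without tiktoken) and shows what happens as the prompt approaches or exceeds the window:

```python
def safe_max_tokens(
    prompt_tokens: int,
    context_window: int = 8192,
    reserved_output: int = 512,
) -> int:
    """Return the largest output budget that still fits in the context window."""
    remaining = context_window - prompt_tokens
    # Never ask for more than the reserved ceiling, and never go below zero.
    return max(min(reserved_output, remaining), 0)


for prompt_tokens in (109, 7900, 9000):
    print(f"prompt={prompt_tokens:>5}  safe_max_tokens={safe_max_tokens(prompt_tokens)}")
```

A short prompt gets the full 512-token reserve, a 7 900-token prompt leaves room for only 292 output tokens, and a 9 000-token prompt leaves none. A result of 0 signals that the prompt itself must be shortened before sending.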

 

Choosing the Right Max Tokens Value

Selecting the right max-token limit follows directly from the task. The two guiding questions are: What is the minimum token count for a complete and useful response? and What is the maximum I can afford in terms of latency and cost? The table below maps common use cases to practical starting values.

| Use Case | Recommended Max Tokens | Notes |
|---|---|---|
| Sentiment / classification label | 5 – 15 | Single word or short phrase output |
| JSON / structured extraction | 50 – 150 | Tight budget discourages prose leakage |
| One-sentence definition | 30 – 60 | Allow a full sentence to complete |
| Short summary (one paragraph) | 100 – 200 | Enough for three to five sentences |
| Chat / conversational reply | 150 – 400 | Natural response without being verbose |
| Code generation (function) | 300 – 600 | Must fit a complete function body |
| Detailed explanation / tutorial | 500 – 1 000 | Multiple sections, examples included |
| Long-form document / report | 1 000 – 4 000 | Check context window limits first |

As a general practice, always check the done_reason (or finish_reason) field of the response in production code. If you see "length" more than occasionally for a given task, increase your limit by 25–50 % and re-test. Silently truncated outputs are one of the most common and hardest-to-debug issues in LLM applications. 
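
That tuning advice can be automated by logging the stop reason of each call and reviewing the truncation rate. The helper below is a sketch under stated assumptions: suggest_limit is a made-up name, the 5 % threshold is one interpretation of "more than occasionally", and the default bump uses the lower end of the 25–50 % range.

```python
import math


def suggest_limit(
    done_reasons: list,
    current_limit: int,
    max_truncation_rate: float = 0.05,  # assumed threshold for "more than occasionally"
    bump: float = 0.25,                 # raise the limit by 25 % when exceeded
) -> int:
    """Suggest a new max-token limit based on how often replies were truncated."""
    if not done_reasons:
        return current_limit  # nothing logged yet, keep the current value
    truncated = sum(reason == "length" for reason in done_reasons)
    rate = truncated / len(done_reasons)
    if rate > max_truncation_rate:
        return math.ceil(current_limit * (1 + bump))
    return current_limit


reasons = ["stop"] * 8 + ["length"] * 2   # 20 % of replies were truncated
print(suggest_limit(reasons, 200))        # → 250
```

After raising the limit, collect a fresh batch of stop reasons and re-test rather than bumping repeatedly on the same data.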

 

Conclusion

In this post, we briefly learned what max tokens is and how it acts as a hard ceiling on LLM response length. We explored the difference between tokens and words, compared tight versus generous token budgets for short and long tasks, built a programmatic truncation detector using the done_reason field, applied a compact budget to structured JSON extraction, estimated prompt token usage dynamically with tiktoken, and built a practical reference table mapping use cases to recommended limits. Together with temperature and Top-P / Top-K, max tokens completes the core trio of LLM output-control parameters every developer should understand.
