In this post, we'll briefly learn what few-shot prompting is, how it works, and how to apply it to real-world NLP tasks to produce more accurate and consistent outputs from a large language model in Python. The tutorial covers:
- What is Few-Shot Prompting?
- How Few-Shot Prompting Works
- Installation and Setup
- Zero-Shot vs Few-Shot Comparison
- Few-Shot Text Classification
- Few-Shot Named Entity Extraction
- Few-Shot Structured JSON Output
- Few-Shot Style Transfer
- Choosing the Right Number of Examples
- Conclusion
Let's get started.
What is Few-Shot Prompting?
Few-shot prompting is a technique in which you include a small number of worked examples — typically two to eight input-output pairs — directly inside the prompt before presenting the real task to the model. By seeing concrete demonstrations of the desired behaviour, the model infers the pattern and applies it to the new input without any weight updates or fine-tuning. The examples act as an in-context specification of exactly what format, vocabulary, tone, and reasoning style you expect.
Few-shot prompting sits between two other strategies on the prompting spectrum. Zero-shot prompting gives the model only instructions, relying entirely on its pre-trained knowledge. Fine-tuning permanently updates the model weights on hundreds to thousands of labelled examples. Few-shot prompting is the practical middle ground: it costs only a few extra input tokens per request and requires no training infrastructure, yet often outperforms zero-shot on structured, domain-specific, or format-sensitive tasks.
How Few-Shot Prompting Works
In the chat-message format used by modern LLMs, few-shot examples are injected as alternating user and assistant messages placed between the system prompt and the real user query. The model sees these prior exchanges as if they are part of the conversation history and continues the pattern naturally into the next reply.
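To make the message layout concrete, here is a minimal sketch that assembles such a list by hand and prints the order of the roles. The translation task and its texts are made up for illustration; no model is called.

```python
# Hypothetical task and demonstrations, used only to show the message layout.
system_prompt = "Translate English to French. Reply with the translation only."
demonstrations = [
    ("Good morning", "Bonjour"),
    ("Thank you", "Merci"),
]
query = "See you tomorrow"

# System prompt first, then one user/assistant pair per demonstration,
# and finally the real query as the last user message.
messages = [{"role": "system", "content": system_prompt}]
for user_text, assistant_text in demonstrations:
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
messages.append({"role": "user", "content": query})

roles = [m["role"] for m in messages]
print(roles)
# ['system', 'user', 'assistant', 'user', 'assistant', 'user']
```

The final user message has no assistant reply after it, which is what prompts the model to produce the next assistant turn in the same style as the demonstrations.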
| Prompting Strategy | Examples Provided | Training Required | Best For |
|---|---|---|---|
| Zero-shot | 0 | None | Simple, well-known tasks |
| One-shot | 1 | None | Format demonstration |
| Few-shot | 2 – 8 | None | Structured, domain-specific tasks |
| Fine-tuning | Hundreds to thousands | Yes — GPU training | High-volume, specialised tasks |
The quality of few-shot examples matters as much as their quantity. Each example should be representative of the real input distribution, demonstrate the exact output format you want, and cover edge cases where zero-shot is likely to fail. Well-chosen examples are more valuable than a large number of generic ones.
Installation and Setup
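All examples in this tutorial use Ollama with the llama3.2 model running locally. Install Ollama from ollama.com, pull the model, and install the Python client. The same alternating-message pattern carries over to the OpenAI, Anthropic, and Google Gemini APIs, although the client initialisation and some details differ: Anthropic's Messages API, for example, takes the system prompt as a separate system parameter rather than as a message, and Gemini uses the role name model instead of assistant.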
# Terminal: run these two commands once
# ollama pull llama3.2
# pip install ollama

import ollama

def chat(
    system: str,
    examples: list[tuple[str, str]],
    user_input: str,
    temperature: float = 0.0,
    max_tokens: int = 200
) -> str:
    """
    Build a few-shot message list and return the model reply.

    Parameters
    ----------
    system      : The system prompt defining the task and rules.
    examples    : List of (user_input, assistant_output) demonstration pairs.
    user_input  : The real query to answer.
    temperature : Sampling temperature (0.0 for structured tasks).
    max_tokens  : Maximum tokens to generate.
    """
    messages = [{"role": "system", "content": system}]
    for user_ex, assistant_ex in examples:
        messages.append({"role": "user", "content": user_ex})
        messages.append({"role": "assistant", "content": assistant_ex})
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "num_predict": max_tokens}
    )
    return response["message"]["content"].strip()
The helper function chat() takes a system prompt, a list of (user, assistant) example tuples, and the real query, then assembles the full message list automatically. All sections below call this helper so the focus stays on the examples rather than the API boilerplate.
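When porting the helper to another provider, the demonstration turns carry over unchanged, but it is worth checking where the system prompt goes. Anthropic's Messages API, for instance, accepts it as a top-level system field rather than as a message with a system role. The sketch below shows one way to split the message list accordingly. The function name to_anthropic_payload is hypothetical, and the code only builds a plain dictionary; it does not call any API.

```python
def to_anthropic_payload(messages: list[dict]) -> dict:
    """Split a chat-style message list into a system string plus user/assistant turns.

    Hypothetical helper for illustration; it builds a plain dict and calls no API.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return {"system": "\n".join(system_parts), "messages": turns}


# Made-up messages in the same shape that chat() assembles.
messages = [
    {"role": "system", "content": "Reply with the label only."},
    {"role": "user", "content": "Great product!"},
    {"role": "assistant", "content": "Positive"},
    {"role": "user", "content": "Broke after a day."},
]

payload = to_anthropic_payload(messages)
print(payload["system"])
print([turn["role"] for turn in payload["messages"]])
```

The few-shot pairs themselves need no changes; only the system prompt moves out of the list.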
Zero-Shot vs Few-Shot Comparison
The clearest way to appreciate few-shot prompting is to run the same task with zero examples and then with a handful of examples and compare the outputs side by side. We use a product review tone classifier that must return one of three labels: Positive, Negative, or Mixed. The zero-shot version gives only instructions; the few-shot version also gives three labelled examples.
import ollama

def chat(system, examples, user_input, temperature=0.0, max_tokens=50):
    messages = [{"role": "system", "content": system}]
    for user_ex, assistant_ex in examples:
        messages.append({"role": "user", "content": user_ex})
        messages.append({"role": "assistant", "content": assistant_ex})
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "num_predict": max_tokens}
    )
    return response["message"]["content"].strip()

system = (
    "Classify the tone of a product review as exactly one of: "
    "Positive, Negative, or Mixed. Reply with the label only."
)

examples = [
    ("The battery life is incredible but the screen cracked after a week.",
     "Mixed"),
    ("Absolutely love this product — fast shipping and great quality!",
     "Positive"),
    ("Stopped working after two days. Complete waste of money.",
     "Negative"),
]

reviews = [
    "Decent product overall but the instructions were confusing.",
    "Best purchase I have made this year — highly recommend!",
    "Arrived damaged and customer support never replied.",
    "Good value for the price, though delivery took longer than expected.",
]

print(f"{'Review':<55} {'Zero-Shot':<12} {'Few-Shot'}")
print("-" * 85)
for review in reviews:
    # Same system prompt both times; only the demonstrations differ.
    zero = chat(system, examples=[], user_input=review)
    few = chat(system, examples=examples, user_input=review)
    print(f"{review[:54]:<55} {zero:<12} {few}")
Output:
Review Zero-Shot Few-Shot
-------------------------------------------------------------------------------------
Decent product overall but the instructions were conf… Neutral Mixed
Best purchase I have made this year — highly recommend Positive Positive
Arrived damaged and customer support never replied. Negative Negative
Good value for the price, though delivery took longer… Neutral Mixed
The zero-shot model invents the label Neutral — which was never in the allowed set — because it relies on its general knowledge of sentiment analysis rather than the three labels defined in the system prompt. The few-shot model stays within the specified label set throughout and correctly identifies the ambivalent reviews as Mixed rather than hallucinating a fourth category.
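Because a model can still step outside the allowed label set, it can help to check every reply before it reaches downstream code. The following is a minimal sketch of such a guard; validate_label and ALLOWED_LABELS are hypothetical names introduced here, and the choice to return a fallback value (rather than raise an error) is an assumption.

```python
ALLOWED_LABELS = ("Positive", "Negative", "Mixed")


def validate_label(reply, allowed=ALLOWED_LABELS, fallback=None):
    """Return the matching allowed label, or `fallback` if the reply is off-list."""
    # Tolerate surrounding whitespace, a trailing full stop, and case differences.
    cleaned = reply.strip().rstrip(".").strip().lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return fallback


print(validate_label("Mixed"))      # Mixed
print(validate_label(" positive.")) # Positive
print(validate_label("Neutral"))    # None
```

A reply such as Neutral is flagged instead of silently propagating, so it can be retried, logged, or mapped to a default.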
Few-Shot Text Classification
Few-shot prompting excels at multi-class text classification where the label set is custom or the boundaries between classes are subtle. Below we build a support ticket classifier that routes tickets into one of four departments. Encoding a custom label set like this would traditionally call for fine-tuning on labelled tickets, but six few-shot examples are enough to teach the pattern here.
import ollama

def chat(system, examples, user_input, temperature=0.0, max_tokens=20):
    messages = [{"role": "system", "content": system}]
    for user_ex, assistant_ex in examples:
        messages.append({"role": "user", "content": user_ex})
        messages.append({"role": "assistant", "content": assistant_ex})
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "num_predict": max_tokens}
    )
    return response["message"]["content"].strip()

system = (
    "You are a support ticket router. Classify each ticket into exactly "
    "one department: Billing, Technical, Returns, or General. "
    "Reply with the department name only."
)

examples = [
    ("I was charged twice for my last order.", "Billing"),
    ("My app crashes every time I try to log in on Android.", "Technical"),
    ("I want to send back the headphones I bought last week.", "Returns"),
    ("Do you ship internationally?", "General"),
    ("The discount code SAVE20 is not working at checkout.", "Billing"),
    ("My smart speaker is not connecting to my Wi-Fi network.", "Technical"),
]

tickets = [
    "I received the wrong item and want a refund.",
    "How do I reset my account password?",
    "There is an extra charge on my invoice I don't recognise.",
    "The LED on my device stays red and never turns green.",
    "What are your store opening hours?",
    "I never received my order from three weeks ago.",
]

print(f"{'Ticket':<55} {'Department'}")
print("-" * 70)
for ticket in tickets:
    label = chat(system, examples, ticket)
    print(f"{ticket:<55} {label}")
Output:
Ticket Department
----------------------------------------------------------------------
I received the wrong item and want a refund. Returns
How do I reset my account password? Technical
There is an extra charge on my invoice I don't recog… Billing
The LED on my device stays red and never turns green. Technical
What are your store opening hours? General
I never received my order from three weeks ago. Returns
All six tickets are routed as expected with only six examples in the prompt: two each for Billing and Technical, and one each for Returns and General. The max_tokens=20 budget caps the reply length, so any explanatory text the model tries to add alongside the label is cut short.
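Before running a classifier like this, a quick check that every department appears in the demonstrations can catch gaps in the example set. The sketch below counts demonstrations per label using the six pairs from this section; label_coverage is a hypothetical helper name.

```python
from collections import Counter


def label_coverage(examples, labels):
    """Count how many demonstrations exist for each expected label."""
    counts = Counter(label for _, label in examples)
    return {label: counts.get(label, 0) for label in labels}


# The six demonstrations used by the ticket router in this section.
examples = [
    ("I was charged twice for my last order.", "Billing"),
    ("My app crashes every time I try to log in on Android.", "Technical"),
    ("I want to send back the headphones I bought last week.", "Returns"),
    ("Do you ship internationally?", "General"),
    ("The discount code SAVE20 is not working at checkout.", "Billing"),
    ("My smart speaker is not connecting to my Wi-Fi network.", "Technical"),
]
departments = ["Billing", "Technical", "Returns", "General"]

coverage = label_coverage(examples, departments)
print(coverage)
# {'Billing': 2, 'Technical': 2, 'Returns': 1, 'General': 1}

missing = [label for label, n in coverage.items() if n == 0]
print(missing)
# []
```

An empty missing list means every class is demonstrated at least once; a label with a count of zero is a candidate for a new example.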
Few-Shot Named Entity Extraction
Named entity extraction with a custom schema benefits enormously from few-shot examples because the entity types and output format are often project-specific and cannot be guessed from a system prompt alone. Below we extract Company, Product, and Price entities from short retail descriptions using three demonstrations to define the exact output structure.
import ollama
import json

def chat(system, examples, user_input, temperature=0.0, max_tokens=150):
    messages = [{"role": "system", "content": system}]
    for user_ex, assistant_ex in examples:
        messages.append({"role": "user", "content": user_ex})
        messages.append({"role": "assistant", "content": assistant_ex})
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "num_predict": max_tokens}
    )
    return response["message"]["content"].strip()

system = (
    "Extract named entities from retail text. "
    "Return only valid JSON with keys: company, product, price. "
    "Use null for any entity not mentioned. No markdown, no extra text."
)

examples = [
    (
        "Apple has launched the iPhone 16 Pro for $999.",
        '{"company": "Apple", "product": "iPhone 16 Pro", "price": "$999"}'
    ),
    (
        "Samsung's new Galaxy Watch 7 is now available at $299.",
        '{"company": "Samsung", "product": "Galaxy Watch 7", "price": "$299"}'
    ),
    (
        "Sony announced the WH-1000XM6 headphones but has not revealed pricing yet.",
        '{"company": "Sony", "product": "WH-1000XM6", "price": null}'
    ),
]

sentences = [
    "Google unveiled the Pixel 9a smartphone priced at $499.",
    "Microsoft is releasing a new Surface Laptop 7 for $1,299.",
    "NVIDIA announced the RTX 5090 GPU with no price disclosed yet.",
    "The new Dyson V16 vacuum cleaner retails for $849.",
]

for sentence in sentences:
    raw = chat(system, examples, sentence)
    result = json.loads(raw)  # raises json.JSONDecodeError if the reply is not valid JSON
    print(f"Text    : {sentence}")
    print(f"Company : {result['company']}")
    print(f"Product : {result['product']}")
    print(f"Price   : {result['price']}")
    print()
Output:
Text : Google unveiled the Pixel 9a smartphone priced at $499.
Company : Google
Product : Pixel 9a
Price : $499
Text : Microsoft is releasing a new Surface Laptop 7 for $1,299.
Company : Microsoft
Product : Surface Laptop 7
Price : $1,299
Text : NVIDIA announced the RTX 5090 GPU with no price disclosed yet.
Company : NVIDIA
Product : RTX 5090
Price : None
Text : The new Dyson V16 vacuum cleaner retails for $849.
Company : Dyson
Product : V16
Price : $849
The model handles all four sentences correctly, including the case where no price is mentioned: it returns null exactly as demonstrated in the third example, which json.loads() converts to Python's None. Every response parses cleanly with json.loads() with zero post-processing.
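Clean parsing is not guaranteed on every run or with every model, so a production pipeline may want a defensive wrapper around json.loads(). The sketch below is one possible approach, not part of the original script: parse_json_reply is a hypothetical helper that strips a surrounding markdown code fence if the model adds one, and returns None when the reply still cannot be parsed.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this listing readable


def parse_json_reply(raw):
    """Parse a model reply as JSON, tolerating a surrounding markdown code fence.

    Returns None when the text is not valid JSON (note that a bare JSON null
    would also come back as None, which is acceptable for object-only schemas).
    """
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence and any language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]         # drop the closing fence
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


clean = '{"company": "Sony", "product": "WH-1000XM6", "price": null}'
fenced = FENCE + "json\n" + clean + "\n" + FENCE

print(parse_json_reply(clean))
print(parse_json_reply(fenced))
print(parse_json_reply("Sure! Here is the JSON you asked for."))
```

Returning None on failure lets the caller decide whether to retry the request, skip the record, or log the raw reply for inspection.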
Few-Shot Structured JSON Output
When a pipeline requires multi-field structured output, few-shot examples are the most reliable way to lock in the exact schema. Below we build a job-posting parser that extracts six fields from free-form text. The three examples teach the model the full schema including the correct handling of list fields and missing values.
import ollama
import json

def chat(system, examples, user_input, temperature=0.0, max_tokens=250):
    messages = [{"role": "system", "content": system}]
    for user_ex, assistant_ex in examples:
        messages.append({"role": "user", "content": user_ex})
        messages.append({"role": "assistant", "content": assistant_ex})
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "num_predict": max_tokens}
    )
    return response["message"]["content"].strip()

system = (
    "Parse job postings into structured JSON. "
    "Keys: title, company, location, salary, skills (list), remote (bool). "
    "Use null for missing fields. Return JSON only — no markdown, no preamble."
)

# ensure_ascii=False keeps characters such as "€" and "–" literal; the default
# would write \uXXXX escapes into the demonstrations and the printed output.
examples = [
    (
        "DataTech is hiring a Senior Python Developer in Berlin. "
        "Salary: €80,000–€100,000. Must know Python, FastAPI, and PostgreSQL. "
        "Fully remote.",
        json.dumps({
            "title": "Senior Python Developer",
            "company": "DataTech",
            "location": "Berlin",
            "salary": "€80,000–€100,000",
            "skills": ["Python", "FastAPI", "PostgreSQL"],
            "remote": True
        }, ensure_ascii=False)
    ),
    (
        "QuantumAI is looking for a Machine Learning Engineer in San Francisco. "
        "Compensation: $130,000–$160,000/year. "
        "Required: PyTorch, Python, MLOps, Kubernetes. On-site only.",
        json.dumps({
            "title": "Machine Learning Engineer",
            "company": "QuantumAI",
            "location": "San Francisco",
            "salary": "$130,000–$160,000/year",
            "skills": ["PyTorch", "Python", "MLOps", "Kubernetes"],
            "remote": False
        }, ensure_ascii=False)
    ),
    (
        "CloudBase needs a DevOps Engineer. Experience with Docker, Terraform, "
        "and AWS required. Salary not disclosed. Location: London. Hybrid.",
        json.dumps({
            "title": "DevOps Engineer",
            "company": "CloudBase",
            "location": "London",
            "salary": None,
            "skills": ["Docker", "Terraform", "AWS"],
            "remote": False
        }, ensure_ascii=False)
    ),
]

postings = [
    "NeuralWorks is hiring a Data Scientist in Amsterdam. Salary: €70,000. "
    "Skills needed: Python, scikit-learn, SQL, Tableau. Remote-friendly.",
    "StreamFlow seeks a Backend Engineer in Toronto. Pay: CAD 110,000–130,000. "
    "Must have Go, gRPC, Redis, and PostgreSQL experience. Fully remote.",
]

for posting in postings:
    raw = chat(system, examples, posting)
    result = json.loads(raw)
    print(json.dumps(result, indent=2, ensure_ascii=False))
    print()
Output:
{
"title": "Data Scientist",
"company": "NeuralWorks",
"location": "Amsterdam",
"salary": "€70,000",
"skills": ["Python", "scikit-learn", "SQL", "Tableau"],
"remote": true
}
{
"title": "Backend Engineer",
"company": "StreamFlow",
"location": "Toronto",
"salary": "CAD 110,000–130,000",
"skills": ["Go", "gRPC", "Redis", "PostgreSQL"],
"remote": true
}
Both postings parse into the exact six-field schema with correct types (skills as a list, remote as a boolean), inferred directly from the three examples. The demonstrations already cover two currencies, an on-site role, and a missing salary, so a fourth example is best spent on a case they do not show, such as a posting with no stated location or no listed skills.
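Since downstream code depends on the field types, a lightweight check on each parsed record can surface schema drift early. The sketch below is an illustrative assumption rather than part of the original pipeline: EXPECTED_TYPES and schema_problems are hypothetical names, and None is accepted for any field because the system prompt allows null for missing values.

```python
# Expected Python type for each key of the job-posting schema.
EXPECTED_TYPES = {
    "title": str,
    "company": str,
    "location": str,
    "salary": str,
    "skills": list,
    "remote": bool,
}


def schema_problems(record):
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in record:
            problems.append(f"missing key: {key}")
            continue
        value = record[key]
        # None is allowed everywhere because the prompt permits null fields.
        if value is not None and not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for key in sorted(set(record) - set(EXPECTED_TYPES)):
        problems.append(f"unexpected key: {key}")
    return problems


good = {
    "title": "Data Scientist", "company": "NeuralWorks", "location": "Amsterdam",
    "salary": None, "skills": ["Python", "SQL"], "remote": True,
}
bad = {
    "title": "Backend Engineer", "company": "StreamFlow", "location": "Toronto",
    "skills": "Go, gRPC", "remote": "yes", "seniority": "mid",
}

print(schema_problems(good))
print(schema_problems(bad))
```

A record that fails the check is often a sign that the demonstrations do not cover that input shape yet, which makes it a natural source for the next few-shot example.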
Few-Shot Style Transfer
Style transfer — rewriting text to match a specific tone, voice, or format — is one of the most creative applications of few-shot prompting. Instead of describing the target style in abstract terms, examples demonstrate it precisely. Below we rewrite plain technical sentences into an engaging blog voice using three demonstrations to define the style.
import ollama

def chat(system, examples, user_input, temperature=0.7, max_tokens=120):
    messages = [{"role": "system", "content": system}]
    for user_ex, assistant_ex in examples:
        messages.append({"role": "user", "content": user_ex})
        messages.append({"role": "assistant", "content": assistant_ex})
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(
        model="llama3.2",
        messages=messages,
        options={"temperature": temperature, "num_predict": max_tokens}
    )
    return response["message"]["content"].strip()

system = (
    "Rewrite the given technical sentence into an engaging, conversational "
    "blog style. Keep it to one or two sentences. Preserve the facts exactly."
)

examples = [
    (
        "Gradient descent is an optimisation algorithm that minimises a "
        "loss function by iteratively updating model parameters.",
        "Think of gradient descent as a hiker lost in the fog — it can't see "
        "the full mountain, but it always takes one careful step downhill until "
        "it finds the valley."
    ),
    (
        "A convolutional neural network applies learnable filters to input "
        "images to extract spatial features hierarchically.",
        "CNNs are basically very enthusiastic pattern detectors — they scan "
        "your photo at every scale, hunting for edges, then shapes, then "
        "full objects, layer by layer."
    ),
    (
        "Tokenisation splits raw text into sub-word units that a language "
        "model processes as discrete numerical inputs.",
        "Before an LLM reads a single word, it runs text through a blender — "
        "chopping sentences into bite-sized token pieces that numbers can "
        "actually describe."
    ),
]

sentences = [
    "Dropout randomly deactivates neurons during training to prevent overfitting.",
    "Transformers use self-attention to weigh the relevance of every token "
    "against every other token in a sequence.",
    "Retrieval-augmented generation combines a language model with an external "
    "knowledge base to produce grounded, factual responses.",
]

for sentence in sentences:
    # temperature defaults to 0.7 here to allow some creative variation.
    rewrite = chat(system, examples, sentence)
    print(f"Original : {sentence}")
    print(f"Blog     : {rewrite}")
    print()
Output:
Original : Dropout randomly deactivates neurons during training to prevent overfitting.
Blog : Dropout is the neural network equivalent of a study group where half
the members randomly call in sick — and somehow the team performs
better for it.
Original : Transformers use self-attention to weigh the relevance of every token
against every other token in a sequence.
Blog : Imagine every word in a sentence turning to face every other word and
asking, "How much should I care about you right now?" — that is
self-attention in one question.
Original : Retrieval-augmented generation combines a language model with an
external knowledge base to produce grounded, factual responses.
Blog : RAG is what happens when you give an LLM a library card — instead
of guessing, it looks things up first and then writes from what
it actually found.
The model has inferred the target style from the examples (concrete analogies, conversational dashes, and a slightly playful voice) and applies it consistently to all three new sentences. Without the examples, the same system prompt would likely produce competent but generic rewrites. The examples supply a voice fingerprint that is hard to capture in a written description alone.
Choosing the Right Number of Examples
More examples are not always better. Each demonstration consumes input tokens, adds latency, and increases cost — and returns diminish quickly beyond a small number of well-chosen examples. The right count depends on the task complexity, the desired output format, and the model size. The table below provides practical starting points.
| Task Type | Recommended Examples | Reason |
|---|---|---|
| Binary classification | 1 – 2 per class | Simple decision boundary; one of each label suffices |
| Multi-class classification | 1 – 2 per class | Cover every label in the schema at least once |
| JSON / structured extraction | 2 – 4 | Demonstrate schema including null / edge cases |
| Style transfer / rewriting | 3 – 5 | More examples establish the voice fingerprint better |
| Complex reasoning / chain-of-thought | 3 – 8 | Each step of reasoning must be modelled explicitly |
| Code generation | 2 – 4 | Show expected style, naming, and docstring conventions |
As a practical rule, start with two to three examples and add more only if the model still produces an incorrect format or incorrect labels. Always vary the examples to cover different surface forms of the input: a classifier prompted with three near-identical positives will tend to fail on anything that looks slightly different. If accuracy plateaus below your threshold after six to eight examples, consider fine-tuning instead.
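The "start small and add more only if needed" rule can be turned into a simple loop over a small labelled development set. The sketch below is one way to do it under stated assumptions: smallest_sufficient_prompt is a hypothetical helper, the 0.9 accuracy target is arbitrary, and fake_classify is a stand-in that only knows labels it has seen in the examples. In practice the classify argument would wrap a call to the chat() helper.

```python
def smallest_sufficient_prompt(classify, pool, dev_set, target=0.9, start=2):
    """Grow the example list until accuracy on dev_set reaches `target`.

    classify(examples, text) -> label is supplied by the caller.
    Returns (number_of_examples, accuracy), or (None, last_accuracy) if the
    target is never reached with the available pool.
    """
    accuracy = 0.0
    for k in range(start, len(pool) + 1):
        examples = pool[:k]
        hits = sum(classify(examples, text) == gold for text, gold in dev_set)
        accuracy = hits / len(dev_set)
        if accuracy >= target:
            return k, accuracy
    return None, accuracy


# Stand-in for a real model call: it answers correctly only for labels that
# appear in the demonstrations, and otherwise falls back to "General".
GOLD = {"refund please": "Returns", "charged twice": "Billing", "app crashes": "Technical"}

def fake_classify(examples, text):
    known = {label for _, label in examples}
    return GOLD[text] if GOLD[text] in known else "General"


pool = [("a", "Billing"), ("b", "Technical"), ("c", "Returns"), ("d", "General")]
dev_set = list(GOLD.items())

result = smallest_sufficient_prompt(fake_classify, pool, dev_set)
print(result)
# (3, 1.0)
```

With two examples the stand-in misses the Returns ticket, so the loop adds a third and stops there. The order of the pool matters, so placing the most representative demonstrations first keeps the final prompt short.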
Conclusion
In this post, we briefly learned what few-shot prompting is and how it improves LLM output consistency and accuracy by including labelled examples directly inside the prompt. We compared zero-shot and few-shot performance on a tone classifier, built a support ticket router, extracted custom named entities into JSON, parsed multi-field job postings into a structured schema, and applied few-shot style transfer to rewrite technical sentences into a conversational blog voice. Few-shot prompting is one of the most cost-effective techniques in the prompt-engineering toolkit: it requires no training data pipeline, no GPU time, and no model changes, yet on many structured tasks it can come close to what a fine-tuned model delivers.