In this post, we'll briefly learn what Ollama is, how to set it up, and how to run a local large language model (LLM) entirely on your own machine using Python. The tutorial covers:
- What is Ollama?
- Installation and Setup
- Pulling a Model
- Basic Chat Completion
- Streaming Responses
- Multi-turn Conversation
- Generating Embeddings
- Using the OpenAI-Compatible API
- Conclusion
Let's get started.
What is Ollama?
Ollama is an open-source tool that lets you download, manage, and run large language models locally on your own hardware — no internet connection, no API key, and no data leaving your machine. It bundles model weights, runtime, and a REST API server into a single, easy-to-install application available for macOS, Linux, and Windows.
Ollama supports a wide range of popular open-weight models including Llama 3, Mistral, Gemma, Phi-3, Qwen, and many others from the Ollama model library. Once a model is pulled, it runs as a local HTTP server that exposes a simple API compatible with the OpenAI specification, making it a drop-in replacement for cloud LLM APIs during development and testing.
| Feature | Ollama | Cloud LLM API |
|---|---|---|
| Privacy | Fully local, data never leaves the machine | Data sent to provider servers |
| Cost | Free (hardware cost only) | Pay-per-token pricing |
| Internet Required | Only for initial model download | Every request |
| Latency | Depends on local hardware | Network + server latency |
| Model Choice | Open-weight models only | Proprietary and open models |
Installation and Setup
First, install the Ollama application on your system. Visit ollama.com and download the installer for your operating system, or use the one-line installer on Linux and macOS.
# macOS / Linux – run in your terminal (not Python)
curl -fsSL https://ollama.com/install.sh | sh
# Windows – download the installer from https://ollama.com/download
After installation, the Ollama service starts automatically in the background and listens on http://localhost:11434.
Next, install the official Python client library.
pip install ollama
Verify that the service is reachable from Python.
import ollama
# Check that the Ollama service is reachable
info = ollama.ps()
print("Ollama service is running.")
print(info)
Output:
Ollama service is running.
models=[]
Pulling a Model
Before running inference, you must download a model. We will use Llama 3.2 (3B), a compact and capable model that runs comfortably on a modern CPU with 8 GB of RAM. You can pull it from the terminal or from within Python. The download is a one-time operation; the model is cached locally for all future use.
# Pull from the terminal (recommended for first-time setup)
# ollama pull llama3.2
# Or pull programmatically from Python
import ollama
ollama.pull("llama3.2")
print("Model downloaded and ready.")
You can list all locally available models at any time with ollama.list().
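For example, the short sketch below prints every model cached on the machine. The exact fields on each entry (name, size, modification time) can vary slightly between versions of the Python client, so printing the raw entries is a safe way to see what is available.
import ollama
# List all models that have been pulled to this machine
result = ollama.list()
# Print each cached model entry as-is; inspect the output to see which
# fields (name, size, modified_at, ...) your client version exposes.
for model in result["models"]:
    print(model)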
Some other popular lightweight models worth trying are shown in the table below.
| Model Name | Pull Command | Size | Best For |
|---|---|---|---|
| Llama 3.2 3B | ollama pull llama3.2 | 2.0 GB | General chat, low RAM |
| Mistral 7B | ollama pull mistral | 4.1 GB | Reasoning, instruction following |
| Gemma 2 2B | ollama pull gemma2:2b | 1.6 GB | Fast responses, low memory |
| Phi-3 Mini | ollama pull phi3 | 2.2 GB | Coding, reasoning tasks |
| Qwen2.5 7B | ollama pull qwen2.5 | 4.7 GB | Multilingual, coding |
Basic Chat Completion
The ollama.chat() function sends a list of messages to a locally running model and returns a response object. The message format follows the same role / content structure used by the OpenAI Chat API.
import ollama
response = ollama.chat(
model="llama3.2",
messages=[
{
"role": "system",
"content": "You are a concise and helpful AI assistant."
},
{
"role": "user",
"content": "Explain what a transformer model is in two sentences."
}
]
)
print(response["message"]["content"])
Output:
The full response object also contains metadata such as the model name, token counts, and generation time. You can inspect it with print(response) to see all available fields.
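As a small, hedged example, the snippet below reads a few of those fields from the response; the field names (prompt_eval_count, eval_count, total_duration) follow the Ollama REST API, and the durations are reported in nanoseconds.
# Inspect generation metadata on the previous response.
# Field names follow the Ollama REST API; durations are in nanoseconds.
print("Model          :", response["model"])
print("Prompt tokens  :", response["prompt_eval_count"])
print("Output tokens  :", response["eval_count"])
print("Total time (s) :", response["total_duration"] / 1e9)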
Streaming Responses
By default, ollama.chat() waits for the entire response to be generated before returning. Setting stream=True returns a generator that yields each token chunk as it is produced, enabling a real-time typing effect in the terminal or a web interface.
import ollama
stream = ollama.chat(
model="llama3.2",
messages=[
{
"role": "user",
"content": "Write a short poem about the Python programming language."
}
],
stream=True
)
# Print each chunk as it arrives
for chunk in stream:
print(chunk["message"]["content"], end="", flush=True)
print() # newline after stream ends
Output:
In twilight realms of code and night
A serpent stirs, with syntax bright
Python's gentle voice whispers low
As loops and functions start to grow
Its vast libraries, like a treasure chest
Awaiting hands that dare to quest
NumPy, pandas, and more to explore
The world of data, forever in store
Indentation's subtle, yet so grand
A language that's both flexible and planned
For beginners and wizards alike to roam
In the realm of Python, home is not a place.
Each chunk dictionary contains a done boolean field that becomes True on the final chunk. The final chunk also carries the same token-count metadata as the non-streaming response.
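Here is a minimal sketch of that pattern, assuming the llama3.2 model from earlier: the loop keeps a reference to the last chunk and reads its token count once done becomes True (field names again follow the Ollama REST API).
import ollama
stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Name three common uses of Python."}],
    stream=True
)
last_chunk = None
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
    last_chunk = chunk
print()
# The final chunk (done == True) carries the generation statistics
if last_chunk is not None and last_chunk["done"]:
    print("Output tokens:", last_chunk["eval_count"])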
Multi-turn Conversation
Ollama is stateless — each call to ollama.chat() is independent. To maintain a multi-turn conversation, you must pass the full message history in every request. We manage this by appending each user message and assistant reply to a running messages list.
import ollama
def chat(messages: list, user_input: str) -> str:
"""Append user message, get reply, append assistant reply."""
messages.append({"role": "user", "content": user_input})
response = ollama.chat(model="llama3.2", messages=messages)
reply = response["message"]["content"]
messages.append({"role": "assistant", "content": reply})
return reply
# Initialise conversation with a system prompt
messages = [
{"role": "system", "content": "You are a helpful data science tutor."}
]
# Turn 1
reply = chat(messages, "What is overfitting in machine learning?")
print(f"User : What is overfitting in machine learning?")
print(f"Model : {reply}\n")
# Turn 2 – the model remembers the previous context
reply = chat(messages, "How can I prevent it?")
print(f"User : How can I prevent it?")
print(f"Model : {reply}\n")
# Turn 3
reply = chat(messages, "Give me a quick Python example using scikit-learn.")
print(f"User : Give me a quick Python example using scikit-learn.")
print(f"Model : {reply}")
Output:
**What is Overfitting?**
In machine learning, overfitting occurs when a model becomes too specialized to the
training data and performs well on it but poorly on new, unseen data. In other words,
a model that has seen the entire dataset during training will perform very well on
the same data, but it may not generalize or make predictions on new, distinct data.
Think of overfitting like fitting a key into a lock too tightly. The model fits the
training data so perfectly that it becomes "locked" to the specific patterns and
relationships in the data, making it impossible for it to fit other keys (new data).
On the second turn, the model correctly interprets "How can I prevent it?" as referring to overfitting — demonstrating that the full conversation history is being passed and understood.
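One practical consequence is that the messages list grows with every turn and can eventually exceed the model's context window in long sessions. A simple mitigation (plain list handling, not part of the Ollama API) is to keep the system prompt plus only the most recent messages; the helper below is a hedged sketch of that idea.
def trim_history(messages: list, max_messages: int = 8) -> list:
    """Keep the system prompt plus the most recent user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
# Apply before each request, for example:
# response = ollama.chat(model="llama3.2", messages=trim_history(messages))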
Generating Embeddings
Ollama can also generate text embeddings locally using the ollama.embeddings() function. This is useful when you need to compute semantic similarity or build a retrieval pipeline without sending data to an external service. A dedicated embedding model such as nomic-embed-text is recommended over a chat model for this purpose.
import ollama
import numpy as np
# Pull the embedding model once
# ollama pull nomic-embed-text
def get_embedding(text: str) -> np.ndarray:
response = ollama.embeddings(model="nomic-embed-text", prompt=text)
return np.array(response["embedding"])
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
sentences = [
"Machine learning is a subset of artificial intelligence.",
"AI and ML are closely related fields.",
"I enjoy baking bread on weekends.",
"Deep learning uses multi-layer neural networks.",
]
reference = sentences[0]
ref_emb = get_embedding(reference)
print(f"Reference: \"{reference}\"\n")
print(f"{'Score':<8} Sentence")
print("-" * 65)
for sent in sentences[1:]:
emb = get_embedding(sent)
score = cosine_similarity(ref_emb, emb)
print(f"{score:<8.4f} {sent}")
Output:
Reference: "Machine learning is a subset of artificial intelligence."
Score Sentence
-----------------------------------------------------------------
0.6231 AI and ML are closely related fields.
0.4017 I enjoy baking bread on weekends.
0.7601 Deep learning uses multi-layer neural networks.
The locally generated embeddings correctly rank the Deep learning sentence as most similar to the reference and assign a lower score to the unrelated baking sentence — all without any data leaving your machine.
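Building on the helper functions above, a tiny retrieval step is straightforward: embed a query, score every candidate sentence, and keep the best matches. The sketch below reuses get_embedding and cosine_similarity from the previous snippet; top_k is a hypothetical helper, not part of the Ollama library.
def top_k(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q_emb = get_embedding(query)
    scored = [(cosine_similarity(q_emb, get_embedding(d)), d) for d in documents]
    scored.sort(reverse=True)
    return scored[:k]
for score, doc in top_k("What is deep learning?", sentences):
    print(f"{score:.4f}  {doc}")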
Using the OpenAI-Compatible API
Ollama exposes an OpenAI-compatible REST endpoint at http://localhost:11434/v1. This means you can use the official openai Python client — or any library built on top of it — with zero code changes, simply by pointing it at the local base URL. This is especially convenient for migrating existing OpenAI-based code to run fully offline.
from openai import OpenAI
# Point the OpenAI client at the local Ollama server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # any non-empty string works
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "system", "content": "You are a concise Python expert."},
{"role": "user", "content": "Show me how to read a CSV file with pandas."}
],
temperature=0.7
)
print(response.choices[0].message.content)
Output:
**Reading a CSV File with Pandas**
=====================================
Here's an example of how to read a CSV file using the pandas library in Python:
```python
import pandas as pd
# Function to read a CSV file
def read_csv_file(file_path):
try:
# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
return df
except FileNotFoundError:
print(f"File not found: {file_path}")
return None
except pd.errors.EmptyDataError:
print(f"No data in file: {file_path}")
return None
except pd.errors.ParserError as e:
print(f"Error parsing file: {e}")
return None
# Usage
file_path = 'data.csv' # Replace with your CSV file path
df = read_csv_file(file_path)
if df is not None:
print(df.head()) # Print the first few rows of the DataFrame
```
In this example:
* We define a function `read_csv_file` that takes the file path as input and
returns the corresponding DataFrame.
* The `pd.read_csv()` function reads the CSV file into a DataFrame, which is
then returned by the function.
* We handle potential errors that may occur while reading the file, such as
file not found, empty data, or parsing errors.
You can replace `'data.csv'` with your actual CSV file path to read and
display its contents using pandas. The `head()` method is used to print the
first few rows of the DataFrame for easy viewing.
Because the interface is identical to the OpenAI SDK, you can also use features such as stream=True, temperature, max_tokens, and structured output parsing, all routed to your local model.
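As a quick illustration of one of those features, the sketch below streams a completion through the same local endpoint; it assumes the client object and the llama3.2 model from the previous example.
# Stream a completion from the local Ollama server using the OpenAI SDK
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize what pandas is in one sentence."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()  # newline after stream ends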
Conclusion
In this post, we briefly learned what Ollama is and how to use it to run a local large language model entirely on your own hardware using Python. We covered installation and setup, pulling models, basic chat completion, streaming responses, managing multi-turn conversations, generating local embeddings, and using the OpenAI-compatible API. Running LLMs locally with Ollama is an excellent option for privacy-sensitive applications, offline development, and cost-free experimentation.