How to Run a Local LLM in Python with Ollama

In this post, we'll briefly learn what Ollama is, how to set it up, and how to run a local large language model (LLM) entirely on your own machine using Python. The tutorial covers:

  1. What is Ollama?
  2. Installation and Setup
  3. Pulling a Model
  4. Basic Chat Completion
  5. Streaming Responses
  6. Multi-turn Conversation
  7. Generating Embeddings
  8. Using the OpenAI-Compatible API
  9. Conclusion

Let's get started.


What is Ollama?

Ollama is an open-source tool that lets you download, manage, and run large language models locally on your own hardware — no internet connection, no API key, and no data leaving your machine. It bundles model weights, runtime, and a REST API server into a single, easy-to-install application available for macOS, Linux, and Windows.

Ollama supports a wide range of popular open-weight models including Llama 3, Mistral, Gemma, Phi-3, Qwen, and many others from the Ollama model library. Once a model is pulled, it is served by a local HTTP server that exposes both a simple native API and an OpenAI-compatible endpoint, making it a drop-in replacement for cloud LLM APIs during development and testing.

Feature        Ollama                                        Cloud LLM API
Privacy        Fully local, data never leaves the machine    Data sent to provider servers
Cost           Free (hardware cost only)                     Pay-per-token pricing
Internet       Required only for initial model download      Required for every request
Latency        Depends on local hardware                     Network + server latency
Model Choice   Open-weight models only                       Proprietary and open models

 

Installation and Setup

First, install the Ollama application on your system. Visit ollama.com and download the installer for your operating system, or use the one-line installer on Linux and macOS.


# macOS / Linux – run in your terminal (not Python)
curl -fsSL https://ollama.com/install.sh | sh

# Windows – download the installer from https://ollama.com/download
 

 After installation, the Ollama service starts automatically in the background and listens on http://localhost:11434. Next, install the official Python client library. 

 
pip install ollama
  

Verify that the service is reachable from Python by calling ollama.ps(), which lists the models currently loaded in memory.

 
import ollama

# Check that the Ollama service is reachable
info = ollama.ps()
print("Ollama service is running.")
print(info)

 Output: 


Ollama service is running.
models=[]
 

 

Pulling a Model

Before running inference, you must download a model. We will use Llama 3.2 (3B), a compact and capable model that runs comfortably on a modern CPU with 8 GB of RAM. You can pull it from the terminal or from within Python. The download is a one-time operation; the model is cached locally for all future use. 

 
# Pull from the terminal (recommended for first-time setup)
# ollama pull llama3.2

# Or pull programmatically from Python
import ollama

ollama.pull("llama3.2")
print("Model downloaded and ready.")
 

You can list all locally available models at any time with ollama.list(); a short listing sketch follows the table. Some other popular lightweight models worth trying are shown in the table below.

Model Name     Pull Command             Size    Best For
Llama 3.2 3B   ollama pull llama3.2     2.0 GB  General chat, low RAM
Mistral 7B     ollama pull mistral      4.1 GB  Reasoning, instruction following
Gemma 2 2B     ollama pull gemma2:2b    1.6 GB  Fast responses, low memory
Phi-3 Mini     ollama pull phi3         2.2 GB  Coding, reasoning tasks
Qwen2.5 7B     ollama pull qwen2.5      4.7 GB  Multilingual, coding
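
As a quick sketch, the loop below prints each locally cached model with its size. The model and size field names match a recent version of the Python client (older releases used name instead of model), so confirm the exact response shape with print(ollama.list()) on your install.

import ollama

# Print each locally cached model and its size (field names may vary by client version)
for m in ollama.list()["models"]:
    print(f"{m['model']:<24} {m['size'] / 1e9:.1f} GB")  # size is reported in bytes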

 

Basic Chat Completion

The ollama.chat() function sends a list of messages to a locally running model and returns a response object. The message format follows the same role / content structure used by the OpenAI Chat API. 

import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {
            "role": "system",
            "content": "You are a concise and helpful AI assistant."
        },
        {
            "role": "user",
            "content": "Explain what a transformer model is in two sentences."
        }
    ]
)

print(response["message"]["content"])
 

Output:

 
A transformer model is a type of neural network architecture that uses self-attention mechanisms to process sequential data, such as text or speech, and has achieved state-of-the-art results in various natural language processing tasks. Unlike traditional recurrent neural networks, transformers do not rely on recurrent connections to capture temporal dependencies, instead  using parallelized attention mechanisms to weigh the importance of input elements relative to each other.
 

The full response object also contains metadata such as the model name, token counts, and generation time. You can inspect it with print(response) to see all available fields; a few common ones are pulled out below.
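
Continuing from the example above, this sketch reads a few metadata fields that the Ollama REST API documents for chat responses; total_duration is reported in nanoseconds. Verify the exact field names with print(response) on your version.

# Read a few documented metadata fields from the previous response
print("Model         :", response["model"])
print("Prompt tokens :", response["prompt_eval_count"])
print("Output tokens :", response["eval_count"])
print("Total time (s):", response["total_duration"] / 1e9)  # nanoseconds to seconds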

 

Streaming Responses

By default, ollama.chat() waits for the entire response to be generated before returning. Setting stream=True returns a generator that yields each token chunk as it is produced, enabling a real-time typing effect in the terminal or a web interface. 

 
import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[
        {
            "role": "user",
            "content": "Write a short poem about the Python programming language."
        }
    ],
    stream=True
)

# Print each chunk as it arrives
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

print()  # newline after the stream ends
 

Output:


In twilight realms of code and night
A serpent stirs, with syntax bright
Python's gentle voice whispers low
As loops and functions start to grow

Its vast libraries, like a treasure chest
Awaiting hands that dare to quest
NumPy, pandas, and more to explore
The world of data, forever in store

Indentation's subtle, yet so grand
A language that's both flexible and planned
For beginners and wizards alike to roam
In the realm of Python, home is not a place.
 

Each chunk dictionary contains a done boolean field that becomes True on the final chunk. The final chunk also carries the same token-count metadata as the non-streaming response, as the sketch below shows.
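
A minimal sketch, assuming the same documented field names as above: keep a reference to the last chunk while streaming, then read its token count once done is True.

import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    stream=True
)

last_chunk = None
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
    last_chunk = chunk

# done is True only on the final chunk, which also carries the token counts
if last_chunk and last_chunk["done"]:
    print(f"\nOutput tokens: {last_chunk['eval_count']}")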

 

Multi-turn Conversation

Ollama is stateless — each call to ollama.chat() is independent. To maintain a multi-turn conversation, you must pass the full message history in every request. We manage this by appending each user message and assistant reply to a running messages list. 

 
import ollama

def chat(messages: list, user_input: str) -> str:
    """Append the user message, get a reply, append the assistant reply."""
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(model="llama3.2", messages=messages)
    reply = response["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

# Initialise the conversation with a system prompt
messages = [
    {"role": "system", "content": "You are a helpful data science tutor."}
]

# Turn 1
reply = chat(messages, "What is overfitting in machine learning?")
print("User : What is overfitting in machine learning?")
print(f"Model : {reply}\n")

# Turn 2 – the model remembers the previous context
reply = chat(messages, "How can I prevent it?")
print("User : How can I prevent it?")
print(f"Model : {reply}\n")

# Turn 3
reply = chat(messages, "Give me a quick Python example using scikit-learn.")
print("User : Give me a quick Python example using scikit-learn.")
print(f"Model : {reply}")

Output:

 
User : What is overfitting in machine learning?
Model : Overfitting is a fundamental concept in machine learning that's essential
to understand, especially for beginners.

**What is Overfitting?**

In machine learning, overfitting occurs when a model becomes too specialized to the
training data and performs well on it but poorly on new, unseen data. In other words,
a model that has seen the entire dataset during training will perform very well on
the same data, but it may not generalize or make predictions on new, distinct data.
Think of overfitting like fitting a key into a lock too tightly. The model fits the
training data so perfectly that it becomes "locked" to the specific patterns and
relationships in the data, making it impossible for it to fit other keys (new data).
 
**Symptoms of Overfitting:**

1. **High Training Accuracy**: A model performs exceptionally well on its own training data.
2. **Low Testing Accuracy**: The same model performs poorly on new, unseen data.
3. **Large Number of Features**: Using more features than necessary can lead to overfitting.
4. **Complex Model Architecture**: Models with many layers or complex connections are more 
prone to overfitting.

**Why Does Overfitting Happen?**

1. **Model Complexity**: Too much complexity in the model can lead to overfitting.
2. **Limited Training Data**: Insufficient training data can prevent the model from learning 
generalizable patterns.
3. **Noise and Irrelevant Features**: Including noise or irrelevant features in the dataset 
can cause the model to fit the noise rather than the underlying patterns.

**How Can We Prevent Overfitting?**

1. **Regularization Techniques**: Techniques like L1 and L2 regularization, dropout, and 
early stopping can help prevent overfitting.
2. **Data Augmentation**: Increasing the size of the training dataset through augmentation 
techniques can reduce overfitting.
3. **Cross-Validation**: Using cross-validation to evaluate model performance on multiple 
subsets of the data can help detect overfitting.
4. **Simplifying the Model Architecture**: Reducing the complexity of the model architecture 
can help prevent overfitting.

By understanding and addressing overfitting, you can improve the performance and 
generalizability of your machine learning models.
Do you have any specific questions or scenarios related to overfitting?

User : How can I prevent it?
Model : Preventing overfitting is a crucial step in building robust machine learning models. 
Here are some techniques to help you prevent overfitting:

**Regularization Techniques**

1. **L1 and L2 Regularization**: Add a penalty term to the loss function that discourages 
large weights. This helps reduce model complexity.
* L1 (Lasso) regularization: `α * |w|`, where `α` is the regularization parameter.
* L2 (Ridge) regularization: `α * w^2`, where `α` is the regularization parameter.
2. **Dropout**: Randomly drop out units during training to prevent overreliance on individual 
features.
3. **Early Stopping**: Stop training when the model's performance on the validation set 
starts to degrade.

**Data Augmentation**

1. **Image Data Augmentation**: Apply transformations like rotation, scaling, and flipping 
to increase dataset size without generating new data.
2. **Text Data Augmentation**: Use techniques like word embedding, synonym replacement, or 
back-translation to increase diversity in text data.
3. **Time Series Data Augmentation**: Use time series decomposition techniques like seasonal 
decomposition or differencing to generate additional data points.

**Model Simplification**

1. **Reduce Model Capacity**: Decrease the number of layers, neurons, or parameters in your 
model.
2. **Use Pre-Trained Models**: Leverage pre-trained models and fine-tune them for your 
specific task to reduce overfitting.
3. **Ensemble Methods**: Combine multiple models with different architectures or 
hyperparameters to improve overall performance.

**Other Techniques**

1. **Cross-Validation**: Evaluate model performance on multiple subsets of the data to detect 
overfitting.
2. **Batch Normalization**: Normalize inputs to each layer to reduce internal covariate shift 
and improve stability.
3. **Data Filtering**: Remove noise, outliers, or irrelevant features from your dataset.

**Choosing the Right Hyperparameters**

1. **Regularization Strength**: Choose a regularization strength that balances model 
complexity with overfitting prevention.
2. **Learning Rate**: Optimize the learning rate to prevent overshooting and under-shooting 
during training.
3. **Number of Epochs**: Balance the number of epochs with the risk of overfitting.

**Remember**

Overfitting is often a trade-off between model accuracy and generalizability. By using these 
techniques, you can find a balance that works for your specific problem.

Which technique would you like to explore further?

User : Give me a quick Python example using scikit-learn.
Model : Here's an example of overfitting prevention in scikit-learn using regularization and 
dropout:

**Example:**

Let's create a simple linear regression model on the famous Iris dataset, which is prone to 
overfitting.
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load iris dataset
iris = load_iris()
X = iris.data[:, :2] # Use only two features for simplicity
y = iris.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a simple linear regression model
model = LinearRegression()

# Train the model with regularization (L1) and dropout
alpha = 0.01
dropout_rate = 0.5

model.fit(X_train + np.random.rand(X_train.shape[0], 2) * alpha, y_train,
          regularization=None, solver='lbfgs', max_iter=1000)

# Train the model with dropout
model.drop_out(dropout_rate)
model.fit(X_train, y_train)

# Evaluate model performance on test set
y_pred_l1 = model.predict(X_test)
y_pred_dropout = model.predict(X_test)

mse_l1 = mean_squared_error(y_test, y_pred_l1)
mse_dropout = mean_squared_error(y_test, y_pred_dropout)

print(f"L1 Regularization MSE: {mse_l1:.2f}")
print(f"Dropout MSE: {mse_dropout:.2f}")

# Plot the models' performance curves
import matplotlib.pyplot as plt

plt.plot(mse_l1, label='L1 Regularization')
plt.plot(mse_dropout, label='Dropout')
plt.legend()
plt.show()
```
In this example, we train a linear regression model on two features of the Iris dataset using 
L1 regularization and dropout. We also plot the models' performance curves to compare their 
accuracy.

The `alpha` parameter controls the strength of regularization (L1), while the `dropout_rate` 
parameter controls the proportion of units that are randomly dropped during training.

By applying regularization and dropout, we can prevent overfitting and improve our model's 
generalizability on new data.
 

On the second turn, the model correctly interprets "How can I prevent it?" as referring to overfitting, demonstrating that the full conversation history is being passed and understood. (Note that the scikit-learn snippet generated on the third turn is not valid API usage; LinearRegression has no solver, regularization, or drop_out options. Small local models can hallucinate code even while tracking context correctly.) You can extend the same helper into an interactive chat loop, as sketched below.
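
A minimal sketch of such a loop, reusing the chat() helper defined above; the exit convention here is this post's own, not part of the Ollama API.

# Simple terminal chat loop built on the chat() helper above
messages = [{"role": "system", "content": "You are a helpful data science tutor."}]

while True:
    user_input = input("You  : ").strip()
    if not user_input or user_input.lower() in {"exit", "quit"}:
        break  # empty input or 'exit'/'quit' ends the session
    print(f"Model: {chat(messages, user_input)}\n")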

 

Generating Embeddings

Ollama can also generate text embeddings locally using the ollama.embeddings() function. This is useful when you need to compute semantic similarity or build a retrieval pipeline without sending data to an external service. A dedicated embedding model such as nomic-embed-text is recommended over a chat model for this purpose. 

 
import ollama
import numpy as np

# Pull the embedding model once
# ollama pull nomic-embed-text

def get_embedding(text: str) -> np.ndarray:
    response = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return np.array(response["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = [
    "Machine learning is a subset of artificial intelligence.",
    "AI and ML are closely related fields.",
    "I enjoy baking bread on weekends.",
    "Deep learning uses multi-layer neural networks.",
]

reference = sentences[0]
ref_emb = get_embedding(reference)

print(f"Reference: \"{reference}\"\n")
print(f"{'Score':<8} Sentence")
print("-" * 65)

for sent in sentences[1:]:
    emb = get_embedding(sent)
    score = cosine_similarity(ref_emb, emb)
    print(f"{score:<8.4f} {sent}")

 Output:

 
Reference: "Machine learning is a subset of artificial intelligence."

Score Sentence
-----------------------------------------------------------------
0.6231 AI and ML are closely related fields.
0.4017 I enjoy baking bread on weekends.
0.7601 Deep learning uses multi-layer neural networks.
 

The locally generated embeddings correctly rank the deep-learning sentence as most similar to the reference and assign a lower score to the unrelated baking sentence, all without any data leaving your machine. The same two functions extend naturally to a small retrieval helper, sketched below.
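
As a minimal sketch under the same assumptions (nomic-embed-text pulled locally), the hypothetical retrieve() helper below reuses get_embedding() and cosine_similarity() from above to return the top-k most similar sentences to a query.

# Hypothetical top-k retrieval helper built on get_embedding() and cosine_similarity()
def retrieve(query: str, documents: list, k: int = 2) -> list:
    query_emb = get_embedding(query)
    scored = [(cosine_similarity(query_emb, get_embedding(doc)), doc) for doc in documents]
    return sorted(scored, reverse=True)[:k]

for score, doc in retrieve("How are neural networks trained?", sentences):
    print(f"{score:.4f}  {doc}")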

 

Using the OpenAI-Compatible API

Ollama exposes an OpenAI-compatible REST endpoint at http://localhost:11434/v1. This means you can use the official openai Python client, or any library built on top of it, by changing only the base URL and API key. This is especially convenient for migrating existing OpenAI-based code to run fully offline.

 
from openai import OpenAI

# Point the OpenAI client at the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any non-empty string works; Ollama does not check the key
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a concise Python expert."},
        {"role": "user", "content": "Show me how to read a CSV file with pandas."}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)

 Output:

 
 **Reading a CSV File with Pandas**
=====================================

Here's an example of how to read a CSV file using the pandas library in Python:

```python
import pandas as pd

# Function to read a CSV file
def read_csv_file(file_path):
    try:
        # Read the CSV file into a DataFrame
        df = pd.read_csv(file_path)
        return df

    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None

    except pd.errors.EmptyDataError:
        print(f"No data in file: {file_path}")
        return None

    except pd.errors.ParserError as e:
        print(f"Error parsing file: {e}")
        return None

# Usage
file_path = 'data.csv'  # Replace with your CSV file path
df = read_csv_file(file_path)

if df is not None:
    print(df.head())  # Print the first few rows of the DataFrame
```

In this example:

* We define a function `read_csv_file` that takes the file path as input and
returns the corresponding DataFrame.
* The `pd.read_csv()` function reads the CSV file into a DataFrame, which is
then returned by the function.
* We handle potential errors that may occur while reading the file, such as
file not found, empty data, or parsing errors.

You can replace `'data.csv'` with your actual CSV file path to read and
display its contents using pandas. The `head()` method is used to print the
first few rows of the DataFrame for easy viewing.

Because the interface is identical to the OpenAI SDK, you can also use features such as stream=True, temperature, max_tokens, and structured output parsing, all routed to your local model. A streaming example is sketched below.
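
A minimal streaming sketch using the client defined above; the delta-based chunk handling is the standard OpenAI SDK pattern, here served by the local Ollama endpoint.

# Stream a completion through the OpenAI client, served locally by Ollama
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Name three benefits of local LLMs."}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:  # some chunks (e.g. the final one) carry no text
        print(content, end="", flush=True)

print()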

 

Conclusion

In this post, we briefly learned what Ollama is and how to use it to run a local large language model entirely on your own hardware using Python. We covered installation and setup, pulling models, basic chat completion, streaming responses, managing multi-turn conversations, generating local embeddings, and using the OpenAI-compatible API. Running LLMs locally with Ollama is an excellent option for privacy-sensitive applications, offline development, and cost-free experimentation. 

  

 
