DataTechNotes: Agglomerative Clustering Example in Python

Agglomerative clustering is a hierarchical clustering technique that groups similar data points into clusters. Hierarchical clustering can work either 'top-down' (divisive) or 'bottom-up' (agglomerative). The agglomerative method takes the bottom-up approach: each element starts as its own cluster, and the closest clusters are merged step by step, according to a linkage criterion, until the desired number of clusters remains.
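
The bottom-up merging idea can be sketched in a few lines of NumPy. This is only an illustrative, brute-force version using single linkage (distance between the closest members of two clusters); it is not how scikit-learn implements the method, and the function name naive_agglomerative and the toy points are made up for the example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    # Start with every point in its own cluster (each cluster is a list of indices)
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = dists[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Merge cluster b into cluster a, then drop b (b > a, so a stays valid)
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Hypothetical toy data: two tight pairs and one isolated point
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
toy_clusters = naive_agglomerative(toy_points, n_clusters=3)
print(toy_clusters)
```

With three clusters requested, the two tight pairs end up merged and the isolated point stays on its own.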

Scikit-learn provides the AgglomerativeClustering class to implement the agglomerative clustering method. In this tutorial, we will explore how to cluster data using the AgglomerativeClustering method in Python. The tutorial covers the following topics:

Preparing the data
Clustering example with AgglomerativeClustering
Source code listing

We will begin by loading the required modules in Python.

 
# Import necessary libraries
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
 

Preparing the data

We'll create a sample dataset to cluster in this tutorial. We'll use the make_blobs function to generate the data and then visualize it in a plot.

 
# Set a random seed for reproducibility
np.random.seed(1)

# Generate synthetic data with 5 centers
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8)

# Create a scatter plot to visualize the data
plt.scatter(x[:, 0], x[:, 1])
plt.show()
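
As an alternative to seeding NumPy globally, make_blobs also accepts a random_state argument. The sketch below assumes a seed value of 1 and uses the illustrative names x_a and x_b; it only checks that two calls with the same seed return identical data.

```python
from sklearn.datasets import make_blobs
import numpy as np

# Two calls with the same random_state should produce the same points
x_a, y_a = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)
x_b, y_b = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)

# 300 samples with the default of 2 features each
print(x_a.shape)
# True when both arrays hold exactly the same values
print(np.array_equal(x_a, x_b))
```

Passing the seed directly keeps the reproducibility local to this call instead of depending on global state.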
 

Clustering example with AgglomerativeClustering

Next, we define the model with scikit-learn's AgglomerativeClustering class and fit it on the x data. The 'linkage' parameter specifies the merging criterion, that is, how the distance between two sets of observations is measured. The available options are 'ward', 'complete', 'average', and 'single'. The 'affinity' parameter defines the distance metric used to compute the linkage (recent scikit-learn releases rename it to 'metric'), and the 'n_clusters' parameter defines the number of clusters to find.

In this example, we will set the number of clusters using the 'n_clusters' parameter while keeping the other parameters at their default values.

# Initialize and fit an Agglomerative Clustering model with 5 clusters
aggloclust = AgglomerativeClustering(n_clusters=5).fit(x)

# Print the details of the clustering model
print(aggloclust)
 

The cluster labels are available in the labels_ attribute of the fitted model.

# Get cluster labels assigned by the model
labels = aggloclust.labels_
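
To see what labels_ holds, a small self-contained sketch may help. It regenerates comparable data under a fixed random_state (an assumption, so the points differ from the ones above) and uses fit_predict, which fits the model and returns the labels in one call. The names x_demo and labels_demo are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import numpy as np

# Illustrative data; random_state=1 is an assumed seed
x_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)

# fit_predict fits the model and returns one integer label per sample
labels_demo = AgglomerativeClustering(n_clusters=5).fit_predict(x_demo)

print(labels_demo.shape)         # one label per sample
print(np.unique(labels_demo))    # the distinct cluster ids
print(np.bincount(labels_demo))  # number of points in each cluster
```

The labels are plain integers from 0 to n_clusters - 1, so they can be passed directly as colors to a scatter plot.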
 

Finally, we can visualize the clustered points by separating them with different colors.

# Create a scatter plot with data points colored by cluster labels
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.show()
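
The linkage options described earlier can be compared in a similar way. The following sketch, again on illustrative data with an assumed random_state, fits one model per linkage and prints the resulting cluster sizes; the names x_demo and cluster_sizes are made up for the example.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import numpy as np

# Illustrative data; random_state=1 is an assumed seed
x_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)

cluster_sizes = {}
for linkage in ("ward", "complete", "average", "single"):
    # Fit the same data with a different merging criterion each time
    model = AgglomerativeClustering(n_clusters=5, linkage=linkage).fit(x_demo)
    # Count how many points fall into each of the 5 clusters
    cluster_sizes[linkage] = np.bincount(model.labels_)
    print(linkage, cluster_sizes[linkage])
```

Very uneven cluster sizes for one linkage can hint that it suits the data less well than the others, although a plot is usually the clearer check.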
  

We can also check the clustering results by changing the number of clusters.

 
# Create a figure to hold a 2x2 grid of subplots
f = plt.figure()

# Loop to perform Agglomerative Clustering with varying cluster counts
for i in range(2, 6):
    aggloclust = AgglomerativeClustering(n_clusters=i).fit(x)

    # Add a subplot for each cluster count
    f.add_subplot(2, 2, i - 1)

    # Create a scatter plot with cluster labels and a legend
    plt.scatter(x[:, 0], x[:, 1], s=5, c=aggloclust.labels_, 
             label="n_cluster-" + str(i))
    plt.legend()

# Display the subplots
plt.show()
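
Besides inspecting the plots, different cluster counts can be compared numerically. One common option, not used in the original example, is the silhouette score from sklearn.metrics (higher generally means better separated clusters). The sketch below is self-contained and uses illustrative data with an assumed random_state; the names x_demo and scores are made up.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data; random_state=1 is an assumed seed
x_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)

scores = {}
for k in range(2, 7):
    # Cluster with k clusters and score how well separated they are
    labels_k = AgglomerativeClustering(n_clusters=k).fit_predict(x_demo)
    scores[k] = silhouette_score(x_demo, labels_k)
    print(k, round(scores[k], 3))
```

The score is only a heuristic, so it is best read alongside the scatter plots rather than instead of them.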
 

In this tutorial, we've briefly explored how to cluster data using the agglomerative clustering method in Python. The method needs no initial centroids and builds a full merge hierarchy, but it can become slow and memory-hungry on large datasets. The source code is provided below.

Source code listing

 
# Import necessary libraries
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Set a random seed for reproducibility
np.random.seed(1)

# Generate synthetic data with 5 centers
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8)

# Create a scatter plot to visualize the data
plt.scatter(x[:, 0], x[:, 1])
plt.show()

# Initialize and fit an Agglomerative Clustering model with 5 clusters
aggloclust = AgglomerativeClustering(n_clusters=5).fit(x)

# Print the details of the clustering model
print(aggloclust)

# Get cluster labels assigned by the model
labels = aggloclust.labels_

# Create a scatter plot with data points colored by cluster labels
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.show()

# Create a figure to hold a 2x2 grid of subplots
f = plt.figure()

# Loop to perform Agglomerative Clustering with varying cluster counts
for i in range(2, 6):
    aggloclust = AgglomerativeClustering(n_clusters=i).fit(x)

    # Add a subplot for each cluster count
    f.add_subplot(2, 2, i - 1)

    # Create a scatter plot with cluster labels and a legend
    plt.scatter(x[:, 0], x[:, 1], s=5, c=aggloclust.labels_, 
             label="n_cluster-" + str(i))
    plt.legend()

# Display the subplots
plt.show()
 
