Agglomerative Clustering Example in Python

   Hierarchical clustering groups observation data with either a "top-down" (divisive) or a "bottom-up" (agglomerative) approach. Agglomerative clustering takes the bottom-up approach: each element starts in its own cluster, and the closest clusters are merged step by step according to a chosen criterion.
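
   To make the bottom-up idea concrete, here is a minimal sketch of the merging loop written with NumPy only. It is not how scikit-learn implements the method; the function name naive_agglomerative, the single-linkage rule, and the five sample points are assumptions chosen for illustration.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    # Every point starts as its own cluster, stored as a list of row indices.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Merge the closest pair found in this pass.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical sample points: two tight pairs and one isolated point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
result = naive_agglomerative(pts, 3)
print(result)
```

   The nested loops make this sketch slow for anything beyond a handful of points, so it is meant only to show the order of operations.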


   Scikit-learn provides the AgglomerativeClustering class to implement the agglomerative clustering method. In this tutorial, we'll learn how to cluster data with the AgglomerativeClustering class in Python. The tutorial covers:

  1. Preparing the data
  2. Clustering with the AgglomerativeClustering
  3. Source code listing

We'll start by loading the required modules in Python.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

Preparing the data

   We'll create a sample dataset to cluster in this tutorial. We'll use the make_blobs function to generate the data and visualize it in a plot.

np.random.seed(1)
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=.8)
plt.scatter(x[:,0], x[:,1])
plt.show()



Clustering with the AgglomerativeClustering

   Next, we'll define the model and fit it on the x data. The AgglomerativeClustering class has several parameters to set. The linkage parameter defines the merging criterion, that is, how the distance between two sets of observations is measured. The "ward", "complete", "average", and "single" methods can be used. The affinity parameter (named metric in recent scikit-learn releases) defines the distance metric used to compute the linkage. The number of clusters can be set with the n_clusters parameter.
   Here, we'll set n_clusters and keep the other parameters at their defaults.

aggloclust=AgglomerativeClustering(n_clusters=5).fit(x)
print(aggloclust)
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward', memory=None, n_clusters=5,
            pooling_func=)
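
   The linkage options mentioned above can be compared on the same kind of data. The sketch below is one possible way to do it: it regenerates the blobs with random_state=1 (an assumption, so the block runs on its own) and scores each linkage against the generated labels with adjusted_rand_score from sklearn.metrics. The names x_demo, y_demo, and ari_by_linkage are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Assumed stand-in for the tutorial data; y_demo holds the generating blob ids.
x_demo, y_demo = make_blobs(n_samples=300, centers=5, cluster_std=.8,
                            random_state=1)

ari_by_linkage = {}
for linkage in ["ward", "complete", "average", "single"]:
    model = AgglomerativeClustering(n_clusters=5, linkage=linkage).fit(x_demo)
    # Adjusted Rand index: 1.0 means the clusters match the blob ids exactly.
    ari_by_linkage[linkage] = adjusted_rand_score(y_demo, model.labels_)
    print(linkage, round(ari_by_linkage[linkage], 3))
```

   A comparison like this is only possible when reference labels exist, which is the case for generated data but rarely for real data.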
 
Next, we'll get the cluster labels from the fitted model.

labels = aggloclust.labels_
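
The labels_ attribute holds one integer cluster id per sample. As a quick check, the sketch below counts how many points fall into each cluster. It rebuilds the data with random_state=1 so that it runs on its own; x_demo and labels_demo are illustrative names, not part of the tutorial code.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Assumed stand-in for the tutorial data.
x_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=.8,
                       random_state=1)
labels_demo = AgglomerativeClustering(n_clusters=5).fit(x_demo).labels_

# Count the members of each cluster id.
cluster_ids, counts = np.unique(labels_demo, return_counts=True)
for cid, count in zip(cluster_ids, counts):
    print("cluster", cid, "has", count, "points")
```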

Finally, we'll visualize the clustered points by separating them with different colors.

plt.scatter(x[:,0], x[:,1], c=labels)
plt.show()




We can also check the clustering results by changing the number of clusters.

f = plt.figure()
for i in range(2, 6):
    # Fit a model for each cluster count and draw it in its own subplot
    aggloclust = AgglomerativeClustering(n_clusters=i).fit(x)
    f.add_subplot(2, 2, i-1)
    plt.scatter(x[:,0], x[:,1], s=5,
                c=aggloclust.labels_, label="n_cluster-"+str(i))
    plt.legend()
plt.show()
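
Besides comparing the plots by eye, a numeric score can help when choosing the number of clusters. The sketch below uses silhouette_score from sklearn.metrics, which is a technique this tutorial does not otherwise cover, on data regenerated with random_state=1. The names x_demo, scores, and best_k are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed stand-in for the tutorial data.
x_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=.8,
                       random_state=1)

scores = {}
for k in range(2, 6):
    labels_k = AgglomerativeClustering(n_clusters=k).fit_predict(x_demo)
    # Higher silhouette values indicate tighter, better separated clusters.
    scores[k] = silhouette_score(x_demo, labels_k)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("highest silhouette score at n_clusters =", best_k)
```

The highest score is a hint rather than a rule, so it is still worth looking at the plots before settling on a value.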


   In this tutorial, we've briefly learned how to cluster data with the agglomerative clustering method in Python. The model is simple to set up and works well on small datasets like this one, though its cost grows quickly with the number of samples. The source code is listed below.


Source code listing 

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=.8)
plt.scatter(x[:,0], x[:,1])
plt.show()

aggloclust=AgglomerativeClustering(n_clusters=5).fit(x)
print(aggloclust)
labels = aggloclust.labels_

plt.scatter(x[:,0], x[:,1], c=labels)
plt.show()

f = plt.figure()
for i in range(2, 6):
    # Fit a model for each cluster count and draw it in its own subplot
    aggloclust = AgglomerativeClustering(n_clusters=i).fit(x)
    f.add_subplot(2, 2, i-1)
    plt.scatter(x[:,0], x[:,1], s=5,
                c=aggloclust.labels_, label="n_cluster-"+str(i))
    plt.legend()
plt.show()
