Principal Component Analysis (PCA) Example in Python

    Principal Component Analysis (PCA) is an unsupervised learning technique that reduces the number of variables in a dataset by projecting the feature data onto new dimensions; no label or response data is used in the analysis. The Scikit-learn API provides the PCA transformer, which learns the components of the data and projects the input data onto those learned components. 
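
    Before diving in, here is the basic pattern used throughout this tutorial: define the transformer, fit it, and project the data. A minimal, self-contained sketch (the data below is just a random placeholder):

from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(10, 3)         # placeholder data: 10 samples, 3 features
pca = PCA(n_components=2)         # keep two principal components
X_reduced = pca.fit_transform(X)  # learn the components and project the data
print(X_reduced.shape)            # (10, 2)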

    In this tutorial, we'll briefly learn how to do principal component analysis with the PCA function, change data dimensions, and visualize the projected data in Python. The tutorial covers:

  1. Extracting principal components
  2. Changing dimensions and visualizing
  3. Iris PCA example
  4. Source code listing

    We'll start by loading the required libraries and functions.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np 
 

Extracting principal components

    First, we'll generate simple random data for this tutorial. Here, we'll generate 2 feature data and visualize it in a plot.

n = 100
x1 = [i/10 for i in range(n)]
x2 = np.random.uniform(-3, 5, n)+ x1
x = np.array([x1, x2]).T
plt.scatter(x[:,0],x[:,1])
plt.grid(True)
plt.show()
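
    Note that no random seed is set, so every run generates a different cloud of points. If you want reproducible results, you can seed NumPy before generating the data:

np.random.seed(42)  # optional: fix the seed so each run produces the same data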


Then, we'll apply the PCA function. First, we'll define the transformer with PCA(), setting the n_components parameter to 2, and fit it on the x data. After fitting, we can get the component, mean, and covariance data. Since the data above is generated randomly without a fixed seed, your output will differ from the sample values shown below. 
 
pc = PCA(n_components=2)
pc = pc.fit(x)

print("components:", pc.components_)
print("mean:      ", pc.mean_)
print("covariance:", pc.get_covariance()) 
 
 
components: [[ 0.99640834 -0.08467831]
             [ 0.08467831  0.99640834]]
mean:       [5.84333333 3.05733333]
covariance: [[ 0.68569351 -0.042434  ]
             [-0.042434    0.18997942]]
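
    Beyond the attributes printed above, the fitted object also reports how much of the total variance each component captures, and we can confirm that the learned components form an orthonormal basis. A quick check, continuing with the pc object fitted above:

# fraction of the total variance captured by each component
print("explained variance ratio:", pc.explained_variance_ratio_)

# the component vectors are orthonormal: each row has unit length and
# the rows are mutually perpendicular
print(np.allclose(pc.components_ @ pc.components_.T, np.eye(2)))  # True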
 


Changing dimensions and visualizing

To project the data onto its principal components, we use the fit_transform() method and visualize both the original and the projected data in a plot. 

pca = PCA(n_components=2)
x_pca = pca.fit_transform(x)
plt.scatter(x[:,0], x[:,1], label="Original")
plt.scatter(x_pca[:,0], x_pca[:,1], label="Projected")
plt.legend(loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()
 


The plot illustrates the change of dimensions: the projected data is rotated onto the principal axes and centered around (0, 0).
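
    Under the hood, this projection is just centering followed by a rotation onto the learned axes. With the default whiten=False setting, we can reproduce fit_transform() by hand, reusing the pca object and x_pca array from above:

# center the data, then project it onto the component directions
manual = (x - pca.mean_) @ pca.components_.T
print(np.allclose(manual, x_pca))  # True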


Iris PCA example
 
     Next, we'll run a simple test with the Iris dataset. To keep it simple, we use only the sepal length and width columns. First, we'll load the Iris dataset, extract the target parts, and visualize the data in a plot.

iris = load_iris()
x = iris.data[:, (0,1)]
y = iris.target
feature = iris.feature_names[0:2]
labels = iris.target_names

pcadata = np.hstack((x, y.reshape(150,1)))
for p1, p2, t in pcadata:
    if t == 0:
        setosa = plt.scatter(p1, p2, color='r')
    elif t == 1:
        versicolor = plt.scatter(p1, p2, color='g')
    else:
        virginica = plt.scatter(p1, p2, color='b')
    
plt.legend((setosa, versicolor, virginica), 
           labels, loc='best',fancybox=True)
plt.xlabel(feature[0])
plt.ylabel(feature[1])
plt.grid(True)
plt.show()
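
    As a side note, the same figure can be drawn without the point-by-point loop by using a boolean mask per class. A sketch that produces an equivalent plot:

# plot each class with a single scatter call using a boolean mask
for cls, color in zip(range(3), ('r', 'g', 'b')):
    mask = (y == cls)
    plt.scatter(x[mask, 0], x[mask, 1], color=color, label=labels[cls])
plt.legend(loc='best', fancybox=True)
plt.xlabel(feature[0])
plt.ylabel(feature[1])
plt.grid(True)
plt.show()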
    

 
Next, we'll project the data with the PCA function and visualize the projection in a plot. 
 
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x)
pcadata = np.hstack((x_pca, y.reshape(150, 1)))
for p1, p2, t in pcadata:
    if t == 0:
        setosa = plt.scatter(p1, p2, color='r')
    elif t == 1:
        versicolor = plt.scatter(p1, p2, color='g')
    else:
        virginica = plt.scatter(p1, p2, color='b')
    
plt.legend((setosa, versicolor, virginica), 
           labels, loc='best',fancybox=True)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show() 



The plot shows the sepal data projected onto its two principal components, centered around the origin.
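
    Both examples above keep two components for two features, so no information is actually discarded. To see PCA used for real dimensionality reduction, here is a short sketch (the variable names are our own) that compresses all four Iris features into two components; the first two components capture roughly 98% of the total variance:

# project all four Iris measurements onto two principal components
pca4 = PCA(n_components=2)
x4_pca = pca4.fit_transform(iris.data)

print(x4_pca.shape)                    # (150, 2)
print(pca4.explained_variance_ratio_)  # approximately [0.92, 0.05]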
 
    In this tutorial, we've briefly learned how to use PCA to change the dimensions of feature data in Python. The full source code is listed below.
 
 
Source code listing

 
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

n = 100
x1 = [i/10 for i in range(n)]
x2 = np.random.uniform(-3, 5, n)+ x1
x = np.array([x1, x2]).T
plt.scatter(x[:,0],x[:,1])
plt.grid(True)
plt.show()

pc = PCA(n_components = 2)
pc = pc.fit(x)

print("components:", pc.components_)
print("mean:      ", pc.mean_)
print("covariance:", pc.get_covariance())

pca = PCA(n_components=2)
x_pca = pca.fit_transform(x)
plt.scatter(x[:,0], x[:,1], label="Original")
plt.scatter(x_pca[:,0], x_pca[:,1], label="Projected")
plt.legend(loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()


# Iris PCA example
iris = load_iris()
x = iris.data[:, (0,1)]
y = iris.target
feature = iris.feature_names[0:2]
labels = iris.target_names

pcadata = np.hstack((x, y.reshape(150,1)))
for p1, p2, t in pcadata:
    if t == 0:
        setosa = plt.scatter(p1, p2, color='r')
    elif t == 1:
        versicolor = plt.scatter(p1, p2, color='g')
    else:
        virginica = plt.scatter(p1, p2, color='b')
    
plt.legend((setosa, versicolor, virginica), 
           labels, loc='best',fancybox=True)
plt.xlabel(feature[0])
plt.ylabel(feature[1])
plt.grid(True)
plt.show()


pca = PCA(n_components=2)
x_pca = pca.fit_transform(x)
pcadata = np.hstack((x_pca, y.reshape(150, 1)))
for p1, p2, t in pcadata:
    if t == 0:
        setosa = plt.scatter(p1, p2, color='r')
    elif t == 1:
        versicolor = plt.scatter(p1, p2, color='g')
    else:
        virginica = plt.scatter(p1, p2, color='b')
    
plt.legend((setosa, versicolor, virginica), 
           labels, loc='best',fancybox=True)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show() 
  

