Dimensionality Reduction Example with Factor Analysis in Python

     Factor Analysis is a technique used to express data with a reduced number of variables. Reducing the number of variables is a helpful way to simplify a large dataset without losing its general structure.

    The Scikit-learn API provides the FactorAnalysis model, which computes a maximum likelihood estimate of the loading matrix using an SVD-based approach. In this tutorial, we'll briefly learn how to use the FactorAnalysis model to reduce the data dimension and visualize the output in Python (a short synthetic sketch follows the outline below). The tutorial covers:

  1. MNIST dataset Projection with Factor Analysis
  2. Image data Factor Analysis and visualizing
  3. Source code listing
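
Before working with real data, the sketch below illustrates the idea on a small synthetic dataset: observations are generated from two hidden factors through a loading matrix, and FactorAnalysis recovers a two-dimensional representation and an estimate of that loading matrix. The shapes and values here are illustrative assumptions, not part of the tutorial's dataset.

from sklearn.decomposition import FactorAnalysis
import numpy as np

rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 2))                         # 2 hidden factors
loadings = rng.normal(size=(2, 6))                         # true loading matrix
x = latent @ loadings + 0.1 * rng.normal(size=(200, 6))    # observed data with noise

fa = FactorAnalysis(n_components=2, random_state=0)
z = fa.fit_transform(x)
print(z.shape)                # (200, 2)  reduced representation
print(fa.components_.shape)   # (2, 6)    estimated loading matrix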

We'll start by loading the required libraries and functions.

from sklearn.decomposition import FactorAnalysis
from keras.datasets import mnist
from numpy import reshape
from numpy import where
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
   

MNIST dataset projection with factor analysis

    We load the MNIST handwritten digit dataset provided by the Keras library. We'll check the dimensions of the x part of the data and reshape the three-dimensional image array into a two-dimensional array of flattened pixels.
 
(x_train, y_train), (_ , _) = mnist.load_data()
print(x_train.shape)
 
(60000, 28, 28)
 
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
 
(60000, 784) 
 
Next, we'll define the model by using the FactorAnalysis class; here, the n_components parameter defines the number of target dimensions.
 
fa = FactorAnalysis(n_components=2, random_state=123)
z = fa.fit_transform(x_mnist)  
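
Besides the transformed coordinates in z, the fitted model also stores the estimated loading matrix, the per-feature noise variances, and the feature means. An optional quick check of their shapes, not part of the original flow, looks like this:

print(fa.components_.shape)       # (2, 784)  estimated loading matrix
print(fa.noise_variance_.shape)   # (784,)    per-pixel noise variance
print(fa.mean_.shape)             # (784,)    per-pixel empirical mean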
 
 
To visualize the transformed data, we'll collect the output components in a dataframe and plot them with the 'seaborn' library's scatterplot() function. We set the color palette size to 10 because there are 10 categories in the label data.

df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]

sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                palette=sns.color_palette("hls", 10),
                data=df).set(title="MNIST data projection with Factor Analysis")
 
 

    The plot shows a two-dimensional visualization of the MNIST data. The colors indicate the target digits and where their feature data fall in the 2D space.
 
 
Image data Factor Analysis and visualization

    Next, we'll apply the factor analysis method to image data. Here, we use only the x and y data for the digit '3'. We can extract and reshape the data as below.
 
digit3_y = where(y_train==3)
digit3_x = x_train[digit3_y]

x_mnist = reshape(digit3_x, [digit3_x.shape[0], digit3_x.shape[1]*digit3_x.shape[2]])
print(x_mnist.shape)
 
(6131, 784) 
 
Here, we have 6131 sample images with 784 features each. We'll fit a FactorAnalysis model with 10 components on the x_mnist data and visualize the learned components (rows of the loading matrix) as 28x28 images.
 
fa = FactorAnalysis(n_components=10, random_state=123)
z = fa.fit(x_mnist)   # fit() returns the fitted estimator, so z is the fitted model

print(z.components_.shape)   # (10, 784) estimated loading matrix

plt.subplots_adjust(wspace=0, hspace=0)
plt.tight_layout()
plt.gray()
for i in range(0, 9):
    plt.subplot(3, 3, i + 1)
    plt.tick_params(labelbottom=False)
    plt.tick_params(labelleft=False)
    plt.imshow(z.components_[i].reshape(28, 28))

plt.show()
 
 
    The plot shows nine of the ten learned components rendered as 28x28 images.
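
Since the loading matrix maps the latent factors back to pixel space, we can also roughly reconstruct a digit from its 10-factor representation. This reconstruction step is an extra sketch, not part of the original tutorial code; it only uses the transform() method and standard attributes of the fitted model:

z10 = fa.transform(x_mnist)                   # latent scores, shape (6131, 10)
x_hat = z10[0] @ fa.components_ + fa.mean_    # approximate reconstruction of the first image

plt.subplot(1, 2, 1)
plt.imshow(x_mnist[0].reshape(28, 28))
plt.title("original")
plt.subplot(1, 2, 2)
plt.imshow(x_hat.reshape(28, 28))
plt.title("reconstruction")
plt.show()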
 
    In this tutorial, we've briefly learned how to use Scikit-learn's FactorAnalysis model to reduce the dimensions of data in Python. The full source code is listed below.
 
 
 Source code listing

from sklearn.decomposition import FactorAnalysis
from keras.datasets import mnist
from numpy import reshape
from numpy import where
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
 
 
(x_train, y_train), (_ , _) = mnist.load_data()
print(x_train.shape) 

x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)

fa = FactorAnalysis(n_components=2, random_state=123)
z = fa.fit_transform(x_mnist)
 
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]

sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                palette=sns.color_palette("hls", 10),
                data=df).set(title="MNIST data projection with Factor Analysis")



digit3_y = where(y_train==3)
digit3_x = x_train[digit3_y]

x_mnist = reshape(digit3_x, [digit3_x.shape[0], digit3_x.shape[1]*digit3_x.shape[2]])
print(x_mnist.shape)

fa = FactorAnalysis(n_components=10, random_state=123)
z = fa.fit(x_mnist)

print(z.components_.shape)

plt.subplots_adjust(wspace=0, hspace=0)
plt.tight_layout()
plt.gray()
for i in range(0, 9):	
    plt.subplot(3, 3, i + 1) 
    plt.tick_params(labelbottom=False)
    plt.tick_params(labelleft=False)
    plt.imshow(z.components_[i].reshape(28,28), )
    
plt.show() 
  

