DataTechNotes: Dimensionality Reduction with Sparse, Gaussian Random Projection and PCA in Python

Dimensionality reduction is used when a dataset contains too many features. Reducing the number of features speeds up computation, shrinks the model, and makes large datasets easier to visualize. The goal is to keep the most informative structure in the data while discarding most of the original features.

In this tutorial, we'll briefly learn how to reduce data dimensions with sparse random projection, Gaussian random projection, and PCA in Python. Scikit-learn provides the SparseRandomProjection, GaussianRandomProjection, and PCA classes for this purpose. The tutorial covers:

Preparing the data
Gaussian random projection
Sparse random projection
PCA projection
MNIST data projection
Source code listing

We'll start by loading the required libraries and functions.

from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from keras.datasets import mnist
from numpy import reshape
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

Preparing the data

First, we'll generate simple random data for this tutorial. Here, we'll generate a dataset with 50,000 samples and 1000 features by using the make_regression() function. To apply the methods to a real dataset, we'll also use the MNIST handwritten digit database from the Keras API. The MNIST training set is a three-dimensional array (samples, height, width), so we'll reshape it into a two-dimensional array with one row per image.

x, _ = make_regression(n_samples=50000, n_features=1000)
print(x.shape)

(50000, 1000)

(x_train, y_train), (_ , _) = mnist.load_data()
print(x_train.shape)

(60000, 28, 28)

x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)

(60000, 784)
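The reshape step can also be written with -1 so that NumPy infers the flattened feature axis. The snippet below is a minimal sketch on a small stand-in array (the name images and its contents are made up for illustration, not taken from MNIST), showing that each row of the result is one image's pixels in row-major order.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 5 "images" of 28x28 pixels
images = np.arange(5 * 28 * 28).reshape(5, 28, 28)

# Keep the sample axis and let NumPy infer the flattened feature axis
flat = images.reshape(len(images), -1)
print(flat.shape)

# Each row holds one image's pixels in row-major order
same_as_ravel = np.array_equal(flat[0], images[0].ravel())
print(same_as_ravel)
```

The same call, x_train.reshape(len(x_train), -1), should give the (60000, 784) array used in the rest of the tutorial.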

Gaussian Random Projection

Gaussian random projection reduces dimensionality by multiplying the original input by a random matrix whose entries are drawn from a Gaussian distribution. We'll define the model with the GaussianRandomProjection class and set the number of components. Here, we'll shrink the feature data from 1000 to 200.

grp = GaussianRandomProjection(n_components=200)
grp_data = grp.fit_transform(x)

print(grp_data.shape)

(50000, 200)
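To make the idea concrete, here is a rough NumPy-only sketch of what a Gaussian random projection does. It is not the scikit-learn implementation; the toy data, the seed, and the variable names are assumptions made for this example. The random matrix has entries drawn from a normal distribution with variance 1 / n_components, and the projection is a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability

n_samples, n_features, n_components = 100, 1000, 200
x_small = rng.normal(size=(n_samples, n_features))  # hypothetical toy data

# Random matrix with entries drawn from N(0, 1 / n_components)
projection = rng.normal(
    loc=0.0, scale=1.0 / np.sqrt(n_components), size=(n_features, n_components)
)

# Projecting is a single matrix multiplication
x_proj = x_small @ projection
print(x_proj.shape)

# Compare one pairwise distance before and after the projection
d_before = np.linalg.norm(x_small[0] - x_small[1])
d_after = np.linalg.norm(x_proj[0] - x_proj[1])
print(d_before, d_after)
```

With this scaling, the two printed distances should be of similar size, which is the property that makes random projection useful: distances between points are roughly preserved in the smaller space.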

You can set the number of target components according to your analysis and target data.
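One way to get a starting point for the number of components is the Johnson-Lindenstrauss bound, which scikit-learn exposes as johnson_lindenstrauss_min_dim. The sketch below prints the bound for a few distortion tolerances; the eps values are arbitrary choices for illustration.

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

n_samples = 50000  # same sample count as the generated dataset

# Smaller eps means less allowed distortion and therefore more components
min_dims = {}
for eps in (0.1, 0.5, 0.9):
    min_dims[eps] = int(johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=eps))
    print(f"eps={eps}: at least {min_dims[eps]} components")
```

The bound depends only on the number of samples and the tolerance, not on the original number of features, and it is conservative. For a tight tolerance it can exceed the original feature count, so in practice a smaller value such as the 200 used here is often chosen and checked empirically.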

Sparse Random Projection

Sparse random projection reduces dimensionality by multiplying the original input by a sparse random matrix, which is cheaper to store and apply than a dense Gaussian matrix. We'll define the model with the SparseRandomProjection class and set the number of components. Here, we'll shrink the feature data from 1000 to 200.

srp = SparseRandomProjection(n_components=200)
srp_data = srp.fit_transform(x)

print(srp_data.shape)

(50000, 200)

You can set the number of target components according to your analysis and target data.
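To see how sparse the projection matrix actually is, we can inspect the fitted model. This is a small sketch on toy data (the array x_small, the seed, and the variable names are assumptions for this example). With the default density="auto", scikit-learn sets the density to 1 / sqrt(n_features).

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
x_small = rng.normal(size=(100, 1000))  # hypothetical toy data

srp_small = SparseRandomProjection(n_components=200, random_state=0)
srp_small.fit(x_small)

# components_ is a SciPy sparse matrix of shape (n_components, n_features)
matrix = srp_small.components_
nonzero_fraction = matrix.nnz / (matrix.shape[0] * matrix.shape[1])

print(matrix.shape)
print(srp_small.density_)   # density chosen by density="auto"
print(nonzero_fraction)     # observed share of non-zero entries
```

Only a small share of the entries is non-zero, so the projection needs far fewer multiplications than a dense matrix of the same shape.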

PCA Projection

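PCA (principal component analysis) projects the data onto the directions of greatest variance. We'll define the model with the PCA class from the decomposition module and set the number of components. Here, we'll shrink the feature data from 1000 to 200.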

pca = PCA(n_components=200)
pca_data = pca.fit_transform(x)

print(pca_data.shape)

(50000, 200)

You can set the number of target components according to your analysis and target data.
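With PCA, the explained variance offers a data-driven way to pick the number of components. The sketch below uses made-up toy data with five underlying factors (the data, seed, and the 0.95 threshold are assumptions for illustration) and passes a float to n_components so that PCA keeps just enough components to explain that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 latent factors mixed into 50 features, plus small noise
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
x_small = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

# A float in (0, 1) asks PCA to keep enough components to explain that
# share of the variance; svd_solver="full" is set explicitly for this mode
pca_small = PCA(n_components=0.95, svd_solver="full")
x_reduced = pca_small.fit_transform(x_small)

print(pca_small.n_components_)
print(np.cumsum(pca_small.explained_variance_ratio_))
```

The cumulative sum of explained_variance_ratio_ shows how much variance each additional component contributes, which can help justify a fixed choice such as 200.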

MNIST data projection

Now that we've seen how to reduce dimensions with Gaussian random projection, sparse random projection, and PCA, we can apply those methods to the MNIST dataset. For test purposes, we'll set the number of components to 2, which also lets us plot the results.

# Sparse random projection on 2 components
srp = SparseRandomProjection(n_components = 2)
z = srp.fit_transform(x_mnist)
df_srp = pd.DataFrame()
df_srp["y"] = y_train
df_srp["comp-1"] = z[:,0]
df_srp["comp-2"] = z[:,1]

# Gaussian random projection on 2 components
grp = GaussianRandomProjection(n_components = 2)
z = grp.fit_transform(x_mnist)
df_grp = pd.DataFrame()
df_grp["y"] = y_train
df_grp["comp-1"] = z[:,0]
df_grp["comp-2"] = z[:,1]

# PCA projection on 2 components
pca = PCA(n_components=2)
z = pca.fit_transform(x_mnist)
df_pca = pd.DataFrame()
df_pca["y"] = y_train
df_pca["comp-1"] = z[:,0]
df_pca["comp-2"] = z[:,1]
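The three blocks above repeat the same steps, so they could be folded into a small helper. The function project_to_frame below is a hypothetical name introduced for this sketch, and random toy arrays stand in for x_mnist and y_train so that the example runs without downloading MNIST.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.random_projection import (
    GaussianRandomProjection,
    SparseRandomProjection,
)


def project_to_frame(model, features, labels):
    """Fit a 2-component model and return labels plus both components."""
    z = model.fit_transform(features)
    return pd.DataFrame({"y": labels, "comp-1": z[:, 0], "comp-2": z[:, 1]})


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 784))   # toy stand-in for x_mnist
labels = rng.integers(0, 10, size=200)   # toy stand-in for y_train

models = {
    "srp": SparseRandomProjection(n_components=2, random_state=0),
    "grp": GaussianRandomProjection(n_components=2, random_state=0),
    "pca": PCA(n_components=2),
}
frames = {name: project_to_frame(m, features, labels) for name, m in models.items()}

for name, frame in frames.items():
    print(name, frame.shape)
```

Each resulting DataFrame has the same three columns as df_srp, df_grp, and df_pca, so it can be passed to the plotting code unchanged.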

We'll check the projected results by visualizing them in a plot.

fig, ax = plt.subplots(3,1, figsize=(10,20))
sns.scatterplot(x="comp-1", y="comp-2", hue=df_srp.y.tolist(),
                palette=sns.color_palette("hls", 10), data=df_srp, 
                ax=ax[0]).set(title='Sparse random projection')

sns.scatterplot(x="comp-1", y="comp-2", hue=df_grp.y.tolist(),
                palette=sns.color_palette("hls", 10), data=df_grp, 
                ax=ax[1]).set(title='Gaussian random projection')

sns.scatterplot(x="comp-1", y="comp-2", hue=df_pca.y.tolist(),
                palette=sns.color_palette("hls", 10), data=df_pca,
                ax=ax[2]).set(title="PCA projection")

# Display the figure when running as a script
plt.show()

The plots show the MNIST data projected onto two dimensions. Each color corresponds to a target digit, and each point marks where that image's features land in the projected space.

In this tutorial, we've briefly learned how to reduce data dimensions with sparse random projection, Gaussian random projection, and PCA in Python. The full source code is listed below.

Source code listing

from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from keras.datasets import mnist
from numpy import reshape
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

x, _ = make_regression(n_samples=50000, n_features=1000)
print(x.shape)

(x_train, y_train), (_ , _) = mnist.load_data()
print(x_train.shape)
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)


grp = GaussianRandomProjection(n_components=200)
grp_data = grp.fit_transform(x)
print(grp_data.shape)

srp = SparseRandomProjection(n_components=200)
srp_data = srp.fit_transform(x)
print(srp_data.shape)

pca = PCA(n_components=200)
pca_data = pca.fit_transform(x)
print(pca_data.shape)

# Sparse random projection on 2 components
srp = SparseRandomProjection(n_components = 2)
z = srp.fit_transform(x_mnist)
df_srp = pd.DataFrame()
df_srp["y"] = y_train
df_srp["comp-1"] = z[:,0]
df_srp["comp-2"] = z[:,1]

# Gaussian random projection on 2 components
grp = GaussianRandomProjection(n_components = 2)
z = grp.fit_transform(x_mnist)
df_grp = pd.DataFrame()
df_grp["y"] = y_train
df_grp["comp-1"] = z[:,0]
df_grp["comp-2"] = z[:,1]

# PCA projection on 2 components
pca = PCA(n_components=2)
z = pca.fit_transform(x_mnist)
df_pca = pd.DataFrame()
df_pca["y"] = y_train
df_pca["comp-1"] = z[:,0]
df_pca["comp-2"] = z[:,1]

fig, ax = plt.subplots(3,1, figsize=(10,20))
sns.scatterplot(x="comp-1", y="comp-2", hue=df_srp.y.tolist(),
                palette=sns.color_palette("hls", 10), data=df_srp, 
                ax=ax[0]).set(title='Sparse random projection')

sns.scatterplot(x="comp-1", y="comp-2", hue=df_grp.y.tolist(),
                palette=sns.color_palette("hls", 10), data=df_grp, 
                ax=ax[1]).set(title='Gaussian random projection')

sns.scatterplot(x="comp-1", y="comp-2", hue=df_pca.y.tolist(),
                palette=sns.color_palette("hls", 10), data=df_pca,
                ax=ax[2]).set(title="PCA projection")

# Display the figure when running as a script
plt.show()
