T-distributed Stochastic Neighbor Embedding (T-SNE) is a tool for visualizing high-dimensional data. Based on Stochastic Neighbor Embedding, it is a nonlinear dimensionality reduction technique that projects data into a two- or three-dimensional space.
The Scikit-learn API provides the TSNE class to visualize data with the T-SNE method. In this tutorial, we'll briefly learn how to fit and visualize data with TSNE in Python. The tutorial covers:
- Iris dataset TSNE fitting and visualizing
- MNIST dataset TSNE fitting and visualizing
- Source code listing
We'll start by loading the required libraries and functions.
from sklearn.manifold import TSNE
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
Iris dataset TSNE fitting and visualizing
After loading the Iris dataset, we'll separate it into its data and label parts.
iris = load_iris()
x = iris.data
y = iris.target
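It is a good idea to check the shapes before fitting; the Iris data has 150 samples with 4 features each.
print(x.shape)   # (150, 4)
print(y.shape)   # (150,)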
Then, we'll define the model by using the TSNE class. Here, the n_components parameter defines the number of target dimensions, and verbose=1 prints the fitting log so we can check the progress.
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 150 samples in 0.001s...
[t-SNE] Computed neighbors for 150 samples in 0.006s...
[t-SNE] Computed conditional probabilities for sample 150 / 150
[t-SNE] Mean sigma: 0.509910
[t-SNE] KL divergence after 250 iterations with early exaggeration: 48.021526
[t-SNE] KL divergence after 1000 iterations: 0.122989
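The 91 nearest neighbors reported in the log follow from the default perplexity of 30 (scikit-learn uses roughly three times the perplexity as the neighbor count). If the projection looks too crowded or too scattered, perplexity is usually the first parameter to tune; a minimal sketch with an illustrative value of 40:
# same model, but with a custom perplexity; 40 is just an example value,
# typical choices lie roughly between 5 and 50
tsne_p40 = TSNE(n_components=2, perplexity=40, verbose=1, random_state=123)
z_p40 = tsne_p40.fit_transform(x)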
Next, we'll visualize the result in a plot. We'll collect the output components in a dataframe, then use the 'seaborn' library's scatterplot() to plot the data. We set the color palette size to 3 because there are 3 categories in the label data.
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data T-SNE projection")
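Since seaborn draws on the active matplotlib figure, the plot can also be saved to a file with the usual matplotlib call; a small optional sketch (the filename here is just an example):
import matplotlib.pyplot as plt

plt.savefig("iris_tsne.png", dpi=150, bbox_inches="tight")  # example output filename
plt.show()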
MNIST dataset TSNE fitting and visualizing
Next, we'll apply the same method to a larger dataset. The MNIST handwritten digit dataset works well for this purpose, and we can load it through the Keras API. We extract only the training part of the dataset because it is enough for testing TSNE here. Since TSNE takes a long time to process, I'll use only 3000 rows.
(x_train, y_train), (_ , _) = mnist.load_data()
x_train = x_train[:3000]
y_train = y_train[:3000]
print(x_train.shape)
(3000, 28, 28)
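If Keras is not available, the same digits can also be fetched with scikit-learn's fetch_openml(); a rough sketch, assuming an internet connection (the images come back already flattened to 784 features and the labels as strings):
from sklearn.datasets import fetch_openml

mnist_data = fetch_openml("mnist_784", version=1, as_frame=False)
x_alt = mnist_data.data[:3000]                # already flattened, shape (3000, 784)
y_alt = mnist_data.target[:3000].astype(int)  # convert string labels to integers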
Since MNIST is a three-dimensional array, we'll reshape it into a two-dimensional one.
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
(3000, 784)
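The same flattening can be written more compactly with numpy's -1 placeholder, which infers the feature dimension automatically; an equivalent alternative:
x_mnist = x_train.reshape(x_train.shape[0], -1)   # -1 infers 28*28 = 784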
Here, we have data with 784 features. Now, we'll project it into two dimensions with TSNE and visualize it in a plot.
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data T-SNE projection")
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 3000 samples in 0.922s...
[t-SNE] Computed neighbors for 3000 samples in 10.601s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3000
[t-SNE] Computed conditional probabilities for sample 2000 / 3000
[t-SNE] Computed conditional probabilities for sample 3000 / 3000
[t-SNE] Mean sigma: 480.474473
[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.815109
[t-SNE] KL divergence after 1000 iterations: 1.261612
The plot shows a two-dimensional visualization of the MNIST data. The colors indicate the target digits and show where their feature data falls in the 2D space.
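As the log shows, most of the time goes into the neighbor search over the 784 raw features. A common way to speed this up is to compress the data with PCA first and run TSNE on the reduced representation; a rough sketch (50 components is just an illustrative choice):
from sklearn.decomposition import PCA

x_pca = PCA(n_components=50, random_state=123).fit_transform(x_mnist)  # 784 -> 50 features
z_pca = TSNE(n_components=2, verbose=1, random_state=123).fit_transform(x_pca)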
In this tutorial, we've briefly learned how to fit and visualize data with TSNE in Python. The full source code is listed below.
Source code listing
from sklearn.manifold import TSNE
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
iris = load_iris()
x = iris.data
y = iris.target
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x)
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data T-SNE projection")
(x_train, y_train), (_ , _) = mnist.load_data()
x_train = x_train[:3000]
y_train = y_train[:3000]
print(x_train.shape)
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data T-SNE projection")