T-distributed Stochastic Neighbor Embedding (T-SNE) is a tool for visualizing high-dimensional data. Based on Stochastic Neighbor Embedding, it is a nonlinear dimensionality reduction technique that projects data into a two- or three-dimensional space.
The Scikit-learn API provides the TSNE class to visualize data with the T-SNE method. In this tutorial, we'll briefly learn how to fit and visualize data with TSNE in Python. The tutorial covers:
- Iris dataset TSNE fitting and visualizing
- MNIST dataset TSNE fitting and visualizing
- Source code listing
We'll start by loading the required libraries and functions.
from sklearn.manifold import TSNE
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
Iris dataset TSNE fitting and visualizing
After loading the Iris dataset, we'll separate it into its data and label parts.
iris = load_iris()
x = iris.data
y = iris.target
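It is a good idea to check the shapes before fitting; the Iris data has 150 samples with 4 features each.
print(x.shape)   # (150, 4)
print(y.shape)   # (150,)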
Then, we'll define the model by using the TSNE class. Here, the n_components parameter defines the number of target dimensions, and verbose=1 prints the fitting log so we can check the progress.
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 150 samples in 0.001s...
[t-SNE] Computed neighbors for 150 samples in 0.006s...
[t-SNE] Computed conditional probabilities for sample 150 / 150
[t-SNE] Mean sigma: 0.509910
[t-SNE] KL divergence after 250 iterations with early exaggeration: 48.021526
[t-SNE] KL divergence after 1000 iterations: 0.122989
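The 91 nearest neighbors reported in the log follow from the default perplexity of 30 (scikit-learn uses roughly three times the perplexity as the neighbor count). If the projection looks too crowded or too scattered, perplexity is usually the first parameter to tune; a minimal sketch with an illustrative value of 40:
# same model, but with a custom perplexity; 40 is just an example value,
# typical choices lie roughly between 5 and 50
tsne_p40 = TSNE(n_components=2, perplexity=40, verbose=1, random_state=123)
z_p40 = tsne_p40.fit_transform(x)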
Next, we'll visualize the result in a plot. We'll collect the output components in a dataframe, then use the 'seaborn' library's scatterplot() to plot the data. We set the color palette size to 3 because there are 3 categories in the label data.
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data T-SNE projection")
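Since seaborn draws on the active matplotlib figure, the plot can also be saved to a file with the usual matplotlib call; a small optional sketch (the filename here is just an example):
import matplotlib.pyplot as plt

plt.savefig("iris_tsne.png", dpi=150, bbox_inches="tight")  # example output filename
plt.show()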
MNIST dataset TSNE fitting and visualizing
Next, we'll apply the same method to a larger dataset. The MNIST handwritten digit dataset works well for this purpose, and we can load it through the Keras API. We extract only the training part of the dataset because it is enough for testing TSNE here. Since TSNE takes a long time to process, I'll use only 3000 rows.
(x_train, y_train), (_ , _) = mnist.load_data()
x_train = x_train[:3000]
y_train = y_train[:3000]
print(x_train.shape)
(3000, 28, 28)
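If Keras is not available, the same digits can also be fetched with scikit-learn's fetch_openml(); a rough sketch, assuming an internet connection (the images come back already flattened to 784 features and the labels as strings):
from sklearn.datasets import fetch_openml

mnist_data = fetch_openml("mnist_784", version=1, as_frame=False)
x_alt = mnist_data.data[:3000]                # already flattened, shape (3000, 784)
y_alt = mnist_data.target[:3000].astype(int)  # convert string labels to integers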
Since MNIST is a three-dimensional array, we'll reshape it into a two-dimensional one.
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
(3000, 784)
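The same flattening can be written more compactly with numpy's -1 placeholder, which infers the feature dimension automatically; an equivalent alternative:
x_mnist = x_train.reshape(x_train.shape[0], -1)   # -1 infers 28*28 = 784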
Here, we have data with 784 features. Now, we'll project it into two dimensions with TSNE and visualize it in a plot.
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data T-SNE projection")
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 3000 samples in 0.922s...
[t-SNE] Computed neighbors for 3000 samples in 10.601s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3000
[t-SNE] Computed conditional probabilities for sample 2000 / 3000
[t-SNE] Computed conditional probabilities for sample 3000 / 3000
[t-SNE] Mean sigma: 480.474473
[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.815109
[t-SNE] KL divergence after 1000 iterations: 1.261612
The plot shows a two-dimensional visualization of the MNIST data. The colors indicate the target digits and show where their feature data falls in the 2D space.
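As the log shows, most of the time goes into the neighbor search over the 784 raw features. A common way to speed this up is to compress the data with PCA first and run TSNE on the reduced representation; a rough sketch (50 components is just an illustrative choice):
from sklearn.decomposition import PCA

x_pca = PCA(n_components=50, random_state=123).fit_transform(x_mnist)  # 784 -> 50 features
z_pca = TSNE(n_components=2, verbose=1, random_state=123).fit_transform(x_pca)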
In this tutorial, we've briefly learned how to fit and visualize data with TSNE in Python. The full source code is listed below.
Source code listing
from sklearn.manifold import TSNE
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
iris = load_iris()
x = iris.data
y = iris.target
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x)
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data T-SNE projection")
(x_train, y_train), (_ , _) = mnist.load_data()
x_train = x_train[:3000]
y_train = y_train[:3000]
print(x_train.shape)
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data T-SNE projection")