Word Embedding Example with Keras in Python

   A word embedding is a learned vector representation of text in which words with similar meanings get similar vectors. This kind of representation maps words into a lower-dimensional geometric space and makes it possible to capture their semantic relationships. Keras provides useful methods to implement word embeddings in neural network models. In this tutorial, we'll briefly learn how to apply a word embedding for binary classification of sentiment text data with a Keras neural network model. The post covers:
  1. Preparing the data
  2. Defining the Keras model
  3. Predicting test data
We'll start by loading the required libraries.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
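
If you are working with a recent TensorFlow release, the same classes live under the tensorflow.keras namespace. A minimal sketch of the equivalent imports, assuming a TensorFlow 2.x installation, would be:

# Equivalent imports for a TensorFlow 2.x installation (assumption: tensorflow is installed)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers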


Preparing the data

   I prepared a simple sentiment dataset for this tutorial. It contains short, made-up opinions, with positive opinions labeled '1' and negative opinions labeled '0'. Below is a sample of the training data. You can find the full list of the sentiment data in this link; save it as a sentiments.csv file in your target folder.

1,"I like it "
1,"like it a lot "
1,"It's really good "
1,"Recommend! I really enjoyed! "
1,"It's really good "
1,"recommend too "
1,"outstanding performance "
...
0,"it's mediocre! not recommend "
0,"Not good at all! "
0,"It is rude "
0,"I don't like this type "
0,"poor performance "
0,"Boring, not good at all! "
0,"not liked "
0,"I hate this type of things "
...
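
If you don't have the full file at hand, a small stand-in CSV in the same two-column, headerless format can be written with pandas. The rows below are only a hypothetical subset for illustration, not the full dataset.

import pandas as pd

# Hypothetical stand-in rows; use the full sentiments.csv for real results
# (assumes the datasets/ folder already exists)
sample = pd.DataFrame({
    "label": [1, 1, 1, 0, 0, 0],
    "text": ["I like it", "It's really good", "outstanding performance",
             "Not good at all!", "poor performance", "not liked"]
})
sample.to_csv('datasets/sentiments.csv', index=False, header=False)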


First, we'll load the text data and split it into train and test parts.

df = pd.read_csv('datasets/sentiments.csv', header=None)
df.columns = ["label","text"]
x = df['text'].values
y = df['label'].values

x_train, x_test, y_train, y_test = \
 train_test_split(x, y, test_size=0.1, random_state=123)

Next, we'll convert the text into integer sequences with the Tokenizer class. The tokenizer is fit on the whole corpus, and the train and test texts are then turned into sequences of word indices.

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain= tokenizer.texts_to_sequences(x_train)
xtest= tokenizer.texts_to_sequences(x_test)
print(xtest)
[[6, 42, 43, 1, 15], [21, 14], [76, 6, 77, 2, 78], [17, 1, 25, 53], [2, 5, 2, 24], [17, 1, 25, 12]] 
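
To see which integer stands for which word, we can inspect the tokenizer's word_index dictionary; a short check, printing only the first few entries, looks like this:

# Word-to-index mapping built by fit_on_texts
word_index = tokenizer.word_index
print(list(word_index.items())[:10])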

We'll apply padding so that every sequence gets the same fixed length, filling the shorter vectors with zeros at the end.

maxlen=20
xtrain=pad_sequences(xtrain,padding='post', maxlen=maxlen)
xtest=pad_sequences(xtest,padding='post', maxlen=maxlen)
print(xtest)
[[ 6 42 43  1 15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [21 14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [76  6 77  2 78  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [17  1 25 53  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  5  2 24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [17  1 25 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]] 
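
The maxlen value of 20 is simply a safe upper bound for these short sentences. If you prefer, it can be derived from the longest tokenized training sequence before padding, as in the sketch below.

# Run this before pad_sequences: length of the longest tokenized training sentence
maxlen = max(len(seq) for seq in xtrain)
print(maxlen)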


Defining the Keras model

Before creating the Keras model, we need to define the vocabulary size and the embedding dimension. The vocabulary size comes from the tokenizer's word index, plus one for the reserved padding index.

vocab_size=len(tokenizer.word_index)+1
embedding_dim=50

Next, we'll create a Keras Sequential model, add the Embedding layer and the remaining layers, and compile it.

model=Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(16,activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", 
     metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 20, 50)            4450      
_________________________________________________________________
flatten_28 (Flatten)         (None, 1000)              0         
_________________________________________________________________
dense_42 (Dense)             (None, 16)                16016     
_________________________________________________________________
dense_43 (Dense)             (None, 1)                 17        
=================================================================
Total params: 20,483
Trainable params: 20,483
Non-trainable params: 0
_________________________________________________________________
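
Flattening the 20x50 embedding output works well for such short, fixed-length inputs. A common alternative is to average the word vectors with a GlobalAveragePooling1D layer, which keeps the dense-layer parameter count independent of maxlen. A minimal sketch of that variant, assuming the same vocab_size and embedding_dim, is shown below.

# Variant: average the embedding vectors instead of flattening them
model2 = Sequential()
model2.add(layers.Embedding(input_dim=vocab_size,
                            output_dim=embedding_dim,
                            input_length=maxlen))
model2.add(layers.GlobalAveragePooling1D())
model2.add(layers.Dense(16, activation="relu"))
model2.add(layers.Dense(1, activation="sigmoid"))
model2.compile(optimizer="adam", loss="binary_crossentropy",
               metrics=['accuracy'])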


Finally, we'll train the model and check the training accuracy.

model.fit(xtrain,y_train, epochs=20, batch_size=16, verbose=False)
loss, acc = model.evaluate(xtrain, y_train, verbose=False)
print("Training Accuracy: ", acc.round(2))
Training Accuracy:  0.8 
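
If you want to monitor generalization while training, fit() can also hold out part of the training data with validation_split. The sketch below records per-epoch metrics in the returned history object; the metric key names differ between Keras versions ('acc' in older releases, 'accuracy' in newer ones), so we print the available keys first.

# Optional: hold out 10% of the training data for validation during fit
history = model.fit(xtrain, y_train, epochs=20, batch_size=16,
                    validation_split=0.1, verbose=False)
print(history.history.keys())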


Predicting test data

We can now predict the test data, threshold the predicted probabilities at 0.5, and check the result with a confusion matrix.

ypred=model.predict(xtest)

ypred[ypred>0.5]=1 
ypred[ypred<=0.5]=0 
cm = confusion_matrix(y_test, ypred)
print(cm)
[[2 0]
 [1 3]] 
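
Instead of overwriting the probabilities in place, we can also threshold a copy and get a fuller report with scikit-learn; a minimal sketch:

from sklearn.metrics import accuracy_score, classification_report

# Threshold the raw probabilities without modifying ypred in place
y_pred_label = (model.predict(xtest) > 0.5).astype(int).ravel()
print(accuracy_score(y_test, y_pred_label))
print(classification_report(y_test, y_pred_label))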


Finally, we'll print each test sentence together with its original label and predicted value.

result=zip(x_test, y_test, ypred)
for i in result:
 print(i)
 
('I am excited a lot ', 1, array([0.], dtype=float32))
('exciting, liked. ', 1, array([1.], dtype=float32))
('terrible! I did not expect. ', 0, array([0.], dtype=float32))
('What a nice restaurant.', 1, array([1.], dtype=float32))
('not recommend, not satisfied ', 0, array([0.], dtype=float32))
('What a nice show.', 1, array([1.], dtype=float32)) 
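
The trained Embedding layer also exposes the learned word vectors, which is what the word embedding actually is. The sketch below pulls the weight matrix and looks up the vector of a single word, assuming 'good' occurs in the vocabulary.

# The first layer's weight matrix has shape (vocab_size, embedding_dim)
embeddings = model.layers[0].get_weights()[0]
print(embeddings.shape)

# Look up the learned 50-dimensional vector for one word (assumption: 'good' is in the corpus)
idx = tokenizer.word_index.get('good')
if idx is not None:
    print(embeddings[idx][:5])   # first five components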

   In this post, we've briefly learned how to implement a word embedding for binary classification of text data with Keras.
   The full source code is listed below.

 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 
import pandas as pd

df = pd.read_csv('datasets/sentiments.csv', header=None)
df.columns = ["label","text"]
x = df['text'].values
y = df['label'].values

x_train, x_test, y_train, y_test = \
 train_test_split(x, y, test_size=0.1, random_state=123)

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain= tokenizer.texts_to_sequences(x_train)
xtest= tokenizer.texts_to_sequences(x_test)

vocab_size=len(tokenizer.word_index)+1
maxlen=20
 
xtrain=pad_sequences(xtrain,padding='post', maxlen=maxlen)
xtest=pad_sequences(xtest,padding='post', maxlen=maxlen)

embedding_dim=50
model=Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
      output_dim=embedding_dim,
      input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(16,activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", 
     metrics=['accuracy'])
model.summary()

model.fit(xtrain,y_train, epochs=20, batch_size=16, verbose=False)
loss, acc = model.evaluate(xtrain, y_train, verbose=False)
print("Training Accuracy: ", acc.round(2))

ypred=model.predict(xtest)

ypred[ypred>0.5]=1 
ypred[ypred<=0.5]=0 
cm = confusion_matrix(y_test, ypred)
print(cm)

result=zip(x_test, y_test, ypred)
for i in result:
 print(i)

