Text Classification Example with Keras LSTM in Python

   LSTM (Long Short-Term Memory) is a type of recurrent neural network used to learn from sequence data in deep learning. In this post, we'll learn how to apply an LSTM to a binary text classification problem. The post covers:
  1. Preparing data
  2. Defining the LSTM model
  3. Predicting test data
We'll start by loading required libraries.


from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd


Preparing data

   Here, I prepared a simple sentiment dataset for this tutorial. The data contains imaginary opinions, with each positive opinion labeled '1' and each negative opinion labeled '0'. Below is a sample of the sentiment training data. You can find the full list of the sentiment data in this link and save it as a sentiments.csv file in your target folder.

1,"I like it "
1,"like it a lot "
1,"It's really good "
1,"Recommend! I really enjoyed! "
1,"It's really good "
1,"recommend too "
1,"outstanding performance "
...
0,"it's mediocre! not recommend "
0,"Not good at all! "
0,"It is rude "
0,"I don't like this type "
0,"poor performance "
0,"Boring, not good at all! "
0,"not liked "
0,"I hate this type of things "
...


We'll load the text data and split it into training and test parts.

df = pd.read_csv('datasets/sentiments.csv', header=None)  # the file has no header row
df.columns = ["label", "text"]
x = df['text'].values
y = df['label'].values

x_train, x_test, y_train, y_test = \
 train_test_split(x, y, test_size=0.1, random_state=123)

Next, we'll convert the text data into sequences of integer tokens and compute the vocabulary size, which the Embedding layer will need later.

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain = tokenizer.texts_to_sequences(x_train)
xtest = tokenizer.texts_to_sequences(x_test)

vocab_size = len(tokenizer.word_index) + 1  # +1 for the reserved padding index 0
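
To inspect what the tokenizer learned, you can print part of its word-to-index mapping (the exact indices depend on the data, so treat the result as illustrative):

# Peek at a few entries of the fitted word index; lower indices mean more frequent words.
print(list(tokenizer.word_index.items())[:5])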

We'll pad the sequences with zeros so that every vector has the same fixed length; sequences longer than maxlen are truncated.

maxlen = 10
xtrain = pad_sequences(xtrain, padding='post', maxlen=maxlen)
xtest = pad_sequences(xtest, padding='post', maxlen=maxlen)

print(x_train[3])
Excellent, very satisfied

print(xtrain[3])
[23 45 24  0  0  0  0  0  0  0] 


Defining the LSTM model

We apply an Embedding layer to the input data before adding the LSTM layers to the Keras Sequential model. Note that input_dim uses the vocab_size we computed during tokenization. The model definition goes as follows.

embedding_dim = 50
model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.LSTM(units=50, return_sequences=True))
model.add(layers.LSTM(units=10))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(8))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_24 (Embedding)     (None, 10, 50)            4450      
_________________________________________________________________
lstm_40 (LSTM)               (None, 10, 50)            20200     
_________________________________________________________________
lstm_41 (LSTM)               (None, 10)                2440      
_________________________________________________________________
dropout_16 (Dropout)         (None, 10)                0         
_________________________________________________________________
dense_65 (Dense)             (None, 8)                 88        
_________________________________________________________________
dense_66 (Dense)             (None, 1)                 9         
=================================================================
Total params: 27,187
Trainable params: 27,187
Non-trainable params: 0
_________________________________________________________________ 
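
As a sanity check, the parameter counts in the summary can be reproduced by hand. The Embedding table has vocab_size * embedding_dim weights (4,450 here implies a vocabulary of 89 words in this run), and each LSTM layer has 4 * ((input_dim + units + 1) * units) parameters for its four gates. A quick verification:

embedding_params = 89 * 50                # vocab_size * embedding_dim = 4450
lstm1_params = 4 * ((50 + 50 + 1) * 50)   # 4 * ((input_dim + units + 1) * units) = 20200
lstm2_params = 4 * ((50 + 10 + 1) * 10)   # second LSTM sees the first one's 50 units = 2440
dense1_params = 10 * 8 + 8                # weights plus biases = 88
dense2_params = 8 * 1 + 1                 # = 9
print(embedding_params + lstm1_params + lstm2_params
      + dense1_params + dense2_params)    # 27187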


Next, we'll train the model and check the accuracy on the training and test data.

model.fit(xtrain, y_train, epochs=20, batch_size=16, verbose=False)

loss, acc = model.evaluate(xtrain, y_train, verbose=False)
print("Training Accuracy: ", round(acc, 2))
Training Accuracy:  1.0 

loss, acc = model.evaluate(xtest, y_test, verbose=False)
print("Test Accuracy: ", round(acc, 2))
Test Accuracy:  1.0 


Predicting test data

Finally, we can predict the test data and check the prediction results with a confusion matrix.

ypred = model.predict(xtest)

ypred[ypred > 0.5] = 1
ypred[ypred <= 0.5] = 0
cm = confusion_matrix(y_test, ypred)
print(cm)
[[2 0]
 [0 4]] 
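
Beyond the raw confusion matrix, scikit-learn's classification_report prints per-class precision, recall, and F1; an optional quick check:

from sklearn.metrics import classification_report
print(classification_report(y_test, ypred))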
 
result = zip(x_test, y_test, ypred)
for i in result:
    print(i)
 
('I am excited a lot ', 1, array([1.], dtype=float32))
('exciting, liked. ', 1, array([1.], dtype=float32))
('terrible! I did not expect. ', 0, array([0.], dtype=float32))
('What a nice restaurant.', 1, array([1.], dtype=float32))
('not recommend, not satisfied ', 0, array([0.], dtype=float32))
('What a nice show.', 1, array([1.], dtype=float32)) 
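
To classify new, unseen sentences, run them through the same tokenizer and padding before calling predict. A minimal sketch, assuming the tokenizer and model above are still in scope (the example sentences are made up):

# Hypothetical new inputs; preprocessing must match the training pipeline.
new_texts = ["really enjoyed it, recommend", "boring, not good at all"]
new_seq = pad_sequences(tokenizer.texts_to_sequences(new_texts),
                        padding='post', maxlen=maxlen)
print(model.predict(new_seq))  # values above 0.5 indicate a positive prediction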


   In this post, we've briefly learned how to implement an LSTM for binary classification of text data with Keras. The full source code is listed below.

 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd

df = pd.read_csv('datasets/sentiments.csv', header=None)  # the file has no header row
df.columns = ["label", "text"]
x = df['text'].values
y = df['label'].values

x_train, x_test, y_train, y_test = \
 train_test_split(x, y, test_size=0.1, random_state=123)

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain = tokenizer.texts_to_sequences(x_train)
xtest = tokenizer.texts_to_sequences(x_test)

vocab_size = len(tokenizer.word_index) + 1  # +1 for the reserved padding index 0

maxlen = 10
xtrain = pad_sequences(xtrain, padding='post', maxlen=maxlen)
xtest = pad_sequences(xtest, padding='post', maxlen=maxlen)
 
print(x_train[3])
print(xtrain[3])
 

embedding_dim = 50
model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.LSTM(units=50, return_sequences=True))
model.add(layers.LSTM(units=10))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(8))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.summary()
model.fit(xtrain, y_train, epochs=20, batch_size=16, verbose=False)

loss, acc = model.evaluate(xtrain, y_train, verbose=False)
print("Training Accuracy: ", round(acc, 2))
loss, acc = model.evaluate(xtest, y_test, verbose=False)
print("Test Accuracy: ", round(acc, 2))

ypred = model.predict(xtest)

ypred[ypred > 0.5] = 1
ypred[ypred <= 0.5] = 0
cm = confusion_matrix(y_test, ypred)
print(cm)

result = zip(x_test, y_test, ypred)
for i in result:
    print(i)
 

