Sentiment Classification Example With Gaussian Naive Bayes in Python

   In this tutorial, we'll learn how to classify text data into positive and negative sentiments in Python. We'll use the CountVectorizer class to turn the texts into count vectors and apply the Gaussian Naive Bayes method to classify the data. Both classes are available in the scikit-learn library. The post covers:
  • Preparing data
  • Vectorizing texts
  • Training the model and predicting the test data
  • Source code listing
We'll start by loading the required libraries.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix


Preparing data

   For this tutorial, I collected a simple sentiment dataset of imaginary opinions, where a positive opinion is labeled '1' and a negative opinion '0'. Below is a sample of the sentiment training data.

1,"I like it "
1,"like it a lot "
1,"It's really good "
1,"Recommend! I really enjoyed! "
1,"It's really good "
1,"recommend too "
1,"outstanding performance "
...
0,"it's mediocre! not recommend "
0,"Not good at all! "
0,"It is rude "
0,"I don't like this type "
0,"poor performance "
0,"Boring, not good at all! "
0,"not liked "
0,"I hate this type of things "
...


You can find the full sentiment data in the listing at the end of this post. Copy the text and save it as sentiments.csv in your target folder.

Next, we'll load the sentiments.csv data and separate it into x and y parts.

df = pd.read_csv('datasets/sentiments.csv', header=None)  # the file has no header row
df.columns = ["label","text"]
x = df['text'].values
y = df['label'].values
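
If you want to sanity-check the loaded data before going further, a quick, optional look at the first rows and the class balance could be done as follows (not part of the original walkthrough):

print(df.head())                    # first rows: label and text columns
print(df['label'].value_counts())   # how many positive (1) and negative (0) samples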

To train the model and to evaluate it on unseen data, we'll split the dataset into train and test parts.

x_train, x_test, y_train, y_test = \
 train_test_split(x, y, test_size=0.12, random_state=121)


Vectorizing texts

The CountVectorizer class builds a vocabulary from the text data and turns each document into a vector of word counts. We'll fit it on the training texts and transform both the train and test texts into count matrices.

vectorizer = CountVectorizer()
vectorizer.fit(x_train)
Xtrain = vectorizer.transform(x_train)
Xtest = vectorizer.transform(x_test)
print(Xtrain.shape)
(42, 67)
print(Xtest.shape)
(6, 67) 
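
If you'd like to see which tokens make up the 67 columns, the fitted vectorizer exposes its vocabulary. A small optional sketch, assuming a recent scikit-learn release where get_feature_names_out() is available:

print(len(vectorizer.vocabulary_))              # number of distinct tokens, matches the column count above
print(vectorizer.get_feature_names_out()[:10])  # first few tokens of the learned vocabulary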


Training the model and predicting the test data

Next, we'll build the Gaussian Naive Bayes model and train it on the training data. Note that GaussianNB expects dense input, so the sparse count matrix is converted with .toarray().

model = GaussianNB().fit(Xtrain.toarray(), y_train)
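
As an aside, MultinomialNB is the Naive Bayes variant designed for count features and accepts the sparse matrix directly; the sketch below is only an alternative for comparison, not part of the original workflow.

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB().fit(Xtrain, y_train)  # sparse input works here, no toarray() needed
print(mnb.score(Xtest, y_test))             # mean accuracy on the test texts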

Finally, we'll predict the test data and check the accuracy and the confusion matrix (rows are the true labels, columns the predicted labels).

ypred = model.predict(Xtest.toarray())
accuracy = accuracy_score(y_test, ypred)
cm = confusion_matrix(y_test, ypred)

print("Accuracy: ", accuracy)
Accuracy:  0.8333333333333334
print("Confusion matrix:")
print(cm)
Confusion matrix:
[[2 1]
 [0 3]] 
 
result = zip(x_test, y_test, ypred)
for i in result:
 print(i)
 
("it's good! recommend! ", 1, 1)
('This is truly a good one! ', 1, 1)
('It is rude ', 0, 1)
('Nasty and horrible! ', 0, 0)
('waste of time, poor show ', 0, 0)
('exciting show ', 1, 1) 
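
With the vectorizer and the model fitted, new sentences can be classified the same way. The snippet below is a small illustration with made-up example sentences (hypothetical, not part of the original data):

new_texts = ["really enjoyed, highly recommend", "boring and poor"]  # hypothetical examples
new_counts = vectorizer.transform(new_texts).toarray()               # reuse the fitted vocabulary
print(model.predict(new_counts))                                     # predicted labels: 1 = positive, 0 = negative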


   In this post, we've briefly learned how to perform sentiment classification in Python. Although the accuracy reached 83 percent, the model would need a larger training dataset to improve its predictions.
The full source code is listed below.


Source code listing

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix

df = pd.read_csv('datasets/sentiments.csv', header=None)
df.columns = ["label","text"]
x = df['text'].values
y = df['label'].values

x_train, x_test, y_train, y_test = \
 train_test_split(x, y, test_size=0.12, random_state=121)

vectorizer = CountVectorizer()
vectorizer.fit(x_train)
Xtrain = vectorizer.transform(x_train)
Xtest = vectorizer.transform(x_test)
print(Xtrain.shape)
print(Xtest.shape)

model = GaussianNB().fit(Xtrain.toarray(), y_train)

ypred = model.predict(Xtest.toarray())
accuracy = accuracy_score(y_test, ypred)
cm = confusion_matrix(y_test, ypred)

print("Accuracy: ", accuracy)
print("Confusion matrix:")
print(cm)

result = zip(x_test, y_test, ypred)
for i in result:
 print(i)

sentiments.csv data

1,"I like it "
1,"like it a lot "
1,"It's really good "
1,"Recommend! I really enjoyed! "
1,"It's really good "
1,"recommend too "
1,"outstanding performance "
1,"it's good! recommend! "
1,"Great! "
1,"really good. Definitely, recommend! "
1,"It is fun "
1,"Exceptional! liked a lot! "
1,"highly recommend this "
1,"fantastic show "
1,"exciting, liked. "
1,"it's ok "
1,"exciting show "
1,"amazing performance "
1,"it is great! "
1,"I am excited a lot "
1,"it is terrific "
1,"Definitely good one "
1,"Excellent, very satisfied "
1,"Glad we went "
1,"Once again outstanding! "
1,"awesome! excellent show "
1,"This is truly a good one! "
0,"it's mediocre! not recommend "
0,"Not good at all! "
0,"It is rude "
0,"I don't like this type "
0,"poor performance "
0,"Boring, not good at all! "
0,"not liked "
0,"I hate this type of things "
0,"not recommend, not satisfied "
0,"not enjoyed, I don't recommend this. "
0,"disgusting movie "
0,"waste of time, poor show "
0,"feel tired after watching this "
0,"horrible performance "
0,"not so good "
0,"so boring I fell asleep "
0,"a bit strange "
0,"terrible! I did not expect. "
0,"This is an awful "
0,"Nasty and horrible! "
0,"Offensive, it is a crap! "
0,"Disappointing! not liked. "
