Sentiment Analysis Example with the Naive Bayes Method in R

   Sentiment analysis is the task of classifying the views expressed in text such as opinions, reviews, and survey responses, using text analysis and natural language processing (NLP) algorithms. In this post, we'll briefly learn how to classify the opinions in a dataset with the Naive Bayes method in R, focusing on the polarity (positive or negative) of users' opinions.
   The tutorial covers:
  • Creating sample data
  • Preparing document matrix
  • Defining the model
  • Prediction and accuracy check
  • Source code listing
  We use the 'RTextTools' package to create a document matrix, the 'e1071' package to build a Naive Bayes model, and the 'caret' package for the accuracy check. We'll start by loading those packages.

library(RTextTools)
library(e1071)
library(caret)

set.seed(12345)  # fix the random seed so the sampled train/test split is repeatable


Creating sample data

   First, we'll generate sample sentences to create a training dataset for this tutorial. The sentences in the dataset are made-up opinions; you can add or use other sentences as input data. Our task is to find out whether each opinion is positive or negative.

sentPositive = c(
  "I like it", "like it a lot", "It's really good",
  "recommend!", "Enjoyed!", "like it",
  "It's really good", "recommend too",
  "outstanding", "good", "recommend!",
  "like it a lot", "really good", 
  "Definitely recommend!", "It is fun",
  "liked!", "highly recommend this",
  "fantastic show", "exciting",
  "Very good", "it's ok",
  "exciting show", "amazing performance",
  "it is great!","I am excited a lot",
  "it is terrific", "Definitely good one",
  "very satisfied", "Glad we went",
  "Once again outstanding!", "awesome"
)
 
sentNegative = c(
  "Not good at all!", "rude",
  "It is rude", "I don't like this type",
  "poor", "Boring", "Not good!",
  "not liked", "I hate this type of",
  "not recommend", "not satisfied",
  "not enjoyed", "Not recommend this.",
  "disgusting movie","waste of time",
  "feel tired after watching this",
  "horrible performance", "not so good",
  "so boring I fell asleep", "poor show",
  "a bit strange","terrible"
)
 
df = data.frame(sentiment = "positive", text = sentPositive)
df = rbind(df, data.frame(sentiment = "negative", text = sentNegative))

Next, we'll split the df data into train and test parts, sampling about 90% of the rows for training and keeping the rest for testing.

index = sample(1:nrow(df), size = .9 * nrow(df))
train = df[index, ]
test = df[-index, ]
 
head(train)
   sentiment                  text
8   positive         recommend too
13  positive           really good
24  positive          it is great!
14  positive Definitely recommend!
43  negative           not enjoyed
3   positive      It's really good
 
head(test)
   sentiment          text
5   positive      Enjoyed!
20  positive     Very good
21  positive       it's ok
38  negative     Not good!
51  negative     poor show
52  negative a bit strange
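
Because .9 * nrow(df) is not a whole number here (53 sentences give 47.7), it can help to make the rounding explicit and to check how many sentences of each class land in each part. The following is a minimal base-R sketch on a stand-in data frame with the same class sizes; the names toy, n_train, idx and the seed value are illustrative assumptions, not part of the original code.

```r
# Stand-in data with the same class sizes as the tutorial (31 positive, 22 negative)
toy <- data.frame(
  sentiment = c(rep("positive", 31), rep("negative", 22)),
  text = paste("sentence", 1:53)
)

set.seed(1)                          # any fixed seed makes the split repeatable
n_train <- floor(0.9 * nrow(toy))    # 47.7 rounded down to 47
idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class counts in each part; a very small test set can easily be unbalanced
print(table(toy_train$sentiment))
print(table(toy_test$sentiment))
```

With only six rows left for testing, a single misclassified sentence moves the test accuracy by almost 17 percentage points, so the class counts are worth a glance before reading the results.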


Preparing document matrix

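   Next, we'll create matrix data from the text of the train and test data with the create_matrix() function of the RTextTools package. RTextTools is a package for text classification, and create_matrix() builds a document-term matrix: one row per document, one column per term, with each cell holding how often the term occurs in that document.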

mTrain = create_matrix(train[,2], language="english", 
                      removeStopwords=FALSE, removeNumbers=TRUE, 
                      stemWords=FALSE) 
matTrain = as.matrix(mTrain)

mTest = create_matrix(test[,2], language="english", 
                      removeStopwords=FALSE, removeNumbers=TRUE, 
                      stemWords=FALSE) 
matTest = as.matrix(mTest)
 
print(matTest)
               Terms
Docs            bit enjoyed good not poor show strange very
  Enjoyed!        0       1    0   0    0    0       0    0
  Very good       0       0    1   0    0    0       0    1
  it's ok         0       0    0   0    0    0       0    0
  Not good!       0       0    1   1    0    0       0    0
  poor show       0       0    0   0    1    1       0    0
  a bit strange   1       0    0   0    0    0       1    0
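
Note that matTest above contains only the terms that occur in the test sentences, so its columns differ from those of matTrain. One way to give both matrices the same columns is to rebuild the test matrix on the training vocabulary, dropping unseen terms and filling absent ones with zeros. Below is a minimal base-R sketch with small hand-made matrices; align_to_vocab is a hypothetical helper written for this illustration, not an RTextTools function.

```r
# Hypothetical helper: re-express a document-term matrix on a given vocabulary
align_to_vocab <- function(mat, vocab) {
  # Start from all zeros: one row per document, one column per vocabulary term
  out <- matrix(0, nrow = nrow(mat), ncol = length(vocab),
                dimnames = list(rownames(mat), vocab))
  # Copy over the counts for terms present in both matrices
  shared <- intersect(colnames(mat), vocab)
  out[, shared] <- mat[, shared, drop = FALSE]
  out
}

# Toy "training" matrix with terms good, not, really
toy_train <- matrix(c(1, 0, 1,
                      1, 1, 0), nrow = 2, byrow = TRUE,
                    dimnames = list(c("really good", "not good"),
                                    c("good", "not", "really")))

# Toy "test" matrix with a partly different vocabulary
toy_test <- matrix(c(0, 1, 0, 1,
                     1, 0, 1, 0), nrow = 2, byrow = TRUE,
                   dimnames = list(c("good show", "not bad"),
                                   c("bad", "good", "not", "show")))

aligned <- align_to_vocab(toy_test, colnames(toy_train))
print(aligned)
```

How predict() treats missing or extra columns depends on the implementation, so aligning the columns up front makes the model input explicit.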


Defining the model

   We'll create the classifier model with the Naive Bayes algorithm. To fit the model, we need the document-term matrix and the target labels.

labelTrain = as.factor(train[,1])
labelTest = as.factor(test[,1])
 
model = naiveBayes(matTrain, labelTrain)
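
For numeric predictors such as term counts, the e1071 documentation states that naiveBayes() assumes a Gaussian distribution within each class. The sketch below works through that kind of calculation by hand on a tiny made-up data set. The data and the names x, y, and gaussian_nb_posterior are illustrative assumptions, and the code is a simplified illustration rather than the package's implementation.

```r
# Tiny made-up term counts: three positive and three negative documents
x <- data.frame(
  good = c(2, 1, 3, 0, 1, 0),
  not  = c(0, 1, 0, 2, 1, 3)
)
y <- factor(c("positive", "positive", "positive",
              "negative", "negative", "negative"))

# Posterior class probabilities for one new document under Gaussian Naive Bayes
gaussian_nb_posterior <- function(x, y, new_row) {
  classes <- levels(y)
  log_score <- sapply(classes, function(cl) {
    rows  <- x[y == cl, , drop = FALSE]
    prior <- mean(y == cl)          # class prior from class frequencies
    mu    <- sapply(rows, mean)     # per-term mean within the class
    sigma <- sapply(rows, sd)       # per-term standard deviation within the class
    # log prior + sum of per-term log densities (the "naive" independence step)
    log(prior) + sum(dnorm(unlist(new_row), mean = mu, sd = sigma, log = TRUE))
  })
  # Turn log scores into probabilities that sum to 1
  p <- exp(log_score - max(log_score))
  p / sum(p)
}

post <- gaussian_nb_posterior(x, y, data.frame(good = 2, not = 0))
print(round(post, 3))
print(names(which.max(post)))
```

A term that never varies within a class has a standard deviation of zero, which this sketch does not handle; the toy data was chosen to avoid that case.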

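We first evaluate the fitted model on the training data. Note that caret's confusionMatrix() expects the predictions as its first argument and the true labels as its second. The call below passes them the other way round, so the rows labelled "Prediction" hold the true labels and the columns labelled "Reference" hold the predictions. Accuracy and Kappa are not affected by this, but Sensitivity and Specificity trade places with the positive and negative predictive values.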

pred = predict(model, matTrain) 
confusionMatrix(labelTrain, pred)
Confusion Matrix and Statistics

          Reference
Prediction positive negative
  positive       27        1
  negative        2       17
                                          
               Accuracy : 0.9362          
                 95% CI : (0.8246, 0.9866)
    No Information Rate : 0.617           
    P-Value [Acc > NIR] : 6.026e-07       
                                          
                  Kappa : 0.8664          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9310          
            Specificity : 0.9444          
         Pos Pred Value : 0.9643          
         Neg Pred Value : 0.8947          
             Prevalence : 0.6170          
         Detection Rate : 0.5745          
   Detection Prevalence : 0.5957          
      Balanced Accuracy : 0.9377          
                                          
       'Positive' Class : positive  
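
The main figures in this output can be recomputed directly from the four cell counts, which is a useful check on how to read the table. Here is a short base-R sketch using the counts exactly as printed; the variable names are illustrative.

```r
# Cell counts copied from the table above, rows and columns as printed
tab <- matrix(c(27,  1,
                 2, 17), nrow = 2, byrow = TRUE,
              dimnames = list(row = c("positive", "negative"),
                              col = c("positive", "negative")))

n <- sum(tab)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(tab)) / n

# Cohen's kappa: agreement beyond what the row and column totals alone would give
expected   <- sum(rowSums(tab) * colSums(tab)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

# Sensitivity and specificity as reported: taken down the printed columns
sensitivity <- tab["positive", "positive"] / sum(tab[, "positive"])
specificity <- tab["negative", "negative"] / sum(tab[, "negative"])

print(round(c(accuracy = accuracy, kappa = kappa_stat,
              sensitivity = sensitivity, specificity = specificity), 4))
```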


Prediction and accuracy check

Finally, we'll predict our test data with the fitted model and check the accuracy.

pred = predict(model, matTest)
data.frame(test,pred)
   sentiment          text     pred
5   positive      Enjoyed! positive
20  positive     Very good positive
21  positive       it's ok positive
38  negative     Not good! negative
51  negative     poor show positive
52  negative a bit strange positive
 
confusionMatrix(labelTest, pred)
Confusion Matrix and Statistics

          Reference
Prediction positive negative
  positive        3        0
  negative        2        1
                                          
               Accuracy : 0.6667          
                 95% CI : (0.2228, 0.9567)
    No Information Rate : 0.8333          
    P-Value [Acc > NIR] : 0.9377          
                                          
                  Kappa : 0.3333          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 0.6000          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.3333          
             Prevalence : 0.8333          
         Detection Rate : 0.5000          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.8000          
                                          
       'Positive' Class : positive  
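
With only six test sentences, the accuracy estimate is very uncertain, which is what the wide confidence interval reflects. According to the caret documentation, the interval comes from binom.test(), so it can be reproduced in base R from the counts in the table above. The variable names below are illustrative.

```r
# Counts from the test-set table above: 3 + 1 on the diagonal out of 6 cases
correct <- 3 + 1
total   <- 6

# Exact binomial 95% confidence interval for the accuracy
ci <- as.numeric(binom.test(correct, total)$conf.int)
print(round(ci, 4))

# No Information Rate: share of the largest class in the "Reference" columns (5 of 6)
nir <- 5 / 6

# One-sided test of whether the accuracy exceeds the No Information Rate
p_value <- binom.test(correct, total, p = nir, alternative = "greater")$p.value
print(round(p_value, 4))
```

An interval this wide says that six sentences are too few to judge the classifier; a larger held-out set would be needed for a meaningful estimate.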


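   In this tutorial, we've briefly learned how to classify sentiment data with the Naive Bayes method in R. On this small sample, the model reaches about 94% accuracy on the training data and 67% (4 of 6) on the test data, so the result is best read as an illustration rather than a benchmark. The complete code is listed below.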


Source code listing

library(RTextTools)
library(e1071) 
library(caret) 
 
set.seed(12345)
 
sentPositive <- c(
  "I like it", "like it a lot", "It's really good",
  "recommend!", "Enjoyed!", "like it",
  "It's really good", "recommend too",
  "outstanding", "good", "recommend!",
  "like it a lot", "really good", 
  "Definitely recommend!", "It is fun",
  "liked!", "highly recommend this",
  "fantastic show", "exciting",
  "Very good", "it's ok",
  "exciting show", "amazing performance",
  "it is great!","I am excited a lot",
  "it is terrific", "Definitely good one",
  "very satisfied", "Glad we went",
  "Once again outstanding!", "awesome"
)
 
sentNegative <- c(
  "Not good at all!", "rude",
  "It is rude", "I don't like this type",
  "poor", "Boring", "Not good!",
  "not liked", "I hate this type of",
  "not recommend", "not satisfied",
  "not enjoyed", "Not recommend this.",
  "disgusting movie","waste of time",
  "feel tired after watching this",
  "horrible performance", "not so good",
  "so boring I fell asleep", "poor show",
  "a bit strange","terrible"
)
 
df = data.frame(sentiment="positive", text=sentPositive)
df = rbind(df, data.frame(sentiment="negative", text=sentNegative))
 
index = sample(1:nrow(df), size = .9 * nrow(df))

train = df[index, ]
test = df[-index, ]
 
head(train)
head(test)
 
mTrain = create_matrix(train[,2], language = "english", 
                      removeStopwords=FALSE, removeNumbers=TRUE, 
                      stemWords=FALSE) 
matTrain = as.matrix(mTrain)

mTest = create_matrix(test[,2], language = "english", 
                      removeStopwords=FALSE, removeNumbers=TRUE, 
                      stemWords=FALSE) 
matTest = as.matrix(mTest)


labelTrain = as.factor(train[,1])
labelTest = as.factor(test[,1])
 
model = naiveBayes(matTrain, labelTrain)

pred = predict(model, matTrain) 
confusionMatrix(labelTrain, pred)
 
pred = predict(model, matTest)
data.frame(test, pred)
 
confusionMatrix(labelTest, pred)
 
 
