Classification Example with LogitBoost Method in R

   The LogitBoost algorithm, introduced by Friedman, Hastie, and Tibshirani, is a boosting method that replaces AdaBoost's exponential loss with the logistic (binomial log-likelihood) loss. Because the exponential loss grows exponentially with the classification error, AdaBoost is prone to overfitting on noisy data and outliers, which degrades the accuracy of the boosted model.
   With the logistic loss, the penalty for misclassified points grows roughly linearly rather than exponentially, which improves the model's accuracy and makes it less sensitive to noise in the data.
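As a quick standalone illustration (a sketch for intuition, not part of caTools), we can compare the two loss functions at a few margin values. The exponential loss blows up on badly misclassified points, while the logistic loss grows only about linearly:

```r
# Compare AdaBoost's exponential loss with LogitBoost's logistic loss
# at several margins m = y * f(x); negative margins are misclassifications.
margin   = c(2, 0, -2, -4)
exp_loss = exp(-margin)             # exponential loss: penalty explodes
log_loss = log(1 + exp(-margin))    # logistic loss: penalty grows ~linearly
round(data.frame(margin, exp_loss, log_loss), 3)
```

At a margin of -4, the exponential loss is already about 54.6 while the logistic loss is only about 4.0, which is why a single noisy outlier dominates AdaBoost's fit much more than LogitBoost's.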
   In this post, we'll learn how to classify data with the LogitBoost() function in R, provided by the 'caTools' package. The tutorial covers:
  1. Preparing the data
  2. Fitting the model and prediction
  3. Source code listing
We'll start by loading the required packages for this tutorial. You may need to install them if they are not available in your environment.

library(caTools)
library(caret)


Preparing the data 
 
   We'll use the 'iris' dataset as the target classification dataset in this tutorial. First, we'll load it and split it into train and test parts.

data("iris")
set.seed(123)
 
indexes = createDataPartition(iris$Species, p = .9, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]

Next, we'll separate the x (input) and y (label) parts of the train and test data. Here, column 5 ('Species') holds the class labels in the iris dataset.

xtrain = train[, -5]
ytrain = train[, 5]
xtest = test[, -5]
ytest = test[, 5]


Fitting the model and prediction 

   Next, we'll define the model and fit it on the train data. Here, nIter sets the number of boosting iterations.

logBoost = LogitBoost(xtrain, ytrain, nIter=50)
print(logBoost)

You can inspect the fitted model with the print command above.
Now, we can predict the test data with the trained model.

yhat = predict(logBoost, xtest)

Next, we'll check the prediction accuracy with caret's confusionMatrix() function, which takes the predicted labels first and the reference (true) labels second.

cm = confusionMatrix(yhat, ytest)
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         0
  virginica       0          0         5

Overall Statistics
                                    
               Accuracy : 1         
                 95% CI : (0.782, 1)
    No Information Rate : 0.3333    
    P-Value [Acc > NIR] : 6.969e-08 
                                    
                  Kappa : 1         
 Mcnemar's Test P-Value : NA        

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000 


   In this tutorial, we've briefly learned how to classify data with the LogitBoost function in R. The full source code is listed below.


Source code listing 

library(caTools)
library(caret)
 
data("iris")
set.seed(123)
 
indexes = createDataPartition(iris$Species, p = .9, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]
 
xtrain = train[, -5]
ytrain = train[, 5]
xtest = test[, -5]
ytest = test[, 5]
 
logBoost = LogitBoost(xtrain, ytrain, nIter=50)
print(logBoost) 
 
yhat = predict(logBoost, xtest)
 
cm = confusionMatrix(yhat, ytest)
print(cm)  


References for reading:

1. http://www.cis.upenn.edu/~mkearns/teaching/COLT/schapire.pdf
2. http://stat.ethz.ch/~dettling/boosting.html
