In this post, we'll learn how to classify data with the gbm() function from R's gbm (Generalized Boosted Regression Models) package. The package implements J. Friedman's gradient boosting machines and the AdaBoost algorithm. The tutorial covers:

- Preparing data
- Classification with gbm
- Classification with caret train method

We'll start by loading the required packages.

`library(gbm)`

`library(caret)`

**Preparing data**

We'll use the iris dataset as classification data and split it into train and test parts, holding out 10 percent of the rows as test data. Because the outcome is a factor, createDataPartition samples within each species, so the class balance is preserved in both parts.

`indexes = createDataPartition(iris$Species, p = .90, list = FALSE)`

`train = iris[indexes, ]`

`test = iris[-indexes, ]`
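To make the idea of a stratified split concrete, here is a minimal base-R sketch that draws 90 percent of the row indices within each species. It is only an illustration and not how createDataPartition is implemented; the names `train_demo` and `test_demo` and the seed value are assumptions made for this example.

```r
# Illustrative stratified 90/10 split in base R (not caret's implementation)
set.seed(42)  # arbitrary seed, assumed here only for reproducibility

# Group the row indices by class
idx_by_class = split(seq_len(nrow(iris)), iris$Species)

# Sample 90% of the indices within each class
train_idx = unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.9 * length(i)))
}))

train_demo = iris[train_idx, ]
test_demo = iris[-train_idx, ]

# Count the rows per species in each part
print(table(train_demo$Species))
print(table(test_demo$Species))
```

With 50 rows per species, each class contributes 45 rows to the train part and 5 rows to the test part.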

**Classification with gbm**

We'll define the gbm model and fit it on the train data, using a multinomial distribution, 10 cross-validation folds, and 200 trees.

`mod_gbm = gbm(Species ~ ., data = train, distribution = "multinomial", cv.folds = 10, shrinkage = .01, n.minobsinnode = 10, n.trees = 200)`

`print(mod_gbm)`

    gbm(formula = Species ~ ., distribution = "multinomial", data = train,
        n.trees = 200, n.minobsinnode = 10, shrinkage = 0.01, cv.folds = 10)
    A gradient boosted model with multinomial loss function.
    200 iterations were performed.
    The best cross-validation iteration was 200.
    There were 4 predictors of which 3 had non-zero influence.
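The printout reports a best cross-validation iteration equal to the number of trees, which may mean the error was still decreasing when boosting stopped. The sketch below shows one way to look at this with gbm.perf and to list the relative influence of each predictor with summary. To keep it self-contained it fits on the full iris data; `fit_demo`, the seed, `cv.folds = 5`, and `n.cores = 1` are assumptions chosen for this example, and newer gbm versions may print a warning about the multinomial distribution.

```r
library(gbm)

set.seed(123)  # arbitrary seed for this sketch

# Same settings as the tutorial model, but on the full iris data
# and with 5 folds on a single core (assumptions for this example)
fit_demo = gbm(Species ~ ., data = iris,
               distribution = "multinomial",
               cv.folds = 5, shrinkage = .01,
               n.minobsinnode = 10, n.trees = 200,
               n.cores = 1)

# Iteration with the lowest cross-validation error
best_iter = gbm.perf(fit_demo, method = "cv", plot.it = FALSE)
print(best_iter)

# Relative influence of each predictor at that iteration
influence = summary(fit_demo, n.trees = best_iter, plotit = FALSE)
print(influence)
```

If `best_iter` comes out at or near 200, raising `n.trees` or `shrinkage` is a reasonable next experiment.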

The model is ready, so we can predict the test data.

`pred = predict.gbm(object = mod_gbm, newdata = test, n.trees = 200, type = "response")`

The prediction holds a probability for each class rather than a label, so for every row we'll take the name of the class with the highest probability.

`labels = colnames(pred)[apply(pred, 1, which.max)]`

`result = data.frame(test$Species, labels)`
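The which.max step can be tried on a small hand-made probability table. The numbers below are invented for illustration, and the 3-D array mimics the shape I would expect predict.gbm to return for a multinomial model (rows, classes, one slice per n.trees value); that shape is an assumption worth confirming with `dim(pred)` on your own output.

```r
# Made-up class probabilities: 3 observations x 3 classes
probs = matrix(c(0.90, 0.05, 0.05,
                 0.10, 0.60, 0.30,
                 0.20, 0.35, 0.45),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("setosa", "versicolor", "virginica")))

# Index of the largest probability in each row, mapped to the class name
labels_demo = colnames(probs)[apply(probs, 1, which.max)]
print(labels_demo)

# The same expression on a 3-D array with a single third-dimension slice
probs3d = array(probs, dim = c(3, 3, 1),
                dimnames = list(NULL, colnames(probs), "200"))
labels_3d = colnames(probs3d)[apply(probs3d, 1, which.max)]
print(labels_3d)
```

Both calls return setosa, versicolor, virginica, because colnames reads the second dimension's names and apply over the first dimension hands which.max one observation at a time.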

`print(result)`

       test.Species     labels
    1        setosa     setosa
    2        setosa     setosa
    3        setosa     setosa
    4        setosa     setosa
    5        setosa     setosa
    6    versicolor versicolor
    7    versicolor versicolor
    8    versicolor versicolor
    9    versicolor  virginica
    10   versicolor versicolor
    11    virginica versicolor
    12    virginica  virginica
    13    virginica  virginica
    14    virginica  virginica
    15    virginica  virginica

Finally, we'll check the confusion matrix.

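`cm = confusionMatrix(test$Species, as.factor(labels))`

`print(cm)`

    Confusion Matrix and Statistics

                Reference
    Prediction   setosa versicolor virginica
      setosa          5          0         0
      versicolor      0          4         1
      virginica       0          1         4

    Overall Statistics

                   Accuracy : 0.8667
                     95% CI : (0.5954, 0.9834)
        No Information Rate : 0.3333
        P-Value [Acc > NIR] : 3.143e-05

                      Kappa : 0.8
     Mcnemar's Test P-Value : NA

    Statistics by Class:

                         Class: setosa Class: versicolor Class: virginica
    Sensitivity                 1.0000            0.8000           0.8000
    Specificity                 1.0000            0.9000           0.9000
    Pos Pred Value              1.0000            0.8000           0.8000
    Neg Pred Value              1.0000            0.9000           0.9000
    Prevalence                  0.3333            0.3333           0.3333
    Detection Rate              0.3333            0.2667           0.2667
    Detection Prevalence        0.3333            0.3333           0.3333
    Balanced Accuracy           1.0000            0.8500           0.8500

Note that confusionMatrix() takes the predictions as its first argument and the reference as its second. The call above passes them the other way around, so the rows labeled Prediction hold the actual species and the columns labeled Reference hold the model's labels. Accuracy and Kappa are unaffected, but the per-class statistics are computed relative to the predicted labels. The same applies to the caret example in the next section.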
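The headline numbers can be reproduced by hand from the 15 rows printed by `print(result)`. The sketch below retypes those rows into two factors and computes accuracy and Cohen's kappa in base R; the variable names are made up for this example.

```r
# Actual and predicted labels copied from the printed result table
actual = factor(rep(c("setosa", "versicolor", "virginica"), each = 5))
predicted = factor(c(rep("setosa", 5),
                     rep("versicolor", 3), "virginica", "versicolor",
                     "versicolor", rep("virginica", 4)),
                   levels = levels(actual))

# Cross-tabulate predictions against the actual classes
tab = table(Prediction = predicted, Reference = actual)
print(tab)

# Accuracy: share of cases on the diagonal
n = sum(tab)
accuracy = sum(diag(tab)) / n

# Kappa: accuracy corrected for the agreement expected by chance
expected = sum(rowSums(tab) * colSums(tab)) / n^2
kappa_demo = (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))
print(round(kappa_demo, 4))
```

Thirteen of the 15 cases sit on the diagonal, which gives the 0.8667 accuracy and 0.8 kappa reported by confusionMatrix.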

**Classification with caret train method**

The second approach is to fit the model with caret's train() function. train() takes a trainControl object that describes the resampling scheme, and we can define it as below.

`tc = trainControl(method = "repeatedcv", number = 10)`
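With method "repeatedcv", the number of repetitions is set through a separate `repeats` argument, which the call above leaves at its default. As a hedged illustration, the sketch below requests three repetitions explicitly and inspects the stored settings; `tc_demo` and the choice of three repeats are assumptions for this example.

```r
library(caret)

# Hypothetical alternative: 10-fold cross-validation repeated 3 times
tc_demo = trainControl(method = "repeatedcv", number = 10, repeats = 3)

# The control object is a list, so the settings can be read back directly
print(tc_demo$method)
print(tc_demo$number)
print(tc_demo$repeats)
```

More repeats give a steadier resampling estimate at the cost of proportionally longer training.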

Next, we'll define the model and train it with train data.

`model = train(Species ~., data=train, method="gbm", trControl=tc)`
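By default train() picks its own small grid of gbm settings to compare. If you want to control the candidates yourself, one option is to pass a `tuneGrid` built from caret's four gbm tuning parameters. This is a minimal sketch fitted on the full iris data so that it runs on its own; the grid values, the seed, the 5-fold control, and the names `grid_demo` and `model_demo` are assumptions for this example.

```r
library(caret)

set.seed(123)  # arbitrary seed for this sketch

# Candidate settings: 3 tree counts x 2 depths, other parameters fixed
grid_demo = expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = c(1, 2),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

# verbose = FALSE is passed through to gbm to silence the iteration log
model_demo = train(Species ~ ., data = iris,
                   method = "gbm",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid = grid_demo,
                   verbose = FALSE)

# The combination with the best resampled accuracy
print(model_demo$bestTune)
```

`model_demo$results` holds the resampled accuracy for every row of the grid if you want to compare the candidates side by side.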

We can predict test data with the fitted model.

`pred = predict(model, test)`

`result = data.frame(test$Species, pred)`

`print(result)`

       test.Species       pred
    1        setosa     setosa
    2        setosa     setosa
    3        setosa     setosa
    4        setosa     setosa
    5        setosa     setosa
    6    versicolor versicolor
    7    versicolor versicolor
    8    versicolor versicolor
    9    versicolor versicolor
    10   versicolor versicolor
    11    virginica  virginica
    12    virginica versicolor
    13    virginica  virginica
    14    virginica  virginica
    15    virginica  virginica

Finally, we'll check the confusion matrix.

`cm = confusionMatrix(test$Species, as.factor(pred))`

`print(cm)`

    Confusion Matrix and Statistics

                Reference
    Prediction   setosa versicolor virginica
      setosa          5          0         0
      versicolor      0          5         0
      virginica       0          1         4

    Overall Statistics

                   Accuracy : 0.9333
                     95% CI : (0.6805, 0.9983)
        No Information Rate : 0.4
        P-Value [Acc > NIR] : 2.523e-05

                      Kappa : 0.9
     Mcnemar's Test P-Value : NA

    Statistics by Class:

                         Class: setosa Class: versicolor Class: virginica
    Sensitivity                 1.0000            0.8333           1.0000
    Specificity                 1.0000            1.0000           0.9091
    Pos Pred Value              1.0000            1.0000           0.8000
    Neg Pred Value              1.0000            0.9000           1.0000
    Prevalence                  0.3333            0.4000           0.2667
    Detection Rate              0.3333            0.3333           0.2667
    Detection Prevalence        0.3333            0.3333           0.3333
    Balanced Accuracy           1.0000            0.9167           0.9545

In this tutorial, we've learned how to classify data with the gbm method in R. Thank you for reading!

The full source code is listed below.

```
library(gbm)
library(caret)

# Split iris into train (90%) and test (10%) parts
indexes = createDataPartition(iris$Species, p = .90, list = FALSE)
train = iris[indexes, ]
test = iris[-indexes, ]

# Fit the gbm model
mod_gbm = gbm(Species ~ .,
              data = train,
              distribution = "multinomial",
              cv.folds = 10,
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 200)
print(mod_gbm)

# Predict class probabilities for the test data
pred = predict.gbm(object = mod_gbm,
                   newdata = test,
                   n.trees = 200,
                   type = "response")

# Pick the class with the highest probability for each row
labels = colnames(pred)[apply(pred, 1, which.max)]
result = data.frame(test$Species, labels)
print(result)

cm = confusionMatrix(test$Species, as.factor(labels))
print(cm)

# caret train method
tc = trainControl(method = "repeatedcv", number = 10)
model = train(Species ~ ., data = train, method = "gbm", trControl = tc)
print(model)

pred = predict(model, test)
result = data.frame(test$Species, pred)
print(result)

cm = confusionMatrix(test$Species, as.factor(pred))
print(cm)
```
