In this tutorial, we'll briefly learn how to fit and predict regression data using LightGBM in R. The tutorial covers:

- Preparing the data
- Fitting the model and prediction
- Accuracy checking
- Source code listing

We'll start by installing the LightGBM R package and loading the other required libraries.

```R
install.packages("lightgbm")

library(lightgbm)
library(caret)
library(ggplot2)
```

**Preparing the data**

We use the Boston housing-price dataset as the target regression data in this
tutorial. After loading the dataset, we'll split it into train and test parts
and extract the x (input) and y (label) parts. Here, we'll hold out 15 percent
of the dataset as test data. Scaling the x part of the data can improve the accuracy.

```R
boston = MASS::Boston
str(boston)
dim(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = scale(train[, -14])[,]
train_y = train[, 14]

test_x = scale(test[, -14])[,]
test_y = test[, 14]
```
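As a quick sanity check on the split, you can compare the row counts of the two parts (the exact numbers depend on how `createDataPartition` rounds the 85/15 split, so no exact output is shown here):

```R
# check the sizes of the split; the Boston dataset has 506 rows in total
dim(train)   # roughly 85% of the rows
dim(test)    # the remaining ~15%
```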

Next, we'll load the train and test data into LightGBM dataset objects. The code below shows how to load the train data and the evaluation (test) data.

```R
dtrain = lgb.Dataset(train_x, label = train_y)
dtest = lgb.Dataset.create.valid(dtrain, test_x, label = test_y)
```
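Note that `lgb.Dataset` accepts a matrix, not a data frame. The `scale()` call above already returned a matrix, so the code works as-is; if you skip scaling and keep the features as a data frame, convert them first. A minimal sketch:

```R
# lgb.Dataset requires a matrix; convert data-frame features with as.matrix()
dtrain = lgb.Dataset(as.matrix(train_x), label = train_y)
dtest = lgb.Dataset.create.valid(dtrain, as.matrix(test_x), label = test_y)
```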

**Building model and prediction**

First, we'll define the regression model parameters and the validation data as shown below. You can change the values according to your evaluation targets.

```R
# define parameters
params = list(
    objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = .3
)

# validation data
valids = list(test = dtest)
```

Now, we can train the model with the parameters defined above.

```R
# train the model
model = lgb.train(
    params = params
  , data = dtrain
  , nrounds = 5L
  , valids = valids
)
```


We can check the L2 values for the test dataset; the output shows the L2 value for each of the five training rounds. Since the L2 metric is the mean squared error, the final round's value matches the test MSE we compute below.

```R
lgb.get.eval.result(model, "test", "l2")
```

```
[1] 52.80909 37.78159 31.03807 27.43076 26.19789
```

Now, we can predict on the test x data with the trained model.

```R
# prediction
pred_y = predict(model, test_x)
```

**Accuracy checking**

We'll check the prediction accuracy with MSE, MAE, and RMSE metrics.

```R
# accuracy check
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)

cat("MSE: ", mse, "\nMAE: ", mae, "\nRMSE: ", rmse)
```

```
MSE:  26.19789 
MAE:  3.570257 
RMSE:  5.118387
```
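As a quick consistency check, RMSE is simply the square root of MSE, so the two printed values should agree:

```R
# RMSE should equal the square root of MSE (up to floating-point rounding)
all.equal(sqrt(mse), rmse)
```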


We can also visualize the original and predicted test data in a plot.

```R
# visualize the result in a plot
df = data.frame(test_y, pred_y)
df$id = 1:nrow(df)

ggplot() +
  geom_line(data = df, aes(x = id, y = test_y, color = 'test_y')) +
  geom_line(data = df, aes(x = id, y = pred_y, color = 'pred_y')) +
  ggtitle("Boston housing test data prediction") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab('medv')
```

Finally, we'll find the top 5 most important features of the training data and visualize them in a graph.

```R
# feature importance
tree_imp = lgb.importance(model, percentage = TRUE)
lgb.plot.importance(tree_imp, top_n = 5L, measure = "Gain")
```
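`lgb.importance()` returns a table with Gain, Cover, and Frequency columns, so you can also inspect the ranking directly instead of plotting it:

```R
# inspect the top rows of the feature-importance table
head(tree_imp, 5L)
```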

In this tutorial, we've briefly learned how to fit and predict regression data with the LightGBM method in R. The full source code is listed below.

**Source code listing**

```R
library(lightgbm)
library(caret)
library(ggplot2)

boston = MASS::Boston
dim(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = scale(train[, -14])[,]
train_y = train[, 14]

test_x = scale(test[, -14])[,]
test_y = test[, 14]

dtrain = lgb.Dataset(train_x, label = train_y)
dtest = lgb.Dataset.create.valid(dtrain, test_x, label = test_y)

# define parameters
params = list(
    objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = .3
)

# validation data
valids = list(test = dtest)

# train the model
model = lgb.train(
    params = params
  , data = dtrain
  , nrounds = 5L
  , valids = valids
)

# get L2 values for the "test" dataset
lgb.get.eval.result(model, "test", "l2")

# prediction and accuracy check
pred_y = predict(model, test_x)
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)

cat("MSE: ", mse, "\nMAE: ", mae, "\nRMSE: ", rmse)

# visualize the result in a plot
df = data.frame(test_y, pred_y)
df$id = 1:nrow(df)

ggplot() +
  geom_line(data = df, aes(x = id, y = test_y, color = 'test_y')) +
  geom_line(data = df, aes(x = id, y = pred_y, color = 'pred_y')) +
  ggtitle("Boston housing test data prediction") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab('medv')

# feature importance
tree_imp = lgb.importance(model, percentage = TRUE)
lgb.plot.importance(tree_imp, top_n = 5L, measure = "Gain")
```
