In this tutorial, we'll briefly learn how to fit and predict regression data using LightGBM in R. The tutorial covers:

- Preparing the data
- Fitting the model and prediction
- Accuracy checking
- Source code listing

We'll start by installing the LightGBM R package and loading the other required libraries.

```R
install.packages("lightgbm")

library(lightgbm)
library(caret)
library(ggplot2)
```

**Preparing the data**

We use the Boston housing-price dataset as the target regression data in this
tutorial. After loading the dataset, we'll split it into train and test parts
and extract the x (input) and y (label) parts. Here, we'll hold out 15 percent
of the dataset as test data. Scaling the x part of the data can improve the accuracy.

```R
boston = MASS::Boston
str(boston)
dim(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = scale(train[, -14])[,]
train_y = train[, 14]

test_x = scale(test[, -14])[,]
test_y = test[, 14]
```
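As a quick sanity check on the split, you can compare the row counts of the two parts (the exact numbers depend on how `createDataPartition` rounds the 85/15 split, so no exact output is shown here):

```R
# check the sizes of the split; the Boston dataset has 506 rows in total
dim(train)   # roughly 85% of the rows
dim(test)    # the remaining ~15%
```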

Next, we'll load the train and test data into LightGBM dataset objects. The code below shows how to load the train data and the evaluation (test) data.

```R
dtrain = lgb.Dataset(train_x, label = train_y)
dtest = lgb.Dataset.create.valid(dtrain, test_x, label = test_y)
```
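Note that `lgb.Dataset` accepts a matrix, not a data frame. The `scale()` call above already returned a matrix, so the code works as-is; if you skip scaling and keep the features as a data frame, convert them first. A minimal sketch:

```R
# lgb.Dataset requires a matrix; convert data-frame features with as.matrix()
dtrain = lgb.Dataset(as.matrix(train_x), label = train_y)
dtest = lgb.Dataset.create.valid(dtrain, as.matrix(test_x), label = test_y)
```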

**Building model and prediction**

First, we'll define the regression model parameters and the validation data as shown below. You can change the values according to your evaluation targets.

```R
# define parameters
params = list(
    objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = .3
)

# validation data
valids = list(test = dtest)
```

Now, we can train the model with the parameters defined above.

```R
# train the model
model = lgb.train(
    params = params
  , data = dtrain
  , nrounds = 5L
  , valids = valids
)
```


We can check the L2 values for the test dataset; the output shows the L2 value for each of the five training rounds. Since the L2 metric is the mean squared error, the final round's value matches the test MSE we compute below.

```R
lgb.get.eval.result(model, "test", "l2")
```

```
[1] 52.80909 37.78159 31.03807 27.43076 26.19789
```

Now, we can predict on the test x data with the trained model.

```R
# prediction
pred_y = predict(model, test_x)
```

**Accuracy checking**

We'll check the prediction accuracy with MSE, MAE, and RMSE metrics.

```R
# accuracy check
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)

cat("MSE: ", mse, "\nMAE: ", mae, "\nRMSE: ", rmse)
```

```
MSE:  26.19789 
MAE:  3.570257 
RMSE:  5.118387
```
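As a quick consistency check, RMSE is simply the square root of MSE, so the two printed values should agree:

```R
# RMSE should equal the square root of MSE (up to floating-point rounding)
all.equal(sqrt(mse), rmse)
```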


We can also visualize the original and predicted test data in a plot.

```R
# visualize the result in a plot
df = data.frame(test_y, pred_y)
df$id = 1:nrow(df)

ggplot() +
  geom_line(data = df, aes(x = id, y = test_y, color = 'test_y')) +
  geom_line(data = df, aes(x = id, y = pred_y, color = 'pred_y')) +
  ggtitle("Boston housing test data prediction") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab('medv')
```

Finally, we'll find the top 5 most important features of the training data and visualize them in a graph.

```R
# feature importance
tree_imp = lgb.importance(model, percentage = TRUE)
lgb.plot.importance(tree_imp, top_n = 5L, measure = "Gain")
```
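`lgb.importance()` returns a table with Gain, Cover, and Frequency columns, so you can also inspect the ranking directly instead of plotting it:

```R
# inspect the top rows of the feature-importance table
head(tree_imp, 5L)
```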

In this tutorial, we've briefly learned how to fit and predict regression data with the LightGBM method in R. The full source code is listed below.

**Source code listing**

```R
library(lightgbm)
library(caret)
library(ggplot2)

boston = MASS::Boston
dim(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = scale(train[, -14])[,]
train_y = train[, 14]

test_x = scale(test[, -14])[,]
test_y = test[, 14]

dtrain = lgb.Dataset(train_x, label = train_y)
dtest = lgb.Dataset.create.valid(dtrain, test_x, label = test_y)

# define parameters
params = list(
    objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = .3
)

# validation data
valids = list(test = dtest)

# train the model
model = lgb.train(
    params = params
  , data = dtrain
  , nrounds = 5L
  , valids = valids
)

# get L2 values for the "test" dataset
lgb.get.eval.result(model, "test", "l2")

# prediction and accuracy check
pred_y = predict(model, test_x)
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)

cat("MSE: ", mse, "\nMAE: ", mae, "\nRMSE: ", rmse)

# visualize the result in a plot
df = data.frame(test_y, pred_y)
df$id = 1:nrow(df)

ggplot() +
  geom_line(data = df, aes(x = id, y = test_y, color = 'test_y')) +
  geom_line(data = df, aes(x = id, y = pred_y, color = 'pred_y')) +
  ggtitle("Boston housing test data prediction") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab('medv')

# feature importance
tree_imp = lgb.importance(model, percentage = TRUE)
lgb.plot.importance(tree_imp, top_n = 5L, measure = "Gain")
```
