The tutorial covers:

- Preparing data
- Defining and fitting the model
- Predicting and checking the results

    import xgboost as xgb
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.metrics import mean_squared_error
    import matplotlib.pyplot as plt
    import numpy as np

**Preparing data**

We use the Boston house-price dataset as the regression dataset in this tutorial. After loading the dataset, we'll first separate the data into x (features) and y (labels), then split them into train and test parts. Here, I'll hold out 15 percent of the dataset as test data. Note that `load_boston` was removed in scikit-learn 1.2, so the code below needs an earlier release.

    boston = load_boston()
    x, y = boston.data, boston.target
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
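Because `train_test_split` shuffles the rows randomly, each run gives a different split unless a seed is fixed. The following is a minimal sketch of a repeatable split. The synthetic data from `make_regression`, the variable names, and the seed value 42 are assumptions for illustration only; the shape (506 samples, 13 features) simply mirrors the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston data (506 x 13).
x_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=10.0, random_state=0
)

# Hold out 15 percent as test data; random_state fixes the shuffle.
xtr, xte, ytr, yte = train_test_split(
    x_demo, y_demo, test_size=0.15, random_state=42
)

print(xtr.shape, xte.shape)
```

With 506 samples, a 15 percent test share leaves 430 rows for training and 76 for testing.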

**Defining and fitting the model**

For the regression problem, we'll use the XGBRegressor class of the xgboost package, which we can define with its default parameters. You can also set new parameter values according to the characteristics of your data.

    xgbr = xgb.XGBRegressor()
    print(xgbr)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)

Next, we'll fit the model on the training data.

    xgbr.fit(xtrain, ytrain)

**Predicting and checking the results**

After training the model, we'll check its performance with cross-validation. For a regressor, `cross_val_score` reports the R² score by default.

    scores = cross_val_score(xgbr, xtrain, ytrain, cv=5)
    print("Mean cross-validation score: %.2f" % scores.mean())

Mean cross-validation score: 0.87

Cross-validation with an explicit k-fold splitter can be run as follows.

    kfold = KFold(n_splits=10, shuffle=True)
    kf_cv_scores = cross_val_score(xgbr, xtrain, ytrain, cv=kfold)
    print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

K-fold CV average score: 0.87

Both methods give an average score of about 0.87, meaning the model explains roughly 87% of the variance in the target.
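To see how the two calls relate: for a regressor, passing an integer such as `cv=5` makes scikit-learn use an unshuffled `KFold(n_splits=5)` internally, so the first call differs from the second only in the number of folds and in shuffling. The sketch below checks this and also shows how to request a different metric through `scoring`. The synthetic data and the use of `LinearRegression` as a fast, deterministic stand-in for the model are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data and a simple stand-in model (assumptions for this sketch).
x_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=5.0, random_state=0
)
model = LinearRegression()

# An integer cv and an unshuffled KFold with the same number of splits
# produce the same folds for a regressor.
scores_int = cross_val_score(model, x_demo, y_demo, cv=5)
scores_kfold = cross_val_score(model, x_demo, y_demo, cv=KFold(n_splits=5))

# The default score is R^2; other metrics are selected with `scoring`.
scores_mse = cross_val_score(
    model, x_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

print(np.allclose(scores_int, scores_kfold))
```

Note that `neg_mean_squared_error` returns negated values, so that larger is still better.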

Next, we can predict the test data and check the prediction error. Here, we'll use MSE and RMSE as error metrics.

    ypred = xgbr.predict(xtest)
    mse = mean_squared_error(ytest, ypred)
    print("MSE: %.2f" % mse)

MSE: 3.35

    print("RMSE: %.2f" % np.sqrt(mse))

RMSE: 1.83
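For reference, MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root, which puts the error back in the units of the target. A small worked example with made-up numbers (the arrays below are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors: 0.25, 0.0, 2.25, 1.0 -> mean 0.875
mse_demo = np.mean((y_true - y_hat) ** 2)
rmse_demo = np.sqrt(mse_demo)

print("MSE: %.3f" % mse_demo)    # MSE: 0.875
print("RMSE: %.3f" % rmse_demo)  # RMSE: 0.935
```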

Finally, we'll visualize the original and predicted test data in a plot.

    x_ax = range(len(ytest))
    plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
    plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
    plt.legend()
    plt.show()
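The plot above has no axis labels: the x axis is the index of each test sample and the y axis is the target value ('medv'). A possible variant with labels is sketched below. The random arrays stand in for `ytest` and `ypred`, and the label texts and output file name are assumptions chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random stand-ins for ytest and ypred (illustrative only).
rng = np.random.default_rng(0)
y_orig = rng.uniform(5, 50, size=76)
y_fit = y_orig + rng.normal(0, 2, size=76)

x_idx = range(len(y_orig))
fig, ax = plt.subplots()
ax.scatter(x_idx, y_orig, s=5, color="blue", label="original")
ax.plot(x_idx, y_fit, lw=0.8, color="red", label="predicted")
ax.set_xlabel("Sample index")
ax.set_ylabel("Target value (medv)")
ax.legend()
fig.savefig("xgb_pred_vs_orig.png")  # hypothetical output file name
```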

In this post, we've briefly learned how to use XGBRegressor to predict regression data in Python. Thank you for reading.

The full source code is listed below.

    import xgboost as xgb
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.metrics import mean_squared_error
    import matplotlib.pyplot as plt
    import numpy as np

    boston = load_boston()
    x, y = boston.data, boston.target
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

    xgbr = xgb.XGBRegressor()
    print(xgbr)
    xgbr.fit(xtrain, ytrain)

    # - cross validation
    scores = cross_val_score(xgbr, xtrain, ytrain, cv=5)
    print("Mean cross-validation score: %.2f" % scores.mean())

    kfold = KFold(n_splits=10, shuffle=True)
    kf_cv_scores = cross_val_score(xgbr, xtrain, ytrain, cv=kfold)
    print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

    ypred = xgbr.predict(xtest)
    mse = mean_squared_error(ytest, ypred)
    print("MSE: %.2f" % mse)
    print("RMSE: %.2f" % np.sqrt(mse))

    x_ax = range(len(ytest))
    plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
    plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
    plt.legend()
    plt.show()

Hello,

I have a couple of questions.

1. What are the labels for the x and y axes in the above graph?

2. I'm also trying to understand the following example.

I'm confused about the first piece of code. It seems to me that plain cross-validation and cross-validation with the k-fold method perform the same actions, just with 10 folds in the second example. The result is the same. I don't understand what the cross-validation in the first example is for.

Thanks,

Marco

Hi,

1. The plot shows the 'medv' column of the Boston dataset (original and predicted). The x axis is the sample index and the y axis is the value of 'medv'.

2. They show two ways of implementing cross-validation. You can use either one.

How can I write Python code to submit work like this on kaggle.com? Thanks
