Regression Example with an Extra-Trees Method in Python

   Extremely Randomized Trees (or Extra-Trees) is an ensemble learning method. It builds many decision trees whose split thresholds are chosen at random rather than searched for the best cut, and it averages the trees' outputs to produce the final prediction. This extra randomization reduces the variance of the model.
 
   In this tutorial, we'll briefly learn how to fit and predict regression data by using Scikit-learn's ExtraTreesRegressor class in Python. The tutorial covers:
  1. Preparing the data
  2. Training the model
  3. Predicting and accuracy check
  4. Source code listing
   We'll start by loading the required libraries.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt


Preparing the data

   In this tutorial, we'll use the Boston housing dataset as the target regression data. First, we'll load the dataset and define the x (features) and y (target) parts. Note that load_boston was removed in scikit-learn 1.2, so running this example as written requires an earlier scikit-learn version.

boston = load_boston()
x, y = boston.data, boston.target
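
As a quick optional check, we can look at the size of the feature matrix and the available feature names; data, target, and feature_names are the standard fields of the Bunch object returned by load_boston.

print(x.shape)               # (506, 13): 506 samples, 13 features
print(boston.feature_names)  # names of the 13 input features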

Then, we'll split them into train and test parts. Here, we'll extract 15 percent of the dataset as test data.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
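
The split is random, so the exact numbers later in the tutorial will vary between runs. If reproducibility matters, a fixed random_state can be passed; the value below is an arbitrary choice, not something specified by the original example.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15, random_state=12)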


Training the model

   Next, we'll define the regressor by using the ExtraTreesRegressor class. The class accepts several hyperparameters, but in this example we'll use it with the default settings; a sketch with explicit parameters follows the printed defaults below.

etr = ExtraTreesRegressor()
print(etr)

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
                    max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, n_estimators='warn',
                    n_jobs=None, oob_score=False, random_state=None, verbose=0,
                    warm_start=False)
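
If we wanted to move away from the defaults shown above, the main knobs are the number of trees, the tree depth, and the per-split feature sampling. A minimal sketch is shown below; the values are only illustrative, not tuned for this dataset.

etr_custom = ExtraTreesRegressor(n_estimators=200, max_depth=10,
                                 max_features="sqrt", random_state=12)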

Then, we'll fit the model on the train data and check the model score. Note that the score() method returns the coefficient of determination (R²) rather than a classification accuracy, and it is computed here on the training data itself.

etr.fit(xtrain, ytrain)

score = etr.score(xtrain, ytrain)
print("Score: ", score)

Score:  1.0
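
The perfect training score comes from each randomized tree fitting the training data very closely; it says nothing about performance on unseen data. As mentioned earlier, the ensemble prediction is the average of the individual trees' outputs, which we can verify through the fitted estimators_ attribute. This small sketch imports numpy for the comparison.

import numpy as np

tree_preds = np.stack([tree.predict(xtest) for tree in etr.estimators_])
print(np.allclose(tree_preds.mean(axis=0), etr.predict(xtest)))  # expected to print True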

We can also apply cross-validation to the model and check the mean score across the folds.

cv_scores = cross_val_score(etr, xtrain, ytrain, cv=10)
print("Mean cross-validation score: %.2f" % cv_scores.mean())

Mean cross-validation score: 0.84
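
By default, cross_val_score uses the regressor's R² score. We could also cross-validate with a different metric, for example negative mean squared error (scikit-learn maximizes scores, hence the sign flip below); this is just an optional variation on the call above.

cv_mse = -cross_val_score(etr, xtrain, ytrain, cv=10,
                          scoring="neg_mean_squared_error")
print("Mean CV MSE: %.2f" % cv_mse.mean())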


Predicting and accuracy check

Now, we can predict the test data by using the trained model. After the prediction, we'll check the error level by using the MSE and RMSE metrics.

ypred = etr.predict(xtest)

mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % mse**(0.5))

MSE: 8.25
RMSE: 2.87
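
For comparison with the perfect training score obtained earlier, we can also compute the R² score on the held-out test data; the exact value will vary with the random split.

test_score = etr.score(xtest, ytest)
print("Test R^2: %.2f" % test_score)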

Finally, we'll visualize the test and predicted data in a plot to check the difference visually.


x_ax = range(len(ytest))
plt.plot(x_ax, ytest, lw=0.6, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.title("Boston target test and predicted data")
plt.legend()
plt.show()



   In this tutorial, we've briefly learned how to fit and predict regression data by using Scikit-learn's ExtraTreesRegressor class in Python. The full source code is listed below.


Source code listing

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

etr = ExtraTreesRegressor()
print(etr)

etr.fit(xtrain, ytrain)
score = etr.score(xtrain, ytrain)
print("Score: ", score)

cv_scores = cross_val_score(etr, xtrain, ytrain, cv=10)
print("Mean cross-validation score: %.2f" % cv_scores.mean())

ypred = etr.predict(xtest)
mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % mse**(0.5))

x_ax = range(len(ytest))
plt.plot(x_ax, ytest, lw=0.6, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.title("Boston target test and predicted data")
plt.legend()
plt.show()

