Regression Example with an Extra-Trees Method in Python

   Extremely Randomized Trees (or Extra-Trees) is an ensemble learning method. It builds many decision trees whose split thresholds are chosen at random rather than searched for the best cut, and it averages the trees' outputs to produce the final prediction. This extra randomization reduces the variance of the model.
 
   In this tutorial, we'll briefly learn how to fit and predict regression data by using Scikit-learn's ExtraTreesRegressor class in Python. The tutorial covers:
  1. Preparing the data
  2. Training the model
  3. Predicting and accuracy check
  4. Source code listing
   We'll start by loading the required libraries.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt


Preparing the data

   In this tutorial, we'll use the Boston housing dataset as the target regression data. First, we'll load the dataset and define the x (features) and y (target) parts. Note that load_boston was removed in scikit-learn 1.2, so running this example as written requires an earlier scikit-learn version.

boston = load_boston()
x, y = boston.data, boston.target
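
As a quick optional check, we can look at the size of the feature matrix and the available feature names; data, target, and feature_names are the standard fields of the Bunch object returned by load_boston.

print(x.shape)               # (506, 13): 506 samples, 13 features
print(boston.feature_names)  # names of the 13 input features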

Then, we'll split them into train and test parts. Here, we'll extract 15 percent of the dataset as test data.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
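
The split is random, so the exact numbers later in the tutorial will vary between runs. If reproducibility matters, a fixed random_state can be passed; the value below is an arbitrary choice, not something specified by the original example.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15, random_state=12)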


Training the model

   Next, we'll define the regressor by using the ExtraTreesRegressor class. The class accepts several hyperparameters, but in this example we'll use it with the default settings; a sketch with explicit parameters follows the printed defaults below.

etr = ExtraTreesRegressor()
print(etr)

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
                    max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, n_estimators='warn',
                    n_jobs=None, oob_score=False, random_state=None, verbose=0,
                    warm_start=False)
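
If we wanted to move away from the defaults shown above, the main knobs are the number of trees, the tree depth, and the per-split feature sampling. A minimal sketch is shown below; the values are only illustrative, not tuned for this dataset.

etr_custom = ExtraTreesRegressor(n_estimators=200, max_depth=10,
                                 max_features="sqrt", random_state=12)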

Then, we'll fit the model on the train data and check the model score. Note that the score() method returns the coefficient of determination (R²) rather than a classification accuracy, and it is computed here on the training data itself.

etr.fit(xtrain, ytrain)

score = etr.score(xtrain, ytrain)
print("Score: ", score)

Score:  1.0
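
The perfect training score comes from each randomized tree fitting the training data very closely; it says nothing about performance on unseen data. As mentioned earlier, the ensemble prediction is the average of the individual trees' outputs, which we can verify through the fitted estimators_ attribute. This small sketch imports numpy for the comparison.

import numpy as np

tree_preds = np.stack([tree.predict(xtest) for tree in etr.estimators_])
print(np.allclose(tree_preds.mean(axis=0), etr.predict(xtest)))  # expected to print True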

We can also apply cross-validation to the model and check the mean score across the folds.

cv_scores = cross_val_score(etr, xtrain, ytrain, cv=10)
print("Mean cross-validation score: %.2f" % cv_scores.mean())

Mean cross-validation score: 0.84
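
By default, cross_val_score uses the regressor's R² score. We could also cross-validate with a different metric, for example negative mean squared error (scikit-learn maximizes scores, hence the sign flip below); this is just an optional variation on the call above.

cv_mse = -cross_val_score(etr, xtrain, ytrain, cv=10,
                          scoring="neg_mean_squared_error")
print("Mean CV MSE: %.2f" % cv_mse.mean())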


Predicting and accuracy check

Now, we can predict the test data by using the trained model. After the prediction, we'll check the error level by using the MSE and RMSE metrics.

ypred = etr.predict(xtest)

mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % mse**(0.5))

MSE: 8.25
RMSE: 2.87
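
For comparison with the perfect training score obtained earlier, we can also compute the R² score on the held-out test data; the exact value will vary with the random split.

test_score = etr.score(xtest, ytest)
print("Test R^2: %.2f" % test_score)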

Finally, we'll visualize the test and predicted data in a plot to check the difference visually.


x_ax = range(len(ytest))
plt.plot(x_ax, ytest, lw=0.6, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.title("Boston target test and predicted data")
plt.legend()
plt.show()



   In this tutorial, we've briefly learned how to fit and predict regression data by using Scikit-learn's ExtraTreesRegressor class in Python. The full source code is listed below.


Source code listing

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

etr = ExtraTreesRegressor()
print(etr)

etr.fit(xtrain, ytrain)
score = etr.score(xtrain, ytrain)
print("Score: ", score)

cv_scores = cross_val_score(etr, xtrain, ytrain, cv=10)
print("Mean cross-validation score: %.2f" % cv_scores.mean())

ypred = etr.predict(xtest)
mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % mse**(0.5))

x_ax = range(len(ytest))
plt.plot(x_ax, ytest, lw=0.6, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.title("Boston target test and predicted data")
plt.legend()
plt.show()

