Classification Example with an Extra-Trees Method in Python

   Extremely Randomized Trees (Extra-Trees) is an ensemble learning method. It builds a collection of randomized decision trees and combines their predictions by majority voting, which improves predictive accuracy and reduces variance. Unlike a random forest, each tree is grown on the whole training set by default (no bootstrap sampling), and at each node the algorithm draws a random threshold for each candidate feature and keeps the best of these random thresholds as the splitting rule.
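   To make the random-threshold idea concrete, below is a minimal sketch (not Scikit-learn's actual implementation) that draws one random threshold per feature and keeps the split with the lowest weighted Gini impurity:

import numpy as np

def gini(labels):
    # Gini impurity of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def random_split(x, y, rng):
    # Draw one random threshold per feature and keep the best split
    best = None
    for j in range(x.shape[1]):
        t = rng.uniform(x[:, j].min(), x[:, j].max())  # random, not searched
        left, right = y[x[:, j] <= t], y[x[:, j] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j, t)
    return best  # (weighted impurity, feature index, threshold)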
 
   In this tutorial, we'll briefly learn how to classify data by using Scikit-learn's ExtraTreesClassifier class in Python. The tutorial covers:
  1. Preparing the data
  2. Training the model
  3. Predicting and accuracy check
  4. Source code listing
  5. Video tutorial
   We'll start by loading the required libraries.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix


Preparing the data

   In this tutorial, we'll use the Iris dataset as the data to classify. We'll define the x (features) and y (target) parts.

iris = load_iris()
x, y = iris.data, iris.target
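
If you'd like to inspect the data first, you can print the array shapes and class names. The Iris dataset contains 150 samples with 4 features and 3 classes.

print(x.shape, y.shape)     # (150, 4) (150,)
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']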

Then, we'll split them into train and test parts. Here, we'll hold out 15 percent of the dataset as test data.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
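
Note that train_test_split shuffles the data randomly, so the split changes on every run. If you need a reproducible split, you can pass a fixed random_state, for example:

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15, random_state=12)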


Training the model

   Next, we'll define the classifier by using the ExtraTreesClassifier class. We can set the number of trees with the n_estimators parameter; here, we'll use 100 estimators.

clf = ExtraTreesClassifier(n_estimators=100)
print(clf)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False) 
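
Besides n_estimators, parameters such as max_depth and max_features control the depth and randomness of the individual trees. As a sketch, we could limit the tree depth and fix the random seed like this (the rest of the tutorial keeps the defaults):

clf = ExtraTreesClassifier(n_estimators=100, max_depth=5, random_state=12)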

Then, we'll fit the model on the train data and check its accuracy score on the same data.

clf.fit(xtrain, ytrain)

score = clf.score(xtrain, ytrain)
print("Score: ", score)

Score:  1.0
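
Like other tree ensembles, the fitted model also exposes the feature_importances_ attribute, which averages the impurity-based importance of each feature over all trees. We can check which Iris features drive the predictions (the exact values vary from run to run):

print(dict(zip(iris.feature_names, clf.feature_importances_)))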

A perfect score on the training data can be overly optimistic, so we'll also apply cross-validation to the model to get a more reliable estimate of the training accuracy.

cv_scores = cross_val_score(clf, xtrain, ytrain, cv=5)
print("CV average score: %.2f" % cv_scores.mean())

CV average score: 0.96
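
To see the individual fold scores rather than only their average, we can print the full array:

print(cv_scores)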


Predicting and accuracy check

Now, we can predict the test data by using the trained model. After the prediction, we'll check the per-class results by using the confusion matrix function.

ypred = clf.predict(xtest)

cm = confusion_matrix(ytest, ypred)
print(cm)

[[5 0 0]
 [0 4 0]
 [0 0 6]]
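
Besides the confusion matrix, Scikit-learn's accuracy_score and classification_report functions give a quick summary of the test performance:

from sklearn.metrics import accuracy_score, classification_report

print("Test accuracy: %.2f" % accuracy_score(ytest, ypred))
print(classification_report(ytest, ypred))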

   In this tutorial, we've briefly learned how to classify data by using Scikit-learn's ExtraTreesClassifier class in Python. The full source code is listed below.


Source code listing

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix

iris = load_iris()
x, y = iris.data, iris.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

clf = ExtraTreesClassifier(n_estimators=100)
print(clf)

clf.fit(xtrain, ytrain)
score = clf.score(xtrain, ytrain)
print("Score: ", score)

cv_scores = cross_val_score(clf, xtrain, ytrain, cv=5)
print("CV average score: %.2f" % cv_scores.mean())

ypred = clf.predict(xtest)

cm = confusion_matrix(ytest, ypred)
print(cm)


Video tutorial



