Classification with the sklearn Decision Tree Classifier

  The decision tree model is a supervised learning method used to solve classification and regression problems in machine learning. It has a tree-like, top-down flow structure built from multiple if-else rules learned from the training data. Every if-else decision creates a branch, and each branch leads either to a further decision or to a final prediction.
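
To make the "nested if-else" idea concrete, here is a minimal hand-written sketch of what a very small tree looks like once it is expressed as code. The function name toy_tree_predict, the features a and b, and every threshold are invented for illustration only; a trained model chooses its own splits from the data.

```python
# Hypothetical hand-written "tree": the feature names and thresholds
# below are made up; a fitted DecisionTreeClassifier learns its own.
def toy_tree_predict(a, b):
    if b <= 9:                 # root decision on feature b
        if a <= 4:             # second decision, only reached when b <= 9
            return "low"
        return "normal"
    else:
        if a <= 6:             # second decision, only reached when b > 9
            return "normal"
        return "high"


print(toy_tree_predict(a=2, b=3))    # low
print(toy_tree_predict(a=8, b=15))   # high
```

Each return statement plays the role of a leaf, and each if statement plays the role of an internal node that tests a single feature.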
   In this post, we'll learn how to create a decision tree model with the 'sklearn' package to classify a dataset in Python. The tutorial covers:
  1. Preparing data
  2. Training Decision Tree Classifier
  3. Evaluating the results
We'll start by loading the required packages.

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score,confusion_matrix,\
 classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


Preparing data

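First, we generate a random dataset for this tutorial. Each row has three integer features, a, b, and c, and a label y that is "high" when a+b+c is greater than 25, "low" when it is less than 12, and "normal" otherwise. We create a data frame and separate it into the features (X) and the labels (Y). Then we split X and Y into train and test parts.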

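def CreateDataFrame(N):
    # Build N rows of random integer features a, b, c with a label y
    columns = ['a','b','c','y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)   # integer in [0, 10)
        b = np.random.randint(20)   # integer in [0, 20)
        c = np.random.randint(5)    # integer in [0, 5)
        # The label depends on the sum of the three features
        y = "normal"
        if (a+b+c) > 25:
            y = "high"
        elif (a+b+c) < 12:
            y = "low"

        df.loc[i] = [a, b, c, y]
    return df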

df = CreateDataFrame(500)

X = df[["a","b","c"]]
Y = df[["y"]]
XTrain, XTest, YTrain, YTest = train_test_split(X, Y, random_state=0)
 
>>> df.head(10)
   a   b  c       y
0  4   1  2     low
1  7  19  4    high
2  4   2  1     low
3  5   6  3  normal
4  4  16  1  normal
5  9   6  0  normal
6  1   6  4     low
7  5  10  2  normal
8  7   7  1  normal
9  4  15  2  normal 
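
The row-by-row loop above is easy to read but becomes slow for large N. As a rough sketch, the same kind of dataset can be generated in a vectorized way. The function name create_data_frame_vectorized, the seed argument, the explicit test_size=0.25, and the stratify option are assumptions of this sketch rather than part of the original code, and the random values it produces differ from the ones printed above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def create_data_frame_vectorized(n, seed=0):
    """Hypothetical vectorized variant of CreateDataFrame (same labelling rule)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "a": rng.integers(0, 10, size=n),   # integers in [0, 10)
        "b": rng.integers(0, 20, size=n),   # integers in [0, 20)
        "c": rng.integers(0, 5, size=n),    # integers in [0, 5)
    })
    total = df["a"] + df["b"] + df["c"]
    # "high" if the sum is above 25, "low" if below 12, otherwise "normal"
    df["y"] = np.where(total > 25, "high",
                       np.where(total < 12, "low", "normal"))
    return df


df = create_data_frame_vectorized(500)
X = df[["a", "b", "c"]]
y = df["y"]

# stratify=y keeps the class proportions similar in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(y.value_counts())
print(len(X_train), len(X_test))
```

Stratifying is optional here; it is mainly useful because the "high" class is rare, so a purely random split can leave very few "high" rows in the test part.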


Training Decision Tree Classifier

We use DecisionTreeClassifier from the 'sklearn.tree' module to create a decision tree classifier, then train the model on the XTrain and YTrain data.

dtmodel = DecisionTreeClassifier().fit(XTrain,YTrain)
>>> dtmodel
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best') 

Next, we predict the labels of the test data.

YPred = dtmodel.predict(XTest) 
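
To see the if-else rules a fitted tree has actually learned, scikit-learn offers export_text in sklearn.tree (available in scikit-learn 0.21 and later, as far as I know). The sketch below regenerates a small dataset with the same labelling rule so that it runs on its own. The max_depth=3 and random_state=0 settings are assumptions chosen to keep the printout short and repeatable; they are not the settings used in this post.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data with the same labelling rule as the tutorial (assumed seed)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 10, size=500),   # feature a
    rng.integers(0, 20, size=500),   # feature b
    rng.integers(0, 5, size=500),    # feature c
])
total = X.sum(axis=1)
y = np.where(total > 25, "high", np.where(total < 12, "low", "normal"))

# A shallow tree keeps the printed rules readable
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text rendering of the learned if-else structure
rules = export_text(model, feature_names=["a", "b", "c"])
print(rules)
```

Limiting the depth usually costs some accuracy, so this is a way to inspect the model rather than a recommended configuration.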


Evaluating the results

Finally, we can check the classification results.

accuracy = accuracy_score(YTest,YPred)
# classification_report expects the true labels first, then the predictions
report = classification_report(YTest, YPred)
cm = confusion_matrix(YTest, YPred)

print("Classification report:")
print("Accuracy: ", accuracy)
print(report)
print("Confusion matrix:")
print(cm)
 
Classification report:
Accuracy:  0.96
             precision    recall  f1-score   support

       high       1.00      0.67      0.80         9
        low       0.94      1.00      0.97        34
     normal       0.96      0.98      0.97        82

avg / total       0.96      0.96      0.96       125

Confusion matrix:
[[ 6  0  3]
 [ 0 34  0]
 [ 0  2 80]]
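
The accuracy and the per-class scores can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below copies the matrix printed above and relies on the scikit-learn convention that rows are true labels and columns are predicted labels, with classes in sorted order (high, low, normal). The variable names are my own.

```python
import numpy as np

labels = ["high", "low", "normal"]
# Rows: true class, columns: predicted class (matrix copied from the output above)
cm = np.array([[6, 0, 3],
               [0, 34, 0],
               [0, 2, 80]])

correct = np.diag(cm)            # correctly classified samples per class
true_totals = cm.sum(axis=1)     # row sums: samples that truly belong to each class
pred_totals = cm.sum(axis=0)     # column sums: samples predicted as each class

accuracy = correct.sum() / cm.sum()
recall = correct / true_totals       # share of each true class that was found
precision = correct / pred_totals    # share of each predicted class that was right

print("accuracy:", round(accuracy, 2))
for name, p, r in zip(labels, precision, recall):
    print(f"{name:>7}  precision={p:.2f}  recall={r:.2f}")
```

Here 120 of the 125 test samples lie on the diagonal, which gives the accuracy of 0.96 reported above.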

In this post, we've learned how to use the sklearn DecisionTreeClassifier to classify a dataset.
Thank you for reading! The full source code is listed below.

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score,confusion_matrix,\
 classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def CreateDataFrame(N):
    columns = ['a','b','c','y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        y = "normal"
        if (a+b+c) > 25:
            y = "high"
        elif (a+b+c) < 12:
            y = "low"

        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(500)

X = df[["a","b","c"]]
Y = df[["y"]]
XTrain, XTest, YTrain, YTest = train_test_split(X, Y, random_state=0)

dtmodel = DecisionTreeClassifier().fit(XTrain,YTrain)
YPred = dtmodel.predict(XTest)

accuracy = accuracy_score(YTest,YPred)
report = classification_report(YTest, YPred)
cm = confusion_matrix(YTest, YPred)

print("Classification report:")
print("Accuracy: ", accuracy)
print(report)
print("Confusion matrix:")
print(cm)
