In this post, we'll learn how to create a decision tree model with the scikit-learn ('sklearn') package and use it to classify a dataset in Python. The tutorial covers:
- Preparing data
- Training Decision Tree Classifier
- Evaluating the results
import pandas as pd
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
Preparing data
First, we generate a random dataset for this tutorial. We create a data frame and separate it into the features X and the labels Y. Then we split X and Y into training and test sets. By default, train_test_split holds out 25% of the rows for testing.
def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # Label each row by the sum of its three features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(500)
X = df[["a", "b", "c"]]
Y = df[["y"]]
XTrain, XTest, YTrain, YTest = train_test_split(X, Y, random_state=0)
>>> df.head(10)
   a   b  c       y
0  4   1  2     low
1  7  19  4    high
2  4   2  1     low
3  5   6  3  normal
4  4  16  1  normal
5  9   6  0  normal
6  1   6  4     low
7  5  10  2  normal
8  7   7  1  normal
9  4  15  2  normal
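Filling a data frame row by row with df.loc tends to get slow as N grows. As an optional alternative, here is a minimal sketch that builds the same kind of dataset in one step from NumPy arrays. The function name create_dataframe_vectorized and its seed argument are illustrative additions, not part of the original code.

```python
import numpy as np
import pandas as pd


def create_dataframe_vectorized(n, seed=0):
    """Build n random rows with the same labeling rule as CreateDataFrame."""
    rng = np.random.default_rng(seed)  # seeded generator for repeatable data
    a = rng.integers(0, 10, size=n)    # values 0..9, like np.random.randint(10)
    b = rng.integers(0, 20, size=n)    # values 0..19
    c = rng.integers(0, 5, size=n)     # values 0..4
    total = a + b + c
    # Same thresholds as the loop version: >25 is "high", <12 is "low".
    y = np.where(total > 25, "high", np.where(total < 12, "low", "normal"))
    return pd.DataFrame({"a": a, "b": b, "c": c, "y": y})


df_fast = create_dataframe_vectorized(500)
print(df_fast.head())
```

Passing a fixed seed makes the generated rows repeatable between runs, which the loop version does not guarantee.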
Training Decision Tree Classifier
We use the DecisionTreeClassifier class from the 'sklearn.tree' module to create a decision tree classifier, then train the model on the XTrain and YTrain data.
dtmodel = DecisionTreeClassifier().fit(XTrain,YTrain)
>>> dtmodel
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
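With max_depth=None, as in the defaults printed above, the tree keeps splitting until its leaves are pure or too small to split, which can make it hard to read. As a sketch of how the learned rules can be inspected, the example below fits a shallow tree on a small synthetic dataset and prints it with export_text from sklearn.tree (available in scikit-learn 0.21 and later). The max_depth=3 and random_state=0 settings, the demo data, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic dataset following the same labeling rule as the tutorial.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(0, 10, size=300),
    rng.integers(0, 20, size=300),
    rng.integers(0, 5, size=300),
])
total = X_demo.sum(axis=1)
y_demo = np.where(total > 25, "high", np.where(total < 12, "low", "normal"))

# Limit the depth so the printed rules stay short; fix the seed for repeatability.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X_demo, y_demo)

# Text rendering of the split thresholds and the predicted class at each leaf.
rules = export_text(shallow_tree, feature_names=["a", "b", "c"])
print(rules)
```

A shallow tree usually gives up some accuracy in exchange for rules that fit on one screen, so the depth is worth tuning against the test set.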
Next, we predict the labels of the test data.
YPred = dtmodel.predict(XTest)
Evaluating the results
Finally, we can check the classification results.
accuracy = accuracy_score(YTest, YPred)
# classification_report expects the true labels first, then the predictions
report = classification_report(YTest, YPred)
cm = confusion_matrix(YTest, YPred)

print("Classification report:")
print("Accuracy: ", accuracy)
print(report)
print("Confusion matrix:")
print(cm)

Classification report:
Accuracy:  0.96
             precision    recall  f1-score   support

       high       1.00      0.67      0.80         9
        low       0.94      1.00      0.97        34
     normal       0.96      0.98      0.97        82

avg / total       0.96      0.96      0.96       125

Confusion matrix:
[[ 6  0  3]
 [ 0 34  0]
 [ 0  2 80]]
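To see where per-class precision and recall come from, they can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with classes in sorted order (high, low, normal), and reuses the matrix printed above. The variable names are illustrative.

```python
import numpy as np

labels = ["high", "low", "normal"]
cm = np.array([[6, 0, 3],
               [0, 34, 0],
               [0, 2, 80]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in the class
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>6}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the matrix this way also shows which classes get confused with each other: every misclassified "high" row lands in the "normal" column.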
In this post, we've learned how to use scikit-learn's DecisionTreeClassifier to classify a dataset.
Thank you for reading! The full source code is listed below.
import pandas as pd
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # Label each row by the sum of its three features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(500)
X = df[["a", "b", "c"]]
Y = df[["y"]]
XTrain, XTest, YTrain, YTest = train_test_split(X, Y, random_state=0)

dtmodel = DecisionTreeClassifier().fit(XTrain, YTrain)
YPred = dtmodel.predict(XTest)

accuracy = accuracy_score(YTest, YPred)
# classification_report expects the true labels first, then the predictions
report = classification_report(YTest, YPred)
cm = confusion_matrix(YTest, YPred)

print("Classification report:")
print("Accuracy: ", accuracy)
print(report)
print("Confusion matrix:")
print(cm)