DataTechNotes: Classification with Decision Trees in Python

Decision trees are hierarchical models used in machine learning for both classification and regression tasks. They are widely used for classification because they are simple to interpret and can work with both numerical and categorical data.

In this tutorial, we'll explore how to implement classification using the decision tree model in scikit-learn. The tutorial covers:

Understanding Decision Trees
Implementing Decision Trees in Python
Conclusion
Source code listing

Understanding Decision Trees

The decision tree model is a supervised learning method. It has a tree-like structure in which each internal node tests a feature or attribute, each branch corresponds to an outcome of that test, and each leaf node holds a predicted outcome.

Decision trees consist of three main types of nodes:

Root Node: The topmost node in the tree, representing the entire dataset before any splitting occurs. It contains a decision based on a feature that best separates the data into subsets.
Decision Nodes: Internal nodes in the tree where decisions are made based on feature values. Each decision node represents a feature and a corresponding threshold (for numerical features) or categories (for categorical features).
Leaf Nodes: Terminal nodes of the tree, representing the final outcomes or predictions for specific subsets of the data. Leaf nodes do not contain any further splits; they simply assign a class label based on the majority class of the samples in the subset.
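
The three node types described above can be seen in a fitted tree. The following is a minimal sketch, assuming a deliberately shallow tree (max_depth=2 is an illustrative choice, not part of the tutorial's model) trained on the full Iris dataset. It uses scikit-learn's export_text() to print the structure as indented text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree (max_depth=2 is an illustrative assumption)
# so that the printed structure stays short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as indented text:
# lines containing a comparison are the root or decision nodes,
# lines containing "class:" are leaf nodes.
tree_text = export_text(small_tree, feature_names=list(iris.feature_names))
print(tree_text)
```

In the printed text, the first comparison is the root node, nested comparisons are decision nodes, and each "class:" line is a leaf.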

Decision-Making Process

The decision-making process of a decision tree involves traversing the tree from the root to a leaf node based on feature values. At each decision node, the tree evaluates the feature value of the input data and follows the appropriate branch based on the decision rule. This process continues until a leaf node is reached, which provides the final prediction or outcome for the input data. In this way, decision trees partition the feature space and make predictions for new instances.
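
The traversal described above can be sketched by hand using the arrays that a fitted scikit-learn tree exposes through its tree_ attribute. The helper name predict_one is hypothetical and introduced only for illustration; in practice the predict() method does this work. The sketch assumes a single-output classifier trained on the full Iris dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)


def predict_one(model, x):
    """Walk from the root to a leaf for one sample (hypothetical helper)."""
    tree = model.tree_
    x = np.asarray(x, dtype=np.float32)  # scikit-learn trees work on float32 inputs
    node = 0  # the root node always has index 0
    # A node is a leaf when it has no children (child index -1).
    while tree.children_left[node] != -1:
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]   # rule satisfied: go left
        else:
            node = tree.children_right[node]  # rule not satisfied: go right
    # The leaf predicts the class with the largest share of training samples.
    return model.classes_[np.argmax(tree.value[node][0])]


sample = iris.data[0]
print("Manual traversal:", predict_one(clf, sample))
print("clf.predict():   ", clf.predict([sample])[0])
```

Both lines should print the same class label, since the loop follows the same decision rules that predict() applies internally.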

Implementing Decision Trees in Python

In this part of the tutorial, we implement a decision tree classifier for a classification task using scikit-learn in Python. We'll begin by importing the necessary libraries, including the 'DecisionTreeClassifier' class from the sklearn.tree module.

 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

Next, we load the Iris dataset using the load_iris() function from scikit-learn and split the dataset into training and testing sets using the train_test_split() function.

 
# Loading the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
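
As a quick sanity check on the split, the sizes of the resulting arrays and the class counts in each subset can be printed. This is a small illustrative sketch that repeats the same split; the printed labels are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 150 samples with test_size=0.2 gives 120 training and 30 testing rows.
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

# Number of samples per class in each subset
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))
```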

Then, we instantiate the DecisionTreeClassifier() and train the classifier on the training data using the fit() method.

 
# Instantiating the decision tree classifier (fixed random_state for reproducible results)
clf = DecisionTreeClassifier(random_state=42)

# Training the classifier on the training data
clf.fit(X_train, y_train)
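
Once the classifier is fitted, it can be inspected to see what was learned. The sketch below repeats the training step under the same split and random_state as the full listing, then prints the tree depth, the number of leaves, and the feature importances. The exact values depend on the scikit-learn version and are not shown here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Size of the fitted tree
print("Tree depth:", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())

# Relative contribution of each feature to the splits (values sum to 1)
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```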

After training, we make predictions on the testing data using the predict() method. Finally, we evaluate the classifier: accuracy_score() computes the overall accuracy and classification_report() summarizes per-class precision, recall, and F1-score. Both functions come from scikit-learn's metrics module.

 
# Making predictions on the testing data
y_pred = clf.predict(X_test)

# Calculating the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
cr = classification_report(y_test, y_pred)
print("Classification report:\n", cr)

The result looks as follows.

  
Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
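
A confusion matrix is another way to look at the same predictions, showing how many test samples of each true class were assigned to each predicted class. This is an optional sketch using scikit-learn's confusion_matrix() function, which is not part of the original listing. It repeats the split and training steps so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes;
# correct predictions lie on the diagonal.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```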

Conclusion

In this tutorial, we explored the fundamental concepts of the decision tree model and how to implement it for classification tasks in Python using the scikit-learn library. We learned about the hierarchical structure of decision trees and how they divide the feature space to make predictions. The full source code is listed below.

Source code listing

 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Loading the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiating the decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Training the classifier on the training data
clf.fit(X_train, y_train)

# Making predictions on the testing data
y_pred = clf.predict(X_test)

# Calculating the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
cr = classification_report(y_test, y_pred)
print("Classification report:\n", cr)

1 comment:

Evan Raymonds, July 3, 2019 at 12:06 AM
A decision tree is a visual model for decision making which represents consequences, including chance event outcomes, resource costs, and utility. It is also one way to display an algorithm that only contains conditional control statements. Making decision trees is super easy with a decision tree maker with free templates.
