Classification with the Gaussian Naive Bayes Model in Python

     Naive Bayes is a classification algorithm based on Bayes' Theorem, a fundamental principle in probability theory. It works by calculating the probability that a given input belongs to each possible class, then selecting the class with the highest probability as the predicted outcome. In this tutorial, we'll explore the Naive Bayes model and its practical application using the Scikit-learn library in Python. We'll cover the following topics:

  1. Introduction to Naive Bayes
  2. Preparing data
  3. Training model
  4. Prediction and accuracy check
  5. Conclusion
  6. Source code listing

 

Introduction to Naive Bayes

    Naive Bayes is a classification algorithm based on Bayes' theorem, with the "naive" assumption that features are independent of each other given the class label. It's a simple and effective probabilistic model used for classification tasks. There are several Naive Bayes methods, including:

  • Gaussian Naive Bayes: Assumes that continuous features follow a Gaussian distribution.
  • Multinomial Naive Bayes: Suitable for features representing counts or frequencies.
  • Bernoulli Naive Bayes: Applicable when features are binary (presence or absence).
  • Categorical Naive Bayes: Designed for features that are categorical (non-binary).
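
    All four variants ship with scikit-learn's sklearn.naive_bayes module and share the same fit/predict interface, so switching between them is usually a one-line change:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB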

 

Bayes' Theorem 

    Bayes' theorem calculates the probability of a hypothesis (class label) given the data, based on prior knowledge. Mathematically, it's represented as:

P(y|x) = P(x|y) · P(y) / P(x)

  • P(y|x): Probability of class y given the input features x (posterior).
  • P(x|y): Probability of observing the features x given class y (likelihood).
  • P(y): Probability of class y occurring (prior).
  • P(x): Probability of observing features x (evidence).
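
    As a quick worked example with made-up numbers: if a class has prior P(y) = 0.3, the likelihood of the observed features under that class is P(x|y) = 0.8, and the evidence is P(x) = 0.4, then the posterior is P(y|x) = (0.8 × 0.3) / 0.4 = 0.6.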

    Naive Bayes assumes that the features are conditionally independent given the class label: the presence of a particular feature in a class is unrelated to the presence of any other feature. Under this assumption the likelihood factorizes as P(x|y) = P(x1|y) · P(x2|y) · … · P(xn|y), which is what makes the model so cheap to estimate.

    Training a Naive Bayes model amounts to estimating these per-class distributions: means and variances for Gaussian Naive Bayes, and feature probabilities for the multinomial and Bernoulli variants.

 

Classification 

    To classify a new instance, Naive Bayes computes the posterior probability P(y|x) for each class label and selects the class with the highest value. Since the evidence P(x) is the same for every class, it can be dropped, so the prediction is simply the class that maximizes P(x|y) · P(y). A minimal from-scratch sketch of this rule is shown below.
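
    Concretely, here is a minimal from-scratch sketch of this decision rule for the Gaussian case, using NumPy. It is illustrative only (no variance smoothing or other numerical safeguards, and the function names are our own), not scikit-learn's implementation:

import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(x, classes, priors, means, variances):
    """Pick the class maximizing log P(y) + sum of log P(x_i | y)."""
    # Sum of per-feature Gaussian log-densities for each class
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1)
    log_posterior = np.log(priors) + log_likelihood
    return classes[np.argmax(log_posterior)]

Working in log space avoids underflow from multiplying many small probabilities together.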

    Naive Bayes is known for its simplicity, speed, and scalability. It performs well in many real-world applications, especially when the independence assumption approximately holds or when the feature space is high-dimensional.

 

Preparing data

    We'll start by loading the necessary libraries for this tutorial. Make sure you have the scikit-learn library installed (pip install scikit-learn).

 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

    Next, we load the Iris dataset bundled with scikit-learn and split it into training and testing sets using the train_test_split function.


# Load the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
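
    If you want the class proportions preserved in both splits, train_test_split also accepts a stratify argument. It's optional here, since the Iris classes are balanced, but it is a useful habit for imbalanced data:

# Stratified variant of the same split (optional for Iris)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)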

 

Training model

    We create a Gaussian Naive Bayes classifier using the GaussianNB class and train it on the training data using the fit method.

 
# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the classifier on the training data
gnb.fit(X_train, y_train)
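
    After fitting, the estimated parameters are exposed as attributes, which lets us inspect what the model learned. (Attribute names as in recent scikit-learn releases; var_ was named sigma_ before version 1.0.)

# Inspect the fitted parameters
print("Class priors:", gnb.class_prior_)
print("Per-class feature means:\n", gnb.theta_)
print("Per-class feature variances:\n", gnb.var_)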


Prediction and accuracy check

    We use the trained classifier to make predictions on the test data X_test. The predict() method is applied to the model object with the test features as input, resulting in predicted class labels y_pred.
    We compute the accuracy of the model predictions by comparing the predicted class labels with the actual class labels from the test set. The accuracy_score() function from scikit-learn is used to calculate the accuracy.
    The classification report includes metrics such as precision, recall, F1-score, and support for each class.

 
# Make predictions on the testing data
y_pred = gnb.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

The result appears as follows:


Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
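
    Beyond hard labels, GaussianNB also provides posterior class probabilities through its predict_proba method, which is handy when you need confidence estimates rather than a single predicted class:

# Posterior probabilities for the first three test samples (one column per class)
probabilities = gnb.predict_proba(X_test[:3])
print(probabilities)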

 

Conclusion

    This tutorial has provided an overview of Naive Bayes classification, showing how to prepare the Iris dataset, split it into training and testing sets, train a Gaussian Naive Bayes classifier, make predictions, and evaluate performance with accuracy and a classification report. The full source code is listed below.


Source code listing

 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the classifier on the training data
gnb.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = gnb.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))



