- Preparing data
- Training bagging classifier
- Predicting test data and checking the accuracy
- Checking accuracy by changing base estimator
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
Preparing data
In this tutorial, I'll generate a dataset from random numbers with the function below. You can use any classification dataset instead, such as iris. After generating the data, we separate it into X, the features of the dataset, and Y, the output labels.
def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # Label each row by the sum of its three features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)
df.head()
   a   b  c       y
0  4  13  0  normal
1  9  17  2    high
2  5   3  0     low
3  7   1  4  normal
4  3   6  3  normal
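The row-by-row loop is fine for 200 rows, but it gets slow for larger N. As a rough sketch, the same data can be generated in one shot with NumPy arrays. The helper name `create_dataframe_vectorized` and the `seed` parameter are my own additions for illustration, not part of the original function.

```python
import numpy as np
import pandas as pd


def create_dataframe_vectorized(n, seed=None):
    """Vectorized variant of CreateDataFrame (hypothetical helper name)."""
    rng = np.random.default_rng(seed)
    # Same ranges as the loop version: a in [0, 10), b in [0, 20), c in [0, 5)
    a = rng.integers(0, 10, size=n)
    b = rng.integers(0, 20, size=n)
    c = rng.integers(0, 5, size=n)
    total = a + b + c
    # np.select takes the first matching condition, otherwise the default label
    y = np.select([total > 25, total < 12], ["high", "low"], default="normal")
    return pd.DataFrame({"a": a, "b": b, "c": c, "y": y})


df_vec = create_dataframe_vectorized(200, seed=0)
print(df_vec["y"].value_counts())
```

Unlike the loop version, this keeps the feature columns as integer dtype instead of object, and passing a seed makes the dataset reproducible.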
As you can see, the output Y is categorical. It needs to be encoded, so we assign numeric codes to the labels with LabelEncoder.
X = df[["a", "b", "c"]]
Y = df[["y"]]
Y.head()
        y
0    high
1  normal
2  normal
3  normal
4     low
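le = LabelEncoder()
# np.ravel flattens the single-column DataFrame to 1-D,
# which avoids a DataConversionWarning from LabelEncoder
y = le.fit_transform(np.ravel(Y))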
print(y[0:5])
[0 2 2 2 1]
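If you want to know which number stands for which label, the encoder keeps that mapping. The small sketch below reuses the five labels from the Y.head() output; the variable names `labels`, `le_demo`, and `codes` are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# The five labels shown in the Y.head() output
labels = ["high", "normal", "normal", "normal", "low"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

print(codes)                                    # integer code per label
print(list(le_demo.classes_))                   # sorted classes; position = code
print(list(le_demo.inverse_transform(codes)))   # codes mapped back to labels
```

LabelEncoder sorts the classes alphabetically, so here "high" becomes 0, "low" becomes 1, and "normal" becomes 2. The same order applies to the rows and columns of the confusion matrices later on.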
Next, we split the data into train and test parts with the train_test_split function. With its default settings, it holds out 25% of the rows for testing.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
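To see what the split produces, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are placeholders, not the tutorial's data). It also shows the optional `stratify` argument, which may be worth trying when the classes are as unbalanced as ours; the tutorial itself does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows, three unbalanced classes
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.array([0] * 20 + [1] * 40 + [2] * 140)

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# Stratified split: class proportions are preserved in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)
print(np.bincount(y_te_s))  # test-set count per class
```

With 200 rows, the default split gives 150 training rows and 50 test rows, which matches the 50 samples counted in the confusion matrices of this post.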
Training bagging classifier
We use the BaggingClassifier class from the sklearn.ensemble module to build the bagging classifier model. Here, we set a DecisionTreeClassifier as the base estimator and the number of estimators to 100, then train the model on the training data.
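dtc = DecisionTreeClassifier(criterion="entropy")
# The base estimator is passed positionally because its keyword was
# renamed from base_estimator to estimator in scikit-learn 1.2
bag_model = BaggingClassifier(dtc, n_estimators=100, bootstrap=True)
bag_model = bag_model.fit(Xtrain, ytrain)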
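To get a feel for what bagging does, here is a hand-rolled sketch: each tree is trained on a bootstrap sample (rows drawn with replacement), and the final prediction is a majority vote. This is a simplification and not scikit-learn's actual implementation; BaggingClassifier averages predicted class probabilities when the base estimator supports them. The function names and the toy data are my own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=100, seed=0):
    """Train each tree on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # n row indices, drawn with replacement
        tree = DecisionTreeClassifier(
            criterion="entropy", random_state=int(rng.integers(0, 2**31 - 1))
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_majority(trees, X, n_classes):
    """Combine the trees by plain majority vote (hypothetical helper)."""
    X = np.asarray(X)
    votes = np.stack([t.predict(X) for t in trees])  # (n_estimators, n_samples)
    # Count the votes per class for every sample, then take the winner
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


# Toy data following the same labeling rule as the tutorial
rng = np.random.default_rng(0)
X_toy = np.column_stack(
    [rng.integers(0, 10, 200), rng.integers(0, 20, 200), rng.integers(0, 5, 200)]
)
total = X_toy.sum(axis=1)
y_toy = np.where(total > 25, 0, np.where(total < 12, 1, 2))  # 0=high, 1=low, 2=normal

trees = fit_bagged_trees(X_toy[:150], y_toy[:150], n_estimators=25)
pred = predict_majority(trees, X_toy[150:], n_classes=3)
print("accuracy:", (pred == y_toy[150:]).mean())
```

Because every tree sees a slightly different sample, the individual trees disagree on borderline rows, and the vote smooths out those differences.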
Next, we predict the test data and check the prediction accuracy.
ytest_pred=bag_model.predict(Xtest)
print(bag_model.score(Xtest, ytest))
0.9
print(confusion_matrix(ytest, ytest_pred))
[[ 4  0  1]
 [ 0 10  1]
 [ 1  2 31]]
The model predicted the test data with 90% accuracy, that is, 45 of the 50 test samples were classified correctly.
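The score and the confusion matrix tell the same story. As a quick check, the accuracy can be recomputed from the matrix printed above: the diagonal holds the correct predictions (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix printed above
cm = np.array([[4, 0, 1],
               [0, 10, 1],
               [1, 2, 31]])

correct = np.trace(cm)   # sum of the diagonal
total = cm.sum()         # all test samples
accuracy = correct / total
print(correct, total, accuracy)
```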
Checking accuracy by changing base estimator
We can change the base estimator in the BaggingClassifier class. Here, we'll use logistic regression and Naive Bayes (Gaussian and Bernoulli) as base estimators, alongside the decision tree, and check their prediction accuracy.
lr = LogisticRegression()
bnb = BernoulliNB()
gnb = GaussianNB()

base_methods = [lr, bnb, gnb, dtc]
for bm in base_methods:
    print("Method: ", bm)
    # Base estimator passed positionally (keyword renamed in scikit-learn 1.2)
    bag_model = BaggingClassifier(bm, n_estimators=100, bootstrap=True)
    bag_model = bag_model.fit(Xtrain, ytrain)
    ytest_pred = bag_model.predict(Xtest)
    print(bag_model.score(Xtest, ytest))
    print(confusion_matrix(ytest, ytest_pred))
Method:  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.9
[[ 0  0  5]
 [ 0 11  0]
 [ 0  0 34]]
Method:  BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
0.74
[[ 0  0  5]
 [ 0  3  8]
 [ 0  0 34]]
Method:  GaussianNB(priors=None)
0.82
[[ 1  0  4]
 [ 0 11  0]
 [ 1  4 29]]
Method:  DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
0.92
[[ 4  0  1]
 [ 0 10  1]
 [ 1  1 32]]
You can check each method's prediction accuracy in the results above. In this run, the decision tree scores highest (0.92) and Bernoulli Naive Bayes lowest (0.74).
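Accuracy alone can hide how a model treats the smaller classes. The sketch below takes the four confusion matrices printed above and computes the per-class recall (the share of each true class that was predicted correctly). The class order "high", "low", "normal" follows LabelEncoder's alphabetical sorting; the dictionary layout is just my own choice for illustration.

```python
import numpy as np

# Confusion matrices copied from the output above
matrices = {
    "LogisticRegression": [[0, 0, 5], [0, 11, 0], [0, 0, 34]],
    "BernoulliNB": [[0, 0, 5], [0, 3, 8], [0, 0, 34]],
    "GaussianNB": [[1, 0, 4], [0, 11, 0], [1, 4, 29]],
    "DecisionTreeClassifier": [[4, 0, 1], [0, 10, 1], [1, 1, 32]],
}
class_names = ["high", "low", "normal"]  # LabelEncoder's sorted order

recalls = {}
for name, cm in matrices.items():
    cm = np.asarray(cm)
    # Rows are true classes, so recall = diagonal / row sum
    recalls[name] = cm.diagonal() / cm.sum(axis=1)
    accuracy = np.trace(cm) / cm.sum()
    per_class = ", ".join(
        f"{c}={r:.2f}" for c, r in zip(class_names, recalls[name])
    )
    print(f"{name}: accuracy={accuracy:.2f}, recall: {per_class}")
```

For example, the logistic regression model reaches 0.9 accuracy here but never predicts "high" at all, while the decision tree recovers 4 of the 5 "high" samples.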
In this post, we've briefly learned how to classify data with the BaggingClassifier class in Python. Thank you for reading! The full source code is listed below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # Label each row by the sum of its three features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)
df.head()

X = df[["a", "b", "c"]]
Y = df[["y"]]

le = LabelEncoder()
# np.ravel flattens the single-column DataFrame to 1-D,
# which avoids a DataConversionWarning from LabelEncoder
y = le.fit_transform(np.ravel(Y))

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

dtc = DecisionTreeClassifier(criterion="entropy")
# The base estimator is passed positionally because its keyword was
# renamed from base_estimator to estimator in scikit-learn 1.2
bag_model = BaggingClassifier(dtc, n_estimators=100, bootstrap=True)
bag_model = bag_model.fit(Xtrain, ytrain)

ytest_pred = bag_model.predict(Xtest)
print(bag_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))

lr = LogisticRegression()
bnb = BernoulliNB()
gnb = GaussianNB()

base_methods = [lr, bnb, gnb, dtc]
for bm in base_methods:
    print("Method: ", bm)
    bag_model = BaggingClassifier(bm, n_estimators=100, bootstrap=True)
    bag_model = bag_model.fit(Xtrain, ytrain)
    ytest_pred = bag_model.predict(Xtest)
    print(bag_model.score(Xtest, ytest))
    print(confusion_matrix(ytest, ytest_pred))
Reference:
1. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
Hi dude, this post is one of the most simple and explanatory I could find, it helped me a lot.
Only in the part of:
Y = df[["y"]]
I changed:
Y = df[["y"][0]]
So that Python doesn't show a message on the terminal.
Thank you!