SelectKBest Feature Selection Example in Python

     The scikit-learn API provides the SelectKBest class for extracting the best features of a given dataset. SelectKBest keeps the features with the k highest scores. By changing the 'score_func' parameter, we can apply the method to both classification and regression data. Selecting the best features is an important step when preparing a large dataset for training: it helps us eliminate the less informative parts of the data and reduce training time.

    In this tutorial, we'll briefly learn how to select the best features of classification and regression data by using the SelectKBest class in Python. The tutorial covers:

  1. SelectKBest for classification data
  2. SelectKBest for regression data
  3. Source code listing
   We'll start by loading the required libraries and functions.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
from sklearn.datasets import load_boston  # deprecated in scikit-learn 1.0, removed in 1.2
from sklearn.datasets import load_iris
from numpy import array
 


SelectKBest for classification data

   First, we'll apply SelectKBest to classification data, using the Iris dataset. We'll load the dataset and check the feature data dimensions. The 'data' attribute of the iris object holds the feature data.

iris = load_iris()
x = iris.data
y = iris.target
 
print("Feature data dimension: ", x.shape) 
 
Feature data dimension:  (150, 4) 

Next, we'll define the selector by using the SelectKBest class. For classification, we'll set 'chi2' as the scoring function (chi2 requires non-negative feature values, which the Iris measurements satisfy). The target number of features is defined by the k parameter. Then we'll fit and transform the selector on the training x and y data.

select = SelectKBest(score_func=chi2, k=3)
z = select.fit_transform(x, y)
 
print("After selecting best 3 features:", z.shape) 
 
After selecting best 3 features: (150, 3) 

We've selected the 3 best features in the x data. To identify which features were selected, we use the get_support() method, which returns a boolean mask, and apply it to the feature name list. The z object contains the selected x data.

mask = select.get_support()  # boolean mask marking the selected features
features = array(iris.feature_names)
 
print("All features:")
print(features)
 
print("Selected best 3:")
print(features[mask])
print(z) 


All features:
['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)' 'petal width (cm)']
Selected best 3:
['sepal length (cm)' 'petal length (cm)' 'petal width (cm)'] 
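As a side note, the fitted selector also exposes scores_ and pvalues_ attributes, so we can inspect how each feature scored under the chi-squared test, which helps explain why a feature was dropped:

print("Chi-squared scores:", select.scores_)
print("p-values:", select.pvalues_)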
 


SelectKBest for regression data

   We apply the same method to regression data, changing only the scoring function. We'll load the Boston housing dataset and check the feature data dimensions. Note that load_boston was removed in scikit-learn 1.2, so running this part requires an older version of the library.

boston = load_boston()
x = boston.data
y = boston.target

print("Feature data dimension: ", x.shape)
 
Feature data dimension:  (506, 13) 
 

Next, we'll define the selector by using the SelectKBest class. For regression, we'll set 'f_regression' as the scoring function. The target number of features to select is 8. We'll fit and transform the selector on the training x and y data.

select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y) 
 
print("After selecting best 8 features:", z.shape) 
 
After selecting best 8 features: (506, 8) 

As before, to identify the selected features we use the get_support() method and apply the resulting mask to the feature name list. The z object contains the selected x data.

mask = select.get_support()  # boolean mask marking the selected features
features = array(boston.feature_names)
 
print("All features:")
print(features)
 
print("Selected best 8:")
print(features[mask])
print(z) 

All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected best 8:
['CRIM' 'INDUS' 'NOX' 'RM' 'RAD' 'TAX' 'PTRATIO' 'LSTAT'] 
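A natural follow-up question is how to choose k. One common approach is to treat k as a hyperparameter and tune it with cross-validation. The snippet below is a minimal sketch of that idea, assuming a plain linear regression model on the same x and y data; it is not part of the original example.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# pipeline: feature selection followed by a regressor
pipe = Pipeline([("select", SelectKBest(score_func=f_regression)),
                 ("model", LinearRegression())])

# search over every possible k with 5-fold cross-validation
params = {"select__k": range(1, x.shape[1] + 1)}
search = GridSearchCV(pipe, params, cv=5)
search.fit(x, y)

print("Best k:", search.best_params_["select__k"])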
 
 
   In this tutorial, we've briefly learned how to select the k best features of classification and regression data by using the SelectKBest class in Python. The full source code is listed below.
Source code listing
 
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
from sklearn.datasets import load_boston  # deprecated in scikit-learn 1.0, removed in 1.2
from sklearn.datasets import load_iris
from numpy import array
 
 
iris = load_iris()
x = iris.data
y = iris.target
 
print("Feature data dimension: ", x.shape)

select = SelectKBest(score_func=chi2, k=3)
z = select.fit_transform(x, y)
print("After selecting best 3 features:", z.shape)

mask = select.get_support()  # boolean mask marking the selected features
features = array(iris.feature_names)
 
print("All features:")
print(features)
 
print("Selected best 3:")
print(features[mask])
print(z)


boston = load_boston()
x = boston.data
y = boston.target

print("Feature data dimension: ", x.shape)

select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y)
print("After selecting best 8 features:", z.shape)

mask = select.get_support()  # boolean mask marking the selected features
features = array(boston.feature_names)
 
print("All features:")
print(features)
 
print("Selected best 8:")
print(features[mask])
print(z)   
 

