Feature Selection Example with RFECV in Python

      RFECV (Recursive Feature Elimination with Cross-Validation) performs recursive feature elimination within a cross-validation loop to find the optimal number of features. Scikit-learn provides the RFECV class to apply this method and identify the most important features in a given dataset.

    Selecting optimal features is an important part of data preparation in machine learning. It helps us eliminate less important parts of the data and reduces training time on large datasets.

    In this tutorial, we'll briefly learn how to select the best features of classification and regression data by using RFECV in Python. The tutorial covers:
  1. RFECV for classification data
  2. RFECV for regression data
  3. Source code listing
   We'll start by loading the required libraries and functions.

 
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array  
 


RFECV for classification

   First, we'll apply RFECV to a classification dataset. We'll load the iris dataset and take the feature and label parts of the data.

 
iris = load_iris()
x = iris.data
y = iris.target
 

RFECV requires an estimator model. Here, we can use the RandomForestClassifier class as the estimator. Then we'll define RFECV and fit it on the x and y data. The ranking_ attribute gives the ranking position of each feature; optimal features are ranked 1.


rfc = RandomForestClassifier()

select = RFECV(estimator=rfc, cv=10)
select = select.fit(x,y)
 
print("Feature ranking: ", select.ranking_)
  
Feature ranking:  [2 3 1 1]


Next, we'll extract the selected features. The get_support() function returns a boolean mask identifying those features.

 
mask = select.get_support()
features = array(iris.feature_names) 
best_features = features[mask]
 
print("All features: ", x.shape[1])
print(features)

print("Selected best: ", best_features.shape[0])
print(features[mask]) 

All features:  4
['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
 'petal width (cm)']
Selected best:  2
['petal length (cm)' 'petal width (cm)']
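The fitted selector can also reduce the dataset directly: transform() keeps only the rank-1 columns. A short sketch on the same iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

iris = load_iris()
x, y = iris.data, iris.target

select = RFECV(estimator=RandomForestClassifier(random_state=0), cv=5)
select.fit(x, y)

# transform() drops every column whose ranking is not 1
x_selected = select.transform(x)
print("Before:", x.shape, "after:", x_selected.shape)
```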
  


RFECV for regression data

   We apply the same method to regression data. We'll load the Boston housing dataset and take the feature and label parts of the data. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; with newer versions you can substitute another regression dataset.

 
boston = load_boston()
x = boston.data
y = boston.target


Next, we'll define the estimator model and pass it to the RFECV class, setting step=1 so that one feature is eliminated per iteration. Then we fit the model on the x and y data. As before, the ranking_ attribute gives the ranking position of each feature, and optimal features are ranked 1.

 
rfr = RandomForestRegressor()
select = RFECV(rfr, step=1, cv=5)
select = select.fit(x, y)
 
print("Feature ranking: ", select.ranking_)

Feature ranking:  [1 2 1 3 1 1 1 1 1 1 1 1 1]
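When the number of features to keep is already known in advance, the related RFE class performs the same recursive elimination without the cross-validation loop. A minimal sketch on synthetic data (make_regression is used here as a stand-in dataset, and keeping 5 features is an arbitrary choice):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# synthetic regression data: 10 features, 4 of them informative
x, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# keep exactly 5 features, eliminating one per iteration
rfe = RFE(RandomForestRegressor(random_state=0), n_features_to_select=5, step=1)
rfe.fit(x, y)

print("Kept features:", rfe.n_features_)
print("Ranking:", rfe.ranking_)
```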
  

Next, we'll extract the selected features. The get_support() function returns a boolean mask identifying those features.

 
mask = select.get_support()
features = array(boston.feature_names)
best_features = features[mask]

print("All features: ", x.shape[1])
print(features)

print("Selected best: ", best_features.shape[0])
print(features[mask])

All features:  13
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected best: 11
['CRIM' 'INDUS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'] 
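RFECV's scoring and min_features_to_select parameters control the evaluation metric and the lower bound of the search. Since load_boston is gone from recent scikit-learn releases, this sketch uses make_regression as a stand-in dataset (the parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

x, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# score candidate subsets with negative MSE instead of the default R^2,
# and never search below 2 features
select = RFECV(RandomForestRegressor(random_state=0), step=1, cv=3,
               scoring="neg_mean_squared_error", min_features_to_select=2)
select.fit(x, y)

print("Selected", select.n_features_, "of", x.shape[1], "features")
```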
 
 
 
   In this tutorial, we've briefly learned how to select the optimal features of classification and regression data by using the RFECV class in Python. The full source code is listed below.


Source code listing
 
 
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array 


# RFECV for classification 
iris = load_iris()
x = iris.data
y = iris.target 

rfc = RandomForestClassifier()
select = RFECV(estimator=rfc, cv=10)
select = select.fit(x,y)
print("Feature ranking: ", select.ranking_)

mask = select.get_support()
features = array(iris.feature_names) 
best_features = features[mask]

print("All features: ", x.shape[1])
print(features)

print("Selected best: ", best_features.shape[0])
print(features[mask])


# RFECV for regression
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
boston = load_boston()
x = boston.data
y = boston.target

rfr = RandomForestRegressor()
select = RFECV(rfr, step=1, cv=5)
select = select.fit(x, y)
print("Feature ranking: ", select.ranking_)

mask = select.get_support()
features = array(boston.feature_names)
best_features = features[mask]

print("All features: ", x.shape[1])
print(features)

print("Selected best: ", best_features.shape[0])
print(features[mask])   
 

