Filter-Based Feature Selection

Identifies the features in a dataset with the greatest predictive power

Category: Feature Selection Modules

Module Overview

You can use the Filter-Based Feature Selection module to identify the subset of input columns that have the greatest predictive power. In general, feature selection refers to the process of applying statistical tests to input values given a specified output, to determine which columns are more correlated with the output. The Filter-Based Feature Selection module provides multiple feature selection algorithms, which you apply based on the type of predictive task and data types.

The module requires as input a dataset that contains two or more feature columns.

You then choose a statistical method to apply. Each has different requirements: some require numeric data, whereas others can work with categorical data as well.

The module has two outputs: the first is a dataset containing the top features (columns) as ranked by predictive power. The second is a dataset containing the names of the scored columns and the numeric score assigned to each.

Understanding Filter-Based Feature Selection

Feature selection is the process of selecting those attributes in your dataset that are most relevant to the predictive modeling problem you are working on. By choosing the right features, you can potentially improve the accuracy and efficiency of classification.

Feature selection can also be used to identify unneeded, irrelevant, and redundant attributes in the dataset. By applying statistical measures, you can determine which columns do not contribute to the accuracy of the predictive model (or might in fact decrease its accuracy) and remove them before training a model.

Filter-Based Feature Selection uses different statistical tests to determine the subset of features with the highest predictive power. You choose a statistical measure to apply, and the module calculates a score for each column that you have used as a feature. The features are then ranked by the score and the feature columns with the best scores are used in building the model, while others are kept in the dataset but not used for analysis.
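
This score-rank-select scheme can be sketched in a few lines of R. The sketch below is illustrative only: absolute Pearson correlation stands in for whichever metric you choose, and select_top_features is a hypothetical helper, not part of the module.

    # Minimal filter-based selection: score every candidate feature against
    # the target, rank by score, and keep the names of the top k columns.
    select_top_features <- function(data, target, k) {
      candidates <- setdiff(names(data), target)
      scores <- sapply(candidates, function(col) {
        abs(cor(data[[col]], data[[target]], use = "pairwise.complete.obs"))
      })
      head(names(sort(scores, decreasing = TRUE)), k)
    }

    # Example: rank the columns of the built-in mtcars data against
    # fuel efficiency (mpg) and keep the three best-scoring columns.
    select_top_features(mtcars, target = "mpg", k = 3)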

How to Use Feature Selection

To use feature selection, you must choose an input dataset that contains at least two columns that are candidates for use as features. The columns that you can analyze depend on the target column and the metric used to compute the scores.

  • Target column
    For all methods except count-based feature selection, you must specify the single column that serves as the label, or target, for the dataset. Click Launch column selector to choose the target column either by name or by its index (indexes are one-based). The module will return an error on execution if you choose a column with the wrong data type, choose no column or too many columns, or choose a column that cannot be a label.
  • Feature scoring method
    Next, you choose the statistical method to use in calculating feature scores. For detailed information about these scores, see the Technical Notes section.

    • Pearson Correlation

    • Mutual Information

    • Kendall Correlation

    • Spearman Correlation

    • Chi Squared

    • Fisher Score

    • Count Based

    The choice of feature selection scoring method depends in part on the type of data you have. For example, some methods require numeric data; others can work with data that represents a ranking. If you apply a scoring method to a column whose data type the method does not support, a zero score is assigned. Check the requirements in the Technical Notes section before choosing a method.

  • Number of desired features
    For almost all methods, you can specify the number of best features that you want returned. Each method scores all input columns, ranks the features by descending score, and returns only the top features.

    The exception is count-based feature selection, which by default processes all columns passed as inputs.

    • The minimum number of features you can specify is 1, but we recommend that you increase this value.

    • If the specified number of desired features is greater than the number of columns in the dataset, then all features are returned.

  • Operate on feature columns only
    When this option is selected, the method generates a score only for columns that have been previously marked as features. If you deselect this option, the module will check any column that has an appropriate data type.

    If a column you want to use is not marked as a feature, you can use Metadata Editor to mark it as a feature column.

    A feature selection score cannot be generated for any column that is designated as a label or a score column.

If you want to define a custom feature selection method, you can use the Execute R Script module.
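
A hedged sketch of what such a custom method might look like follows. The variance metric and the value of k are arbitrary choices for illustration; the maml port-mapping calls are those documented for Execute R Script, but verify them against your version of the module.

    # Hypothetical custom selector for Execute R Script: keep the k numeric
    # columns with the highest variance. Variance is an example metric only.
    dataset1 <- maml.mapInputPort(1)   # read the input data frame

    k <- 5
    numeric.cols <- Filter(is.numeric, dataset1)
    scores <- sapply(numeric.cols, var, na.rm = TRUE)
    keep <- names(sort(scores, decreasing = TRUE))[seq_len(min(k, length(scores)))]

    data.set <- dataset1[, keep, drop = FALSE]
    maml.mapOutputPort("data.set")     # emit the reduced dataset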

Results

Given the selected input columns and parameters, the module scores the candidate feature columns and generates these outputs:

  • The first output is a dataset containing the columns that were identified as being the best features, meaning they had the highest predictive scores using the selected metric.

    This dataset also includes the selected target column, in the leftmost column of the output table. This way you can tell which target the input columns were correlated with.

    The columns are ordered by descending feature importance score.

  • The second output is a short table containing just the scores for those columns, given the selected metric and parameters.

    This output dataset does not include label or score columns.

  • If you select Count Based as the feature selection method, the outputs differ slightly: the module generates a score for every column in the dataset and returns the columns in their original order.

Examples

For examples of how this module is used, explore the sample experiments in the Model Gallery.

Technical Notes

  • If you attempt to use a scoring method with a column of a data type not supported by the method, either the module will raise an error, or a zero score will be assigned to the column.

  • If a column contains logical (true/false) values, they are processed as True = 1 and False = 0, as shown in the one-line sketch after this list.

  • To ensure that a column is scored as a feature, use the Metadata Editor module to set the IsFeature attribute.

  • A column cannot be a feature if it has been designated as a Label or a Score.
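
The logical-to-numeric convention is the same coercion that R itself applies; a one-line illustration:

    # Logical values are scored as numbers: TRUE -> 1, FALSE -> 0.
    as.numeric(c(TRUE, FALSE, TRUE))   # returns 1 0 1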

Missing Values

  • You cannot specify as a target (label) column any column that has all missing values.

  • If a column contains missing values, they are ignored when computing the score for the column, as in the small sketch after this list.

  • If a column designated as a feature column has all missing values, a zero score is assigned.
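
A small sketch of this behavior, using base R's option for dropping incomplete pairs; the toy vectors are made up for illustration:

    # Correlation that skips rows where either value is missing, mirroring
    # how the module ignores missing values when scoring a column.
    x <- c(1, 2, NA, 4, 5)
    y <- c(2, 4, 6, NA, 10)
    cor(x, y, use = "pairwise.complete.obs")   # uses only the complete pairs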

Requirements

The following scoring methods accept only numeric and logical data columns:

  • Pearson Correlation

  • Kendall Correlation

  • Spearman Correlation

  • Fisher Score (the restriction does not apply to the target column)

  • Count-Based

Details of the Feature Selection Methods

Filter-Based Feature Selection provides a selection of widely used statistical tests for determining the subset of input columns that have the greatest predictive power. The code sketch after this list shows how several of the underlying statistics can be computed in R.

  • Pearson Correlation
    Pearson's correlation statistic, or Pearson's correlation coefficient, is also known in statistical models as the r value. For any two variables, it returns a value between -1 and +1 that indicates the strength of the correlation.

    Pearson's correlation coefficient is computed by taking the covariance of the two variables and dividing by the product of their standard deviations: r = cov(X, Y) / (σX σY). The coefficient is not affected by changes of scale in the two variables.

  • Mutual Information
    The Mutual Information Score method measures the contribution of a variable towards reducing uncertainty about the value of another variable — in this case, the label. Many variations of the mutual information score have been devised to suit different distributions.

    The mutual information score is particularly useful in feature selection because it measures how much knowing the value of a feature reduces uncertainty about the target, even when the relationship between the two is not linear.

  • Kendall Correlation
    Kendall's rank correlation is one of several statistics that measure the relationship between rankings of different ordinal variables, or different rankings of the same variable. In other words, it measures the similarity of the orderings of the data when ranked by each of the two quantities. Both this coefficient and Spearman's correlation coefficient are designed for use with non-parametric and non-normally distributed data.
  • Spearman Correlation
    Spearman's coefficient is a nonparametric measure of statistical dependence between two variables, and is sometimes denoted by the Greek letter rho. The Spearman’s coefficient expresses the degree to which two variables are monotonically related. It is also called Spearman rank correlation, because it can be used with ordinal variables.
  • Chi-Squared
    The two-way chi-squared test is a statistical method that measures how close observed counts are to the counts expected if the feature and the target were independent. The method assumes that the variables are random and drawn from an adequate sample of independent observations. The resulting chi-squared statistic indicates how far the observed results are from the expected (random) result.
  • Fisher Score
    The Fisher score (also called the Fisher method, or Fisher combined probability score) is sometimes termed the information score, because it represents the amount of information that one variable provides about some unknown parameter on which it depends.

    The score is computed by measuring the variance between the expected value of the information and the observed value. When variance is minimized, information is maximized. Since the expectation of the score is zero, the Fisher information is also the variance of the score.

  • Count-Based
    Count-based feature selection is a simple yet relatively powerful way of finding information about predictors. It is an unsupervised method of feature selection, meaning you don't need a label column. This method counts the frequencies of all values and then assigns a score to the column based on those frequency counts. It can be used to gauge the information weight of a particular feature and to reduce the dimensionality of the data without discarding informative features.

    For more information, see Data Transformation / Learning with Counts.
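
To make these definitions concrete, the following sketch computes several of the underlying statistics with base R on a toy dataset. These are the textbook statistics rather than the module's exact scoring formulas, and the data is invented for illustration.

    # Numeric feature and numeric target for the correlation-based scores.
    x <- c(1.2, 3.4, 2.2, 5.1, 4.4, 6.0, 3.3, 5.8)
    y <- c(1.0, 3.0, 2.5, 5.5, 4.0, 6.2, 3.1, 6.0)

    cor(x, y, method = "pearson")    # covariance / (sd(x) * sd(y))
    cor(x, y, method = "spearman")   # Pearson correlation of the ranks
    cor(x, y, method = "kendall")    # concordant vs. discordant pairs

    # Chi-squared: compare observed counts in a contingency table with the
    # counts expected under independence. (A small-sample warning on toy
    # data of this size is expected.)
    feature <- factor(c("a", "a", "b", "b", "a", "b", "a", "b"))
    label   <- factor(c(0, 0, 1, 1, 0, 1, 1, 1))
    chisq.test(table(feature, label))

    # Mutual information of two discrete variables, from the definition
    # I(X;Y) = sum over x,y of p(x,y) * log(p(x,y) / (p(x) * p(y))).
    p.xy <- table(feature, label) / length(feature)
    p.x  <- rowSums(p.xy)
    p.y  <- colSums(p.xy)
    sum(p.xy * log(p.xy / outer(p.x, p.y)), na.rm = TRUE)

    # Count-based scoring at its simplest: frequency counts of each value.
    table(feature)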

Expected Inputs

  Name     Type        Description
  -------  ----------  -------------
  Dataset  Data Table  Input dataset

Module Parameters

  Name                                 Range  Type             Default  Description
  -----------------------------------  -----  ---------------  -------  --------------------------------------------------------------------
  Feature scoring method               List   Scoring method            Choose the method to use for scoring
  Operate on feature columns only      Any    Boolean          true     Indicate whether to use only feature columns in the scoring process
  Target column                        Any    ColumnSelection  None     Specify the target column
  Number of desired features           >=1    Integer          1        Specify the number of features to output in results
  Minimum number of non-zero elements  >=1    Integer          1        Specify the number of features to output (for the Count Based method)

Outputs

  Name              Type        Description
  ----------------  ----------  -----------------------------------------------------
  Filtered dataset  Data Table  Filtered dataset
  Features          Data Table  Names of output columns and feature selection scores

Exceptions

  Exception description
  ---------------------------------------------------------------------------------------------
  Exception occurs if one or more specified columns of the dataset could not be found.
  Exception occurs if one or more inputs are null or empty.
  Exception occurs if a parameter is less than or equal to a specific value.
  Exception occurs if one or more specified columns have a type unsupported by the current module.

See Also

Concepts

Feature Selection Modules
Data Transformation / Learning with Counts
Fisher Linear Discriminant Analysis
A-Z Module List