Filter Based Feature Selection

 

Updated: September 21, 2017

Identifies the features in a dataset with the greatest predictive power

Category: Feature Selection Modules

This article describes how to use the Filter Based Feature Selection module in Azure Machine Learning to identify the columns in your input dataset that have the greatest predictive power.

In general, feature selection refers to the process of applying statistical tests to inputs, given a specified output, to determine which columns are more predictive of the output. The Filter Based Feature Selection module provides multiple feature selection algorithms to choose from, including correlation methods such as Pearson's or Kendall's correlation, mutual information scores, and chi-squared values. Azure Machine Learning also supports feature value counts as an indicator of information value.

When you use the Filter Based Feature Selection module, you provide a dataset, identify the column that contains the label or dependent variable, and then specify a single method to use in measuring feature importance.

The module outputs a dataset that contains the best feature columns, as ranked by predictive power. It also outputs the names of the features and their scores from the selected metric.

Why Use Filter-Based Feature Selection?

This module for feature selection is called "filter-based" because you use the selected metric to identify irrelevant attributes, and filter out redundant columns from your model. You choose a single statistical measure that suits your data, and the module calculates a score for each feature column. The columns are returned ranked by their feature scores.

By choosing the right features, you can potentially improve the accuracy and efficiency of classification.

You can then use only the columns with the best scores to build your predictive model. Columns with poor feature selection scores can be left in the dataset and ignored when you build a model.

How to Choose a Feature Selection Metric

The Filter-Based Feature Selection module provides a variety of metrics for assessing the information value in each column. This section provides a general description of each metric and how it is applied; a short code sketch after the list shows how several of these metrics can be computed. Additional requirements for using each metric are stated in the Technical Notes section and in the instructions for configuring the module.

  • Pearson Correlation

    Pearson’s correlation statistic, or Pearson’s correlation coefficient, is also known in statistical models as the r value. For any two variables, it returns a value that indicates the strength of the correlation.

    Pearson's correlation coefficient is computed by taking the covariance of two variables and dividing by the product of their standard deviations. The coefficient is not affected by changes of scale in the two variables.

  • Mutual Information

    The Mutual Information Score method measures the contribution of a variable towards reducing uncertainty about the value of another variable: namely, the label. Many variations of the mutual information score have been devised to suit different distributions.

    The mutual information score is particularly useful in feature selection because it maximizes the mutual information between the joint distribution and target variables in datasets with many dimensions.

  • Kendall Correlation

    Kendall's rank correlation is one of several statistics that measure the relationship between rankings of different ordinal variables or different rankings of the same variable. In other words, it measures the similarity of orderings when ranked by the quantities. Both this coefficient and Spearman’s correlation coefficient are designed for use with non-parametric and non-normally distributed data.

  • Spearman Correlation

    Spearman's coefficient is a nonparametric measure of statistical dependence between two variables, and is sometimes denoted by the Greek letter rho. The Spearman’s coefficient expresses the degree to which two variables are monotonically related. It is also called Spearman rank correlation, because it can be used with ordinal variables.

  • Chi Squared

    The two-way chi-squared test is a statistical method that measures how close expected values are to actual results. The method assumes that variables are random and drawn from an adequate sample of independent variables. The resulting chi-squared statistic indicates how far results are from the expected (random) result.

  • Fisher Score

    The Fisher score is sometimes termed the information score, because it represents the amount of information that one variable provides about some unknown parameter on which it depends.

    The score is computed by measuring the variance between the expected value of the information and the observed value. When variance is minimized, information is maximized. Since the expectation of the score is zero, the Fisher information is also the variance of the score.

  • Count Based

    Count-based feature selection is a simple yet relatively powerful way of finding information about predictors. The basic idea underlying count-based featurization is simple: by calculating counts of individual values within a column, you can get an idea of the distribution and weight of values, and from this, understand which columns contain the most important information.

    Count-based feature selection is an unsupervised method of feature selection, meaning you don't need a label column. This method also reduces the dimensionality of the data without losing information.

    For more information about how count-based features are created and why they are useful in machine learning, see Learning with Counts.
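
The following is a minimal sketch of how several of these metrics can be computed outside Studio with SciPy and scikit-learn. The data is synthetic, and the fisher_score formulation shown is one common definition; none of this is the module's own implementation.

```python
# Minimal sketch: computing the metrics above with SciPy/scikit-learn on a
# synthetic numeric feature and a binary label. Illustrative only; not the
# module's internal implementation.
import numpy as np
from scipy import stats
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
feature = rng.normal(100, 25, 200)                            # synthetic numeric feature
label = (feature + rng.normal(0, 25, 200) > 100).astype(int)  # binary label

# Correlation-style scores.
print("Pearson :", stats.pearsonr(feature, label)[0])
print("Spearman:", stats.spearmanr(feature, label)[0])
print("Kendall :", stats.kendalltau(feature, label)[0])

# Mutual information between the feature and the categorical label.
print("MI      :", mutual_info_classif(feature.reshape(-1, 1), label,
                                       random_state=0)[0])

# Chi-squared needs non-negative feature values and a categorical label.
print("Chi2    :", chi2(np.abs(feature).reshape(-1, 1), label)[0][0])

# One common formulation of the Fisher score: between-class scatter of the
# class means divided by the pooled within-class variance.
def fisher_score(x, y):
    mu = x.mean()
    classes = np.unique(y)
    num = sum((y == c).sum() * (x[y == c].mean() - mu) ** 2 for c in classes)
    den = sum((y == c).sum() * x[y == c].var() for c in classes)
    return num / den

print("Fisher  :", fisher_score(feature, label))

# Count-based: the number of non-zero entries in the column.
print("Count   :", int(np.count_nonzero(feature)))
```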

Tip

If you need a custom feature selection method, use the Execute R Script module.

This module provides two methods for determining feature scores:

Generate feature scores using a traditional statistical metric

  1. Add the Filter-Based Feature Selection module to your experiment. You can find it in the list of modules in Studio, in the Feature Selection group.

  2. Connect an input dataset that contains at least two columns that are potential features.

    To ensure that a column is analyzed and a feature score is generated, use the Edit Metadata module to set the IsFeature attribute.

    Important

    Ensure that the columns you are providing as input are potential features. For example, a column that contains a single value has no information value.

    If you know there are columns that would make bad features, you can remove them from the column selection. You could also use the Edit Metadata module to flag them as Categorical.

  3. For Feature scoring method, choose one of the following established statistical methods to use in calculating scores.

    Method | Requirements
    Pearson Correlation | Label can be text or numeric. Features must be numeric.
    Mutual Information | Labels and features can be text or numeric. Use this method for computing feature importance for two categorical columns.
    Kendall Correlation | Label can be text or numeric, but features must be numeric.
    Spearman Correlation | Label can be text or numeric, but features must be numeric.
    Chi Squared | Labels and features can be text or numeric. Use this method for computing feature importance for two categorical columns.
    Fisher Score | Label can be text or numeric, but features must be numeric.
    Counts | See: Use count-based feature selection
    Tip

    If you change the selected metric, all other selections are reset, so be sure to set this option first.

  4. Select the Operate on feature columns only option to generate a score only for those columns that have been previously marked as features.

    If you deselect this option, the module will create a score for any column that otherwise meets the criteria, up to the number of columns specified in Number of desired features.

  5. For Target column, click Launch column selector to choose the label column either by name or by its index (indexes are one-based).

    A label column is required for all methods that involve statistical correlation. The module returns a design-time error if you choose no label column or multiple label columns.

  6. For Number of desired features, type the number of feature columns you want returned as a result.

    • The minimum number of features you can specify is 1, but we recommend that you increase this value.

    • If the specified number of desired features is greater than the number of columns in the dataset, then all features are returned, even those with zero scores.

    • If you specify fewer result columns than there are feature columns, the features are ranked by descending score, and only the top features are returned.

  7. Run the experiment, or select the Filter Based Feature Selection module and then click Run selected.

  8. View and interpret the results:

    • To see a complete list of the feature columns that were analyzed, and their scores, right-click the module, select Features, and click Visualize.

    • To view the dataset that is generated based on your feature selection criteria, right-click the module, select Dataset, and click Visualize.

      If the dataset contains fewer columns than you expected, check the module settings, and the data types of the columns provided as input. For example, if you set Number of desired features to 1, the output dataset contains just two columns: the label column, and the most highly ranked feature column. (The sketch after these steps illustrates this ranking behavior.)
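
As a minimal sketch of this ranking behavior, assuming a pandas DataFrame and Pearson scoring (filter_top_k is a hypothetical helper, not the module's API):

```python
# Minimal sketch of the ranking behavior described in the steps above:
# score each numeric candidate column against the label, rank by descending
# absolute Pearson score, and keep the label plus the top-k features.
import pandas as pd
from scipy import stats

def filter_top_k(df: pd.DataFrame, label_col: str, k: int) -> pd.DataFrame:
    numeric = df.select_dtypes(include="number")
    features = [c for c in numeric.columns if c != label_col]
    scores = {c: abs(stats.pearsonr(numeric[c], numeric[label_col])[0])
              for c in features}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # If k exceeds the number of candidates, every feature is returned, even
    # low scorers; with k=1 the result is the label plus one feature column.
    return df[[label_col] + ranked[:k]]

# Example (hypothetical data): filter_top_k(auto, "highway-mpg", 1) would
# return just the label column and the single best-scoring feature.
```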

Use count-based feature selection

  1. Add the Filter-Based Feature Selection module to your experiment. You can find it in the list of modules in Studio, in the Feature Selection group.

  2. Connect an input dataset that contains at least two columns that are possible features.

  3. Select Count Based from the list of statistical methods in the Feature scoring method dropdown list.

  4. For Minimum number of non-zero elements, indicate the minimum number of feature columns to include in the output.

    By default, the module will output all columns that meet the requirements. The module cannot output any column that gets a score of zero.

  5. Run the experiment, or select just the module, and click Run Selected.

  6. View the results:

    • To see the list of feature columns with their scores, right-click the module, select Features, and click Visualize.
    • To see the dataset containing the analyzed columns, right-click the module, select Dataset, and click Visualize.

    Unlike other methods, the Count Based feature selection method does not rank the variables by highest scores, but returns all variables with a non-zero score, in their original order.

    String features always get a zero (0) score and thus are not output. (A sketch of this selection behavior follows.)
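
A minimal sketch of this behavior with pandas, under one plausible reading of Minimum number of non-zero elements as a per-column threshold (the helper name and that interpretation are assumptions, not taken from the module's documentation):

```python
# Minimal sketch of count-based selection: count the non-zero, non-missing
# entries per numeric column and keep qualifying columns in their original
# order, without ranking. Hypothetical helper; the threshold interpretation
# is an assumption.
import pandas as pd

def count_based_select(df: pd.DataFrame, min_nonzero: int) -> pd.DataFrame:
    numeric = df.select_dtypes(include="number")   # string columns score zero
    counts = (numeric.fillna(0) != 0).sum()        # non-zero count per column
    keep = [c for c in numeric.columns if counts[c] >= min_nonzero]
    return df[keep]                                # original order, no ranking
```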

You can see examples of how this module is used by exploring these sample experiments in the Model Gallery:

  • In the third step of the Text Classification sample, Filter-Based Feature Selection is used to identify the 15 best features. Feature hashing is used to convert the text documents to numeric vectors. Pearson’s correlation is then used on the vector features.

  • This article provides an introduction to feature selection and feature engineering in machine learning: Machine learning feature selection and feature engineering

To see some examples of feature scores, see Table of scores compared.

If you use Pearson Correlation, Kendall Correlation, or Spearman Correlation on a numeric feature and a categorical label, the feature score is calculated as follows (a code sketch follows the steps):

  1. For each level in the categorical column, compute the conditional mean of the numeric column.

  2. Correlate the column of conditional means with the numeric column.
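
A minimal sketch of these two steps with pandas and SciPy (the function name is hypothetical):

```python
# Minimal sketch of the two-step score: (1) conditional mean of the numeric
# feature per label level, (2) correlation of those means with the feature.
import pandas as pd
from scipy import stats

def numeric_vs_categorical_score(feature: pd.Series, label: pd.Series) -> float:
    cond_means = feature.groupby(label).transform("mean")  # step 1
    return stats.pearsonr(cond_means, feature)[0]          # step 2
```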

Requirements

  • A feature selection score cannot be generated for any column that is designated as a label or as a score column.

  • If you attempt to use a scoring method with a column of a data type not supported by the method, either the module will raise an error, or a zero score will be assigned to the column.

  • If a column contains logical (true/false) values, they are processed as True = 1 and False = 0.

  • A column cannot be a feature if it has been designated as a Label or a Score.

Missing values

  • You cannot specify as a target (label) column any column that has all missing values.

  • If a column contains missing values, they are ignored when computing the score for the column. (The sketch after this list shows this behavior with pandas.)

  • If a column designated as a feature column has all missing values, a zero score is assigned.
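
For example, pandas' Series.corr drops rows where either value is missing, which matches the missing-value behavior described above:

```python
# Minimal sketch: pandas ignores missing values when computing correlation,
# matching the missing-value behavior described above.
import numpy as np
import pandas as pd

feature = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
label = pd.Series([1.1, 1.9, 3.0, 4.2, np.nan])

# Series.corr computes Pearson's r over the complete pairs only
# (here, rows 0, 1, and 3).
print(feature.corr(label))
```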

Table of scores compared

To give you an idea of how the scores compare when using different metrics, the following table presents some feature selection scores from multiple features in the automobile price dataset, given the dependent variable highway-mpg.

Feature column | Pearson score | Count score | Kendall score | Mutual information
highway-mpg | 1 | 205 | 1 | 1
city-mpg | 0.971337 | 205 | 0.892472 | 0.640386
curb-weight | 0.797465 | 171 | 0.673447 | 0.326247
horsepower | 0.770908 | 203 | 0.728289 | 0.448222
price | 0.704692 | 201 | 0.651805 | 0.321788
length | 0.704662 | 205 | 0.53193 | 0.281317
engine-size | 0.67747 | 205 | 0.581816 | 0.342399
width | 0.677218 | 205 | 0.525585 | 0.285006
bore | 0.594572 | 201 | 0.467345 | 0.263846
wheel-base | 0.544082 | 205 | 0.407696 | 0.250641
compression-ratio | 0.265201 | 205 | 0.337031 | 0.288459
fuel-system | na | na | na | 0.308135
make | na | na | na | 0.213872
drive-wheels | na | na | na | 0.213171
height | na | na | na | 0.1924
normalized-losses | na | na | na | 0.181734
symboling | na | na | na | 0.159521
num-of-cylinders | na | na | na | 0.154731
engine-type | na | na | na | 0.135641
aspiration | na | na | na | 0.068217
body-style | na | na | na | 0.06369
fuel-type | na | na | na | 0.049971
num-of-doors | na | na | na | 0.017459
engine-location | na | na | na | 0.010166
  • Mutual information scores can be created for all column types, including strings.

  • The other scores included in this table, such as Pearson's correlation or count-based feature selection, require numeric values. String features get a score of 0 and hence are not included in the output. For exceptions, see the Technical Notes section.

  • The count-based method does not treat a label column any differently from feature columns.

Expected inputs

Name | Type | Description
Dataset | Data Table | Input dataset

Module parameters

Name | Range | Type | Default | Description
Feature scoring method | List | Scoring method | | Choose the method to use for scoring
Operate on feature columns only | Any | Boolean | true | Indicate whether to use only feature columns in the scoring process
Target column | Any | ColumnSelection | None | Specify the target column
Number of desired features | >=1 | Integer | 1 | Specify the number of features to output in results
Minimum number of non-zero elements | >=1 | Integer | 1 | Specify the number of features to output (for the Count Based method)

Outputs

Name | Type | Description
Filtered dataset | Data Table | Filtered dataset
Features | Data Table | Names of output columns and feature selection scores

Exceptions

Exception | Description
Error 0001 | Exception occurs if one or more specified columns of the dataset couldn't be found.
Error 0003 | Exception occurs if one or more of the inputs are null or empty.
Error 0004 | Exception occurs if a parameter is less than or equal to a specific value.
Error 0017 | Exception occurs if one or more specified columns have a type unsupported by the current module.

See also

Feature Selection
Fisher Linear Discriminant Analysis
A-Z Module List
