Feature Selection Modules


Updated: May 31, 2017

This section describes the modules that are provided in Azure Machine Learning for performing feature selection. Each module takes a dataset as input and applies well-known statistical methods to the columns you specify. The output is generally a set of metrics that you can use to identify the columns with the best information value. Multiple feature selection modules are provided; choose one depending on the type of data that you have and the requirements of the statistical technique that is applied.

In machine learning and statistics, feature selection is the process of selecting a subset of relevant, useful features for use in building an analytical model. Feature selection helps narrow the field of data to just the most valuable inputs, reducing noise and improving training performance.

Often features are created from the raw data through a process of feature engineering. For example, a time stamp by itself might not be useful for modeling until it is transformed into units such as days or months, or into categories relevant to the problem, such as holiday versus working day.
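As a minimal illustration of this kind of feature engineering, the following sketch uses pandas to derive several candidate features from a raw timestamp column. The column names and dates are hypothetical:

```python
import pandas as pd

# Hypothetical raw data: a single timestamp column.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2017-05-29", "2017-05-30", "2017-06-03"
])})

# Derive features that are more useful to a model than the raw value.
df["day_of_week"] = df["timestamp"].dt.dayofweek   # 0 = Monday
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = df["day_of_week"] >= 5          # Saturday or Sunday

print(df)
```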

New users of machine learning are often tempted to add in all data that is available, expecting the algorithm will find something interesting. However, feature selection can improve your model and prevent problems.

  • The data might contain redundant features, which provide no more information than the currently selected features.

  • The data might also contain irrelevant features that provide no useful information in any context.

  • Including irrelevant fields not only increases training time, but can also lead to poor results.

  • With some algorithms, having duplicate information in the training data can lead to a phenomenon called multicollinearity, in which the presence of two highly correlated variables can cause the calculations for other variables to become much less accurate.
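To see how redundant columns can be detected before training, the following sketch computes pairwise Pearson correlations with pandas and flags highly correlated pairs. The data, column names, and the 0.95 threshold are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2.0 + rng.normal(scale=0.01, size=200),  # nearly duplicates x1
    "x3": rng.normal(size=200),                          # independent
})

# Pairwise Pearson correlations; values near +/-1 flag redundant columns.
corr = df.corr()
print(corr.round(2))

# Report pairs above an (illustrative) 0.95 threshold.
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.95:
            print(f"{a} and {b} look redundant (r = {corr.loc[a, b]:.2f})")
```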

Machine Learning Studio provides multiple feature selection methods so that you can work with all data types. Additionally, some machine learning algorithms use some kind of feature selection or dimensionality reduction as part of the learning process. When you use these learners, you can skip the feature selection process and let the algorithm decide the best inputs.

These options are summarized here to help you choose the method most suited to your problem and to your data.

The following feature selection modules are provided in Azure Machine Learning Studio.

Filter-Based Feature Selection

The Filter-Based Feature Selection module lets you choose from among well-known feature selection methods, and outputs both the feature selection statistics, and the filtered dataset.

Your choice of a filter selection method depends in part on what sort of input data you have.

Method | Supported feature inputs | Supported labels
Pearson's correlation | Numeric and logical columns only | A single numeric or logical column
Mutual information score | All data types | A single column of any data type
Kendall's correlation coefficient | Numeric and logical columns only | A single numeric or logical column; columns should have values that can be ranked
Spearman's correlation coefficient | Numeric and logical columns only | A single numeric or logical column
Chi-squared statistic | All data types | A single column of any data type
Fisher score | Numeric and logical columns only | A single numeric or logical column; string columns are assigned a score of 0
Count-based feature selection | All data types | A label column is not required
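The Studio module itself is configured through the designer, but the filter idea is easy to see in an open-source analogue. The following sketch uses scikit-learn's SelectKBest with the chi-squared score function; the dataset and the choice of k are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score every feature against the label, then keep only the top k.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print("chi-squared scores:", selector.scores_.round(1))
print("kept columns:", selector.get_support(indices=True))
```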

Fisher Linear Discriminant Analysis

Linear Discriminant Analysis is a supervised learning technique that works with numeric features and a single categorical label. The method is useful for feature selection because it identifies the combination of features that best separates the groups.

You can use the Fisher Linear Discriminant Analysis module just to generate a set of scores for review, or you can use the replacement dataset generated by the module for training.
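An analogous open-source sketch uses scikit-learn's LinearDiscriminantAnalysis, which projects the numeric features onto the discriminant axes that best separate the classes. The dataset is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project the four numeric features onto the two linear combinations
# (discriminants) that best separate the three classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("original shape:", X.shape)         # (150, 4)
print("transformed shape:", X_lda.shape)  # (150, 2)
```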

Permutation Feature Importance

Permutation Feature Importance lets you estimate the effect of each feature in your dataset by computing performance scores for a model after randomly shuffling the values of one feature at a time.

The scores that the module returns represent the potential change in the accuracy of a trained model if a feature's values were shuffled. You can use the scores to determine the effect of individual variables on the model.
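The same idea is available in scikit-learn as permutation_importance; the following sketch is a minimal analogue, with the model and dataset chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and measure the drop
# in accuracy; larger drops indicate more important features.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
print(result.importances_mean.round(3))
```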

Related Tasks

Although the following modules are not included in the Feature Selection group, they can be useful in reducing the dimensionality of your data, or for finding correlations.

  • Principal Component Analysis

    When you have a dataset with many columns, you can use the Principal Component Analysis module to detect the columns that are the most useful; that is, the columns that contain the most information (variation) about the original data. A minimal sketch follows this list.

    This module can be found in the Data Transformation group, under Scale and Reduce.

  • Learning with Counts

    Count-based featurization is a new technique for determining useful features from large datasets. Using these modules, you can analyze datasets to find the best features, save a set of features to use with new data, or update an existing feature set.

  • Compute Linear Correlation

    Use this module to compute a Pearson correlation coefficient for each possible pair of variables in the input dataset. The Pearson correlation coefficient, also called Pearson's R, is a statistical value that measures the linear relationship between two variables.

    This module can be found in the Statistical Functions group.
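For the Principal Component Analysis item above, the following sketch shows the same idea using scikit-learn's PCA; the 95% variance threshold is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep enough components to explain roughly 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
print("reduced from", X.shape[1], "to", X_pca.shape[1], "columns")
```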

Feature selection is typically performed when exploring data and developing a new model. You add a feature selection module to your experiment and attach a dataset to generate scores that inform your decision of which columns to use, and if the values in the columns are valid. You might remove the feature selection tasks from the experiment when you operationalize a model, and only periodically check the data to be sure that features have not changed.

Note that feature selection is different from feature engineering, which aims to create new features out of existing data.

Machine Learning Methods that Use Feature Selection

Some learners in Azure Machine Learning Studio also provide parameters that can be used to optimize feature selection when training. If you are using a method that has its own heuristic for choosing features, it is often better to rely on that heuristic rather than pre-selecting features.

  • Boosted Decision Tree Classification Models; Boosted Decision Tree Regression Models

    These modules internally create a feature summary, and features with a weight of 0 are not used by any tree splits.

    When you visualize the best trained model, you can look at each of the trees. If a feature is never used in any tree, it is likely a candidate for removal.

    Parameter sweeping is also recommended to optimize selection.

  • Logistic Regression Models; Linear Classification Models

    The modules for multiclass and binary logistic regression support L1 and L2 regularization.

    Regularization is a way of adding constraints during training that control some aspect of the learned model. Regularization is generally used to avoid overfitting. Machine Learning Studio supports regularization for the L1 or L2 norms of the weight vector in linear classification algorithms.

    • L1 regularization is useful if the goal is to have a model that is as sparse as possible.

    • L2 regularization prevents any single coordinate in the weight vector from growing too much in magnitude, so it is useful if the goal is to have a model with small overall weights.

    • L1-regularized logistic regression is more aggressive about assigning a weight of 0 to features, and is therefore useful for identifying features that can be removed, as the sketch after this list illustrates.
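As a minimal open-source sketch of L1-induced sparsity, the following example fits an L1-regularized logistic regression with scikit-learn and reports which feature weights were driven to zero. The dataset and the regularization strength C are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The L1 penalty drives uninformative weights exactly to zero;
# C = 0.1 is an illustrative regularization strength.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X, y)

coef = model.named_steps["logisticregression"].coef_.ravel()
print("zero-weight features:", np.flatnonzero(coef == 0))
```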

All feature selection modules and analytical methods that support numeric and logical columns also support date-time and time span columns. These columns are treated as simple numeric columns, where each value is equal to the number of ticks.
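The analogous conversion in pandas is shown below. Note that pandas represents timestamps as nanoseconds since the Unix epoch rather than .NET ticks, so this is an analogue of the behavior, not the Studio implementation:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2017-05-30", "2017-05-31"]))

# Convert each timestamp to a plain integer (nanoseconds since the
# epoch in pandas; the Studio modules use .NET ticks instead, but the
# principle - one number per timestamp - is the same).
numeric = ts.astype("int64")
print(numeric)
```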

The Feature Selection category includes these modules:

Module | Description
Filter Based Feature Selection | Identifies the features in a dataset with the greatest predictive power
Fisher Linear Discriminant Analysis | Identifies the linear combination of feature variables that can best group data into separate classes
Permutation Feature Importance | Computes the permutation feature importance scores of feature variables, given a trained model and a test dataset
