Missing Values Scrubber (deprecated)

 

Updated: July 2, 2015

Specifies how to handle values that are missing from a dataset.

The Missing Values Scrubber module provides some basic methods for handling missing values. It expects a dataset as input, and it returns a dataset in which the missing values have been handled by the method that you select.

Warning

This module is provided for backward compatibility with experiments created using the pre-release version of Azure Machine Learning Studio, and will soon be deprecated. We recommend that you modify your experiments to use Clean Missing Data instead.

There are multiple ways to handle missing values. However, whichever method you choose, it applies to the entire set of columns you have selected. If you need to treat missing values differently in some columns, use Project Columns to select a subset of data before applying Missing Values Scrubber.

  1. For missing values...

    Use this drop-down list to select which method to use to handle missing values.

  2. Replace with value...

    Type a replacement value. This is only needed if you choose to substitute missing values with a custom value. If you specify a floating-point value to replace missing values in a column of integers, the value substituted for the missing value will be this value converted to the nearest integer.

  3. Columns with all MV (missing values)...

    Choose whether to keep or remove columns with all values missing.

  4. MV (missing values) indicator column

    Choose whether to generate missing value indicator columns.
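The last two options can be pictured with a small sketch. It is not the module's implementation: the dictionary-of-lists table, the use of `None` for a missing value, the `_IsMissing` suffix, and the choice to add an indicator for every kept column are all assumptions made for illustration.

```python
def apply_column_options(table, remove_all_missing=True, add_indicators=True):
    """Sketch of options 3 and 4 on a table stored as {column name: list of values}.

    None stands for a missing value. The "_IsMissing" suffix is a
    hypothetical naming choice, not the module's documented output.
    """
    result = {}
    for name, values in table.items():
        all_missing = all(v is None for v in values)
        if all_missing and remove_all_missing:
            # Option 3: drop columns in which every value is missing.
            continue
        result[name] = list(values)
        if add_indicators:
            # Option 4: add a Boolean column that flags the missing entries.
            result[name + "_IsMissing"] = [v is None for v in values]
    return result
```

For example, a table with an `age` column of `[30, None, 41]` and an entirely empty column would, under these assumptions, keep only `age` plus an indicator column `[False, True, False]`.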

You can select from the following substitution methods to replace missing values:

Replace using MICE

Replaces missing values by using "Multiple Imputation by Chained Equations" (MICE), a method broadly used in the statistics literature since work by Donald Rubin in the 1970s. This method initializes the missing entries with a default value. Then it updates each column by using an appropriate regression or classification algorithm. These updates are repeated a number of times, as specified by the Number of Iterations parameter.
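The initialize-then-update loop described above can be sketched for the simplest case of two numeric columns. This is a deterministic, single-imputation simplification, not the module's algorithm: it assumes the column mean as the default value, uses simple linear regression as the per-column predictor, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def chained_imputation(a, b, iterations=5):
    """Impute None entries in two numeric columns by regressing each on the other."""
    columns = [list(a), list(b)]
    missing = [[v is None for v in col] for col in columns]

    # Step 1: initialize missing entries with a default (assumed: the column mean).
    for col, mask in zip(columns, missing):
        observed = [v for v, m in zip(col, mask) if not m]
        default = sum(observed) / len(observed)
        for i, m in enumerate(mask):
            if m:
                col[i] = default

    # Step 2: repeatedly update each column from the other column.
    for _ in range(iterations):
        for target in (0, 1):
            predictor = 1 - target
            # Train only on rows where the target value was actually observed.
            train = [i for i, m in enumerate(missing[target]) if not m]
            slope, intercept = fit_line(
                [columns[predictor][i] for i in train],
                [columns[target][i] for i in train],
            )
            for i, m in enumerate(missing[target]):
                if m:
                    columns[target][i] = slope * columns[predictor][i] + intercept
    return columns[0], columns[1]
```

Calling `chained_imputation([1, 2, 3, None, 5], [2, 4, None, 8, 10])` on data that lies on the line y = 2x fills the gaps with values consistent with that line. The `iterations` argument plays the role of the Number of Iterations parameter.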

Replace with value

Assigns the replacement value to any missing value in every column with a data type of Integer, Double, Boolean, or Date. For date columns, the replacement value can also be entered as the number of 100-nanosecond ticks since 1/1/0001 12:00 AM.
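As a rough illustration of the two details above, the sketch below rounds a floating-point replacement value when the target column holds integers, and converts a tick count into a date. The function names and the `is_integer_column` flag are hypothetical, and how the module breaks ties such as 2.5 is not stated in this document, so Python's built-in `round` is only an assumption.

```python
from datetime import datetime, timedelta

TICKS_PER_MICROSECOND = 10  # one tick is 100 nanoseconds


def ticks_to_datetime(ticks):
    """Convert 100-nanosecond ticks since 1/1/0001 12:00 AM into a datetime."""
    return datetime(1, 1, 1) + timedelta(microseconds=ticks // TICKS_PER_MICROSECOND)


def replace_with_value(column, value, is_integer_column=False):
    """Substitute `value` for every None entry in `column`."""
    if is_integer_column:
        # A floating-point replacement is converted to the nearest integer.
        value = round(value)
    return [value if v is None else v for v in column]
```

Under these assumptions, `replace_with_value([1, None, 3], 3.7, is_integer_column=True)` yields `[1, 4, 3]`, and 864,000,000,000 ticks (one day) corresponds to 1/2/0001 12:00 AM.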

Replace with mean

Assigns the mean of a column to any missing value in every column with a data type of Integer, Double, or Boolean.

Replace with median

Assigns the median of a column to any missing value in every column with a data type of Integer or Double.

Replace with mode

Assigns the mode of a column to any missing value in every column with a data type of Integer, Double, Boolean, or Categorical.
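The mean, median, and mode methods all follow the same pattern: compute a statistic from the observed values of a column, then write it into the gaps. A minimal sketch using Python's `statistics` module, with `None` standing in for a missing value and a hypothetical helper name:

```python
import statistics


def replace_with_statistic(column, statistic):
    """Fill None entries with a statistic computed from the observed values.

    `statistic` is a function such as statistics.mean, statistics.median,
    or statistics.mode.
    """
    observed = [v for v in column if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in column]
```

For instance, `replace_with_statistic([1, None, 3], statistics.mean)` fills the gap with 2, and `replace_with_statistic(["a", "b", None, "a"], statistics.mode)` fills it with `"a"`. The statistic is computed per column, so each column can receive a different fill value.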

Remove row

Removes all rows in the dataset with one or more missing values. This is useful if the missing value can be considered as missing at random.
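In code terms, row removal amounts to a filter over the rows. The sketch below assumes that rows are stored as tuples and that `None` marks a missing value; the function name is hypothetical.

```python
def remove_rows_with_missing(rows):
    """Keep only the rows in which every value is present."""
    return [row for row in rows if all(v is not None for v in row)]
```

Note that a single missing value in any column is enough to drop the whole row, so this method can shrink a dataset considerably when missing values are spread across many columns.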

Do nothing

Leaves missing values unchanged.

Replace using probabilistic PCA

Replaces the missing values by using a linear model that analyzes the correlations between the columns and estimates a low-dimensional approximation of the data, from which the full data is reconstructed. The underlying dimensionality reduction is a probabilistic form of Principal Component Analysis (PCA). It implements a variant of the model proposed by Tipping and Bishop in the Journal of the Royal Statistical Society, Series B 61(3), 611–622.

Compared to other options (such as MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it directly approximates the covariance of the full dataset. It may therefore offer better performance for datasets that have missing values in many columns.

The key limitations of this method are:

  • It expands categorical columns into numerical indicators and computes a dense covariance matrix of the resulting data.

  • It is not optimized for sparse representations.

For these reasons, datasets with large numbers of columns and/or large categorical domains (tens of thousands) are not supported due to prohibitive space consumption.
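A back-of-the-envelope estimate shows why the space cost becomes prohibitive. The sketch assumes one indicator column per categorical level and 8-byte floating-point entries in the covariance matrix; neither detail is confirmed by this document, and the function name is hypothetical.

```python
def dense_covariance_bytes(numeric_columns, categorical_cardinalities, bytes_per_value=8):
    """Estimate the memory needed for a dense covariance matrix.

    Each categorical column is assumed to expand into one indicator
    column per distinct level.
    """
    expanded = numeric_columns + sum(categorical_cardinalities)
    # A dense covariance matrix has one entry per pair of expanded columns.
    return expanded * expanded * bytes_per_value
```

With 10 numeric columns and a single categorical column of 50,000 levels, the estimate is 50,010 × 50,010 × 8 bytes, or roughly 20 GB, regardless of how many rows the dataset has.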

| Name | Type | Description |
| --- | --- | --- |
| Dataset | Data Table | Input dataset with missing values |

| Name | Range | Type | Default | Description |
| --- | --- | --- | --- | --- |
| For missing values | List (subset) | Handling policy | Custom substitution value | Choose the method for handling missing values. |
| Columns with all values missing | Any | ColumnsWithAllValuesMissing | KeepColumns | Indicate whether columns with all values missing should be preserved in the output. |
| Missing values indicator column | Any | GenerateMissingValueIndicatorColumns | DoNotGenerate | Indicate whether a column that describes missing value handling should be added to the dataset. |

| Name | Type | Description |
| --- | --- | --- |
| Results dataset | Data Table | Scrubbed dataset |

For a list of all exceptions, see Machine Learning REST API Error Codes.

| Exception | Description |
| --- | --- |
| Error 0002 | An exception occurs if one or more parameters could not be parsed or converted from the specified type to the type required by the target method. |
| Error 0003 | An exception occurs if one or more inputs are null or empty. |
| Error 0008 | An exception occurs if the parameter is not in range. |
| Error 0013 | An exception occurs if the learner passed to the module has an invalid type. |
| Error 0018 | An exception occurs if the input dataset is not valid. |
| Error 0039 | An exception occurs if the operation failed. |
| Error 0010 | An exception occurs if input datasets have column names that should match, but they do not. |
| Error 0016 | An exception occurs if input datasets that are passed to the module should have compatible column types, but they do not. |
| Error 0067 | An exception occurs if a dataset has a different number of columns than expected. |
| Error 0017 | An exception occurs if one or more specified columns have a type that is unsupported by the current module. |
