Clean Missing Data

 

Updated: September 21, 2017

Specifies how to handle the values missing from a dataset

Category: Data Transformation / Manipulation

This article describes how to use the Clean Missing Data module in Azure Machine Learning Studio to remove, replace, or infer missing values from a dataset.

Data scientists often check data for missing values and then perform various operations to fix the data or insert new values. The goal of such cleaning operations is to prevent problems caused by missing data that can arise when training a model.

This module supports multiple type of operations for "cleaning" missing vaues, including:

  • replace missing values with a placeholder, mean, or other value
  • completely remove rows and columns that have missing values
  • infer values based on statistical methods
System_CAPS_ICON_tip.jpg Tip

New to machine learning? This article provides a good explanation of why you would use each of the different methods for replacing missing values: Methods for handling missing values

Using this module does not change your source dataset. Instead, it creates a new dataset in your workspace that you can use in the subsequent workflow. You can also save the new, cleaned dataset for reuse.

This module also outputs a definition of the transformation used to clean the missing values. You can re-use this transformation on other datasets that have the same schema, by using the Apply Transformation module.

This module lets you define a cleaning operation. You can also save the cleaning operation so that you can apply it later to new data. See the following links for a description of how to create and save a cleaning process:

System_CAPS_ICON_important.jpg Important

The cleaning method that you use for handling missing values can dramatically affect your results. We recommend that you experiment with different methods. Consider both the justification for use of a particular method, and the quality of the results.

To replace missing values

Each time that you apply the Clean Missing Data module to a set of data, the same cleaning operation is applied to all columns that you select. Therefore, if you need to clean different columns using different methods, use separate instances of the module.

  1. Add the Clean Missing Data module to your experiment, and connect the dataset that has missing values.

  2. For Columns to be cleaned, choose the columns that contain the missing values you want to change. You can choose multiple columns, but you must use the same replacement method in all selected columns. Therefore, typically you will need to clean string columns and numeric columns separately.

    System_CAPS_ICON_note.jpg Note

    Any cleaning or replacement method that you choose must be applicable to all columns in the selection. If the data in any column is incompatible with the specified operation, the module will return an error and stop the experiment.

  3. For Minimum missing value ratio, specify the minimum number of missing values required for the operation to be performed.

    You use this option in combination with Maximum missing value ratio to define the conditions under which a cleaning operation will be performed on the dataset. If there are too many or too few rows that are missing values, the operation will not be performed.

    The number you enter represents the ratio of missing values to all values in the column. By default, the Minimum missing value ratio property is set to 0. This means that missing values will be cleaned even if there is only one missing value. For an example of how to use this option, see Setting a Threshold for Cleaning Operations.

    System_CAPS_ICON_warning.jpg Warning

    This condition must be met by each and every column in order for the specified operation to apply. For example, assume you selected three columns and then set the minimum ratio of missing values to .2 (20%), but only one column actually has 20% missing values. In this case, the cleanup operation would apply only to the column with over 20% missing values. Therefore, the other columns would be unchanged.

    If you have any doubt about whether missing values were changed, select the option, Generate missing value indicator column. A column will be appended to the dataset to indicate whether or not each column met the specified criteria for the minimum and maximum ranges.

  4. For Maximum missing value ratio, specify the maximum number of missing values that can be present for the operation to be performed.

    For example, you might want to perform missing value substitution only if 30% or fewer of the rows contain missing values, but leave the values as is if more than 30% of rows have missing values.

    You define the number as the ratio of missing values to all values in the column. By default, the Maximum missing value ratio is set to 1. This means that missing values will be cleaned even if 100% of the values in the column are missing.

    System_CAPS_ICON_note.jpg Note

    When you set a threshold using the options Minimum missing value ratio or Maximum missing value ratio, the cleaning operation will not be performed if even one of the selected columns does not meet the criteria.

  5. For Cleaning Mode, select one of the following options for replacing or removing missing values:

    • Replace using MICE
      For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as "Multivariate Imputation using Chained Equations" or "Multiple Imputation by Chained Equations".

      In a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values. In contrast, in a single imputation method (such as replacing a missing value with a column mean) a single pass is made over the data to determine the fill value.

      All imputation methods introduce some error or bias, but multiple imputation better simulates the process generating the data and the probability distribution of the data.

      For a general introduction to methods for handling missing values, see Missing Data: the state of the art. Schafer and Graham, 2002.

      System_CAPS_ICON_warning.jpg Warning

      The Replace using MICE option cannot be applied to completely empty columns. Such columns must be removed or passed to the output as is.

    • Custom substitution value
      Use this option to specify a placeholder value (such as a 0 or NA) that applies to all missing values.

      The value that you specify as a replacement must be compatible with the data type of the column.

    • Replace with mean
      Calculates the column mean and uses the mean as the replacement value for each missing value in the column.

      Applies only to columns that have Integer, Double, or Boolean data types. See the Technical Notes section for more information.

    • Replace with median
      Calculates the column median value and assigns that as the replacement for any missing value in the column.

      Applies only to columns that have Integer or Double data types. See the Technical Notes section for more information.

    • Replace with mode
      Calculates the mode for the column and uses that as the replacement value for every missing value in the column.

      Applies to columns that have Integer, Double, Boolean, or Categorical data types. See the Technical Notes section for more information.

    • Remove entire row
      Completely removes any row in the dataset that has one or more missing values. This is useful if the missing value can be considered randomly missing.

    • Remove entire column
      Completely removes any column in the dataset that has one or more missing values.

    • Replace using Probabilistic PCA
      Replaces the missing values by using a linear model that analyzes the correlations between the columns and estimates a low-dimensional approximation of the data, from which the full data is reconstructed. The underlying dimensionality reduction is a probabilistic form of Principal Component Analysis (PCA), and it implements a variant of the model proposed in the Journal of the Royal Statistical Society, Series B 21(3), 611–622 by Tipping and Bishop.

      Compared to other options, such as Multiple Imputation using Chained Equations (MICE), this option has the advantage of not requiring the application of predictors for each column. Instead, it approximates the covariance for the full dataset. It may therefore offer better performance for datasets that have missing values in many columns.

      The key limitations of this method are that it expands categorical columns into numerical indicators and computes a dense covariance matrix of the resulting data. It also is not optimized for sparse representations. For these reasons, datasets with large numbers of columns and/or large categorical domains (tens of thousands) are not supported due to prohibitive space consumption.

      System_CAPS_ICON_tip.jpg Tip

      Remember that the method you choose will be applied to all columns in the selection. Thus, if you want to replace some missing values with zeroes in some columns but insert a placeholder in other columns, you should use Select Columns in Dataset to separate the data and use different instances of the Clean Missing Data module.

  6. The option Replacement value will become available only if you have selected the option, Custom substitution value. Type a new value, and it will be used as the replacement value for all missing values in the column.

    Note that you can use this option only in columns that have the Integer, Double, Boolean, or Date data types. For date columns, the replacement value can also be entered as the number of 100-nanosecond ticks since 1/1/0001 12:00 A.M.

  7. Select the option, Generate missing value indicator column, if you want to output some indication of whether the values in the column met the criteria for missing value cleaning. This option is particularly useful when you are setting up a new cleaning operation and want to make sure it works as designed.

  8. Run the experiment, or select the Clean Missing Data module and click Run selected.

  9. The module returns two outputs:

    • Cleaned dataset. A dataset comprised of the selected columns, with missing values handled as specified, along with an indicator column, if you selected that option.

      Columns not selected for cleaning are also "passed through".

    • Cleaning transformation. The definition of the data transformation used for cleaning can be saved in your workspace and applied to new data later.

To apply a cleaning operation to new data

If you need to repeat cleaning operations often, we recommend that you save your recipe for data cleansing as a transform, to reuse with the same dataset. Saving a cleaning transformation is particularly useful if you must frequently re-import and then clean data that has the same schema.

  1. Add the Apply Transformation module to your experiment.

  2. Add the dataset you want to clean, and connect the dataset to the right-hand input port.

  3. Expand the Transforms group in the left-hand pane of Studio. Locate the saved transformation and drag it into the experiment.

  4. Connect the saved transformation to the left input port of Apply Transformation.

  5. Note that when you apply a saved transformation, you cannot select the columns to which the transformation will be applied. That is because the transformation has been already defined and applies automatically to the data types specified in the original operation.

    One benefit is that, if you created a transformation on a subset of numeric columns, you can later apply this transformation to a dataset of mixed column types without raising an error, and the missing values are changed only in the matching numeric columns.

  6. Run the experiment.

You can see examples of how this module is used by exploring these sample experiments in the Model Gallery:

This section contains implementation details, as well as known issues and commonly asked questions.

  • An error occurs if the mean or median option is used when any string columns are selected. If you need to process columns of different data types, create two instances of Clean Missing Data.

  • When replacing missing values with a mean value in columns with the Boolean, Integer, DateTime, or TimeSpan data types, the column is first converted to floating point numbers, the mean is calculated, and then the result is rounded to the nearest value of the original data type.

  • When you type a replacement value, the value must be compatible with the data type in the selected column.

  • Values of NaN, Inf, and –Inf are allowed for columns where the data type is Double.

  • When using the MICE method, the replacement value is predicted by using the trained MICE model.

  • Using Clean Missing Data can reset column types to feature. If your data contains other types of columns, such as labels, use Edit Metadata to correct the column types.

Restrictions on Cleaning Transformations

The following restrictions apply when you use a saved transformation (based on Clean Missing Data) to new data:

  • A saved transformation cannot generate indicator values, even if this option was used in the original cleaning operation.

    Consider the indicator values as most useful when testing a new transformation.

  • The transformation does not calculate new values based on the new dataset. In other words, if you used Clean Missing Data on Dataset A and generated a mean value of 0.5, that same value would be applied as the mean for replacing missing values in Dataset B, regardless of the actual values in Dataset B.

  • The data type of the columns in the new dataset must match the data type of the column that the transformation was originally created on.

    If any other operations are performed on the column that implicitly change the data type, you will get an error.

    For example, suppose you create a mean for an integer data column [Col1], and save the transformation. Now you want to apply the cleanup transformation to a copy of [Col1] that has been adjusted using a formula, such as ([Col1] /1.5). To ensure that the result is an integer, you round up the result, but still get an error when you apply the transformation. However, no error is raised when you adjust the value using a formula such as ([Col 1] * 10). You can use Edit Metadata to explicitly reset the data type to integer. Note that, in general, operations in Apply Math Operation module implicitly change numeric columns to double.)

Setting a Threshold for Cleaning Operations

When you specify a threshold for cleaning operations using the options Minimum missing value ratio or Maximum missing value ratio, the results can be unexpected or confusing.

To demonstrate how the options for maximum and minimum missing values work, we have provided some examples from the automobile prices sample dataset, which has many columns with missing values. The following table shows the count of missing values for several columns in that dataset, together with the ratio of missing values computed on the dataset. The ratio of missing values (in the rightmost column) is the value that would be used in evaluating the dataset against the specified threshold values.

So let's assume that you set Minimum missing value ratio to 0.019 and set Maximum missing value ratio to 0.020. What would happen?

Column nameCount of missing valuesRatio of missing values
Normalized-losses410.2
Bore40.019512195
Stroke40.019512195
Compression ratio00
  • The normalized-losses and compression-ratio columns do not meet the threshold criteria.

  • The bore and stroke columns meet the threshold criteria

Because some columns in the selection did not meet the specified criteria, no cleaning operation was performed on any column. To help you figure out what happened, the module returns the value FALSE in the two indicator columns, bore_IsMissing and stroke_IsMissing.

However, if you change the threshold back to the default values of 0 for Minimum missing value ratio and 1 for Maximum missing value ratio, an indicator column is returned for all selected columns, and the specified operation is performed.

System_CAPS_ICON_tip.jpg Tip

If you are uncertain about whether missing value clean-up is working as expected, select the Generate missing value indicator column option.

Known Issues

  • Error when MICE is applied on training data set and test set contains missing values

    If you use the MICE method to clean data and then process a dataset that contains missing values, you might get the following error:

    AFx Library library exception: Model is not trained. . ( Error 1000 )

    This error occurs only when the MICE method is selected, and if the training dataset does not contain missing values but the test dataset does. This issue will be fixed in an upcoming refresh.

NameTypeDescription
DatasetData TableDataset to be cleaned
NameRangeTypeDefaultDescription
Columns to be cleanedAnyColumnSelectionAllSelect columns for the missing values clean operation.
Minimum missing value ratio[0.0;1.0]Float0.0Clean only column with missing value ratio above the specified value, out of a set of all selected columns.
Maximum missing value ratio[0.0;1.0]Float1.0Clean only columns with missing value ratio below the specified value out of a set of all selected columns.
Cleaning modeListHandling policyCustom substitution valueChoose an algorithm to use when cleaning missing values.
Replacement valueAnyString"0"Type a value to take the place of missing values.

This value is optional.
Cols with all missing valuesAnyColumnsWithAllValuesMissingRemoveIndicate if columns of all missing values should be preserved in the output.
Generate missing value indicator columnAnyBooleanfalseGenerate a column that indicates which rows were cleaned.
Number of iterations[1;10]Integer5Specify the number of iterations when using MICE.
Number of iterations for PCA prediction[1;50]Integer10Specify the number of iterations when using a PCA prediction.
NameTypeDescription
Cleaned datasetData TableCleaned dataset
Cleaning transformationITransform interfaceTransformation that is to be passed to the Apply Transformation module to clean new data.

For a list of all exceptions, see Module Error Codes.

ExceptionDescription
Error 0002An exception occurs if one or more parameters could not be parsed or converted from the specified type into the type required by the target method.
Error 0003An exception occurs if one or more input datasets are null or empty.
Error 0008An exception occurs if a parameter is not in range.
Error 0013An exception occurs if the leaner passed to the module has an invalid type.
Error 0018An exception occurs if the input dataset is not valid.
Error 0039An exception occurs if the operation fails.

Manipulation
Data Transformation
A-Z Module List

Show: