Clip Values

 

Updated: October 6, 2017

Detects outliers and clips or replaces their values

Category: Data Transformation / Scale and Reduce

This article describes how to use the Clip Values module in Azure Machine Learning Studio, to identify and optionally replace data values that are above or below a specified threshold. This is useful when you want to remove outliers or replace them with a mean, a constant, or other substitute value.

You connect the module to a dataset that has the numbers you want to clip, choose the columns to work with, and then set a threshold or range of values, and a replacement method.

First, determine the columns you want to clip. The module applies the same criteria and replacement method to all columns that you include in the selection, so if you need to apply different criteria to different columns, use a new instance of Clip Values for each set of similar columns.

We also recommend that you test the method you'll apply to the data on a small subset of data first.

  1. Add the Clip Values module to your experiment and connect it to the dataset you want to modify.

  2. In List of columns, use the Column Selector to choose the columns to which Clip Values will be applied.

  3. For Set of thresholds, choose one of the following options from the dropdown list. These options determine how you set the upper and lower boundaries for acceptable values vs. values that must be clipped.

    • ClipPeaks

      When you clip values by peaks, you specify only an upper boundary. Values greater than that boundary value are replaced or removed.

    • ClipSubpeaks

      When you clip values by sub-peaks, you specify only a lower boundary. Values that are less than that boundary value are replaced or removed.

    • ClipPeaksAndSubpeaks

      When you clip values by peaks and sub-peaks, you can specify both the upper and lower boundaries. Values that are outside that range are replaced or removed. Values that match the boundary values are not changed.

  4. Depending on your selection int he preceding step, you can set the following threshold values:

    • Lower threshold: Displayed only if you choose ClipSubPeaks
    • Upper threshold: Displayed only if you choose ClipPeaks
    • Threshold: Displayed only if you choose ClipPeaksAndSubPeaks

    For each threshold type, choose either Constant or Percentile.

  5. If you select Constant, you must type hard-coded maximum or minimum values.

    For example, assuming you know the value 999 was used as a placeholder value, you could choose Constant for the upper threshold, and in Constant value of upper threshold, type 999. For the susbstitute value, you could specify that it be replaced with a null.

  6. If you choose Percentile, you must constrain the column values to a percentile range.

    For example, assume you want to keep only the values in the 10-80 percentile range, and replace all others with a mean. You would choose Percentile, and then type 10 for Percentile value of lower threshold, and type and 80 for Percentile value of upper threshold.

    See the section on percentiles for some examples of how to use percentile ranges.

  7. Define a substitute value.

    Numbers that exactly match the boundaries you just specified are considered to be inside the allowed range of values, and thus will not be replaced or removed. All numbers that fall outside the specified range are replaced with the substitute value.

    For example, Substitute value for peaks defines the value to substitute for all column values that are greater than the specified threshold. Substitute value for subpeaks defines the value to use as a substitute for all column values that are less than the specified threshold. If you use the ClipPeaksAndSubpeaks option, you can specify separate replacement values for the upper and lower clipped values.

    The following replacement values are supported:

    • Threshold. Replace clipped values with the specified threshold value.

    • Mean. Replace clipped values with the mean of the column values. The mean is computed before values are clipped.

    • Median. Replace clipped values with the median of the column values. The median is computed before values are clipped.

    • Missing. Replace clipped values with the missing (empty) value.

  8. Select the option, Add indicator columns, if you want to generate an indicator column that tells you whether or not the specified clipping operation applies to the data in that row. This option is particularly handy when you are testing a new set of clipping and substitution values.

  9. For Overwrite flag, indicate how you want the new values to be generated. By default, Clip Values constructs a new column with the peak values clipped to the desired threshold. New values overwrite the original column.

    To keep the original column and add a new column with the clipped values, deselect this option.

  10. Run the experiment.

    Right-click the output of the Clip Values module and select Visualize to review the values and make sure the clipping operation met your expectations.

This section walks through some common scenarios for replacing values.

Using Clip Values with Percentiles

To understand how clipping by percentiles works, consider a dataset with 10 rows, which have one instance each of the values 1-10.

If you are using percentile as the upper threshold, at the value for the 90th percentile, 90 percent of all values in the dataset must be less than that value.

If you are using percentile as the lower threshold, at the value for the 10th percentile, 10 percent of all values in the dataset must be less than that value.

  1. For Set of thresholds, choose ClipPeaksAndSubPeaks.

  2. For Upper threshold, choose Percentile, and for Percentile number, type 90.

  3. For Upper substitute value, choose Missing Value.

  4. For Lower threshold, choose Percentile, and for Percentile number, type 10.

  5. For Lower substitute value, choose Missing Value.

  6. Deselect the option Overwrite flag, and select the option, Add indicator column.

Now try the same experiment using 60 as the upper percentile threshold and 30 as the lower percentile threshold, and use the threshold value as the replacement value. The following table compares these two results:

  1. Replace with missing; Upper threshold = 90; Lower threshold = 10

  2. Replace with threshold; Upper percentile = 60; Lower percentile = 30

Original dataReplace with missingReplace with threshold
1

2

3

4

5

6

7

8

9

10
TRUE

TRUE

3, FALSE

4, FALSE

5, FALSE

6, FALSE

7, FALSE

8, FALSE

9, FALSE

TRUE
4, TRUE

4, TRUE

4, TRUE

4, TRUE

5, FALSE

6, FALSE

7, TRUE

7, TRUE

7, TRUE

7, TRUE
  • You can use Clip Values only on columns containing numbers or date/time values.

  • If you include columns that have text or categorical data, the columns will be skipped.

  • Missing values are ignored when the mean or median value is computed for a column.

  • Clip Values does not support ordinal data.

  • Missing values are not altered when they are propagated to the output dataset. The column indicating clipped values always contains FALSE for missing values.

NameTypeDescription
DatasetData TableInput dataset
NameRangeTypeDefaultDescription
Add indicator columnsTRUE/FALSEBooleanFALSEWhether to add indicator for clipping of a value is done
Constant value for lower thresholdanyFloat-1Value below which the subpeaks will be clipped
Constant value for upper thresholdanyFloat1Value above which the peaks will be clipped
Constant value of lower thresholdanyFloat-1Value below which the subpeaks are clipped
Constant value of upper threshold>=1Float1Value above which the peaks are clipped
List of columnsColumnSelectionList of columns to clip
Lower substitute valueThreshold

Mean

Median

Missing
SubstituteValuesThresholdThe value used for clipping subpeaks
Lower thresholdConstant

Percentile
Threshold ModeConstantValue below which the subpeaks will be clipped mode
Overwrite flagTRUE/FALSEBooleanTRUEWhether clipped data column(s) must overwrite input data column(s)
Percentile number for lower threshold[1;99]Integer1Percentile number below which the subpeaks will be clipped
Percentile number for upper threshold[1;99]Integer99Percentile number above which the peaks will be clipped
Percentile number of lower threshold[1;99]Integer1Percentile number below which the subpeaks are clipped
Percentile number of upper threshold[1;99]Integer99Percentile number above which the peaks are clipped
Set of thresholdsClipPeaks

ClipSubPeaks

ClipPeaksAndSubPeaks
Threshold SetClipPeaksSpecifies type of threshold to use
Substitute value for peaksThreshold

Mean

Median

Missing
SubstituteValuesThresholdThe value used during clipping peaks
Substitute value for subpeaksThreshold

Mean

Median

Missing
SubstituteValuesThresholdThe value used during clipping subpeaks
ThresholdConstant

Percentile
Threshold ModeConstantValue above and below which the peaks will be clipped mode
Upper substitute valueThreshold

Mean

Median

Missing
ThresholdThresholdThe value used for clipping peaks
Upper thresholdConstant

Percentile
Threshold ModeConstantValue above which the peaks will be clipped mode
NameTypeDescription
Results datasetData TableDataset with clipped columns

For a list of all module errors, see Module Error Codes.

ExceptionDescription
Error 0011Exception occurs if passed column set argument does not apply to any of dataset columns.
Error 0017Exception occurs if one or more specified columns have type unsupported by current module.

Scale and Reduce
A-Z Module List

Show: