Clip Values

 

Updated: June 21, 2016

Detects outliers and clips or replaces their values

You can use the Clip Values module to identify and optionally replace data values that are above or below a specified threshold. This is useful when you want to remove outliers or replace them with a mean, or threshold value.

There are multiple ways to identify values for clipping: you can set a value to use as the upper or lower boundary, or you can specify a percentile range. All values outside the range are replaced or removed using the method you specify.

 

  1. Add the Clip Values module to your experiment and connect it to the dataset you want to modify.

    If you want to modify only a few columns, you can use Select Columns in Dataset to choose a subset of columns.

  2. In List of columns, use the Column Selector to choose the columns to which Clip Values will be applied.

    Note that you can apply only one kind of transformation at a time. Therefore, you might need to use multiple instances of Clip Values to adjust different types of numeric data.

  3. Indicate the method that you want to apply when identifying target values.

    • ClipPeaks 

      If you clip values by peaks, you specify only an upper boundary. Values greater than that boundary value are replaced or removed.

    • ClipSubpeaks 

      If you clip only by sub-peaks, you specify only a lower boundary. Values that are less than that boundary value are replaced or removed.

    • ClipPeaksAndSubpeaks 

      If you clip values by peaks and sub-peaks, you can specify both the upper and lower boundaries. Values that are outside that range are replaced or removed. Values that match the boundary values are not changed.

  4. Specify the value used to define the threshold or boundary. Values can be defined as constants or as percentiles:

    • Constant . Choose this option to specify a single, hard-coded maximum or minimum value.

      For example, you might know that 999 has been used as a placeholder value and clip that value, replacing it with a null.

    • Percentile . Choose this option to constrain the column values to a percentile range. Then, type the upper or lower percentile range values as required. See the section on percentiles for some examples of how these values work.

      For example, you might want to keep only the values in the 10-80 percentile range, and replace all others with a mean.

      System_CAPS_tipTip

      See the section on percentiles for some examples of how to use percentile ranges.

    Depending on the clipping option you selected, there are multiple options for defining an upper boundary, lower boundary, or both.

    • Upper threshold. Select the type of value to use for upper boundary: a constant, or percentile. Then, set the value:

      • Constant value of upper threshold. Type the exact value for the upper boundary.

      • Percentile number of upper threshold. Type an integer value representing the upper percentile boundary.

    • Lower threshold. Select the type of value to use for lower boundary: a constant, or percentile. Then, set the value:

      • Constant value of lower threshold. Type the exact value for the upper boundary.

      • Percentile value for lower threshold. Type an integer value representing the lower percentile boundary.

    System_CAPS_tipTip

    Values that exactly match the boundary value are considered in the range and will not be replaced or removed.

  5. Specify a replacement value that will be used for values outside the specified range.

    •  Substitute value for peaks. Provide the value to substitute for all column values that are greater than the specified threshold.

    •  Substitute value for subpeaks. Provide the value to substitute for all column values that are less than the specified threshold.

    If you use the ClipPeaksAndSubpeaks option, you can specify separate replacement values for the upper and lower clipped values.

    • Lower substitute value. Provide the value to substitute for all column values that are less than the specified threshold.

    •  Upper substitute value. Provide the value to substitute for all column values that are greater than the specified threshold. You can use any of these options for the substitute value:

    The following types of replacement value are supported:

    • Threshold. Replace clipped values with the specified threshold value.

    • Mean. Replace clipped values with the mean of the column values. The mean is computed before values are clipped.

    • Median. Replace clipped values with the median of the column values before clipping.

    • Missing. Replace clipped values with the missing (empty) value.

  6. Select the option, Add indicator columns, if you want to generate an indicator column that tells you whether or not a clipping operation would be applied to the data in each row. This is particularly handy when you are testing a new set of clipping and substitution values.

  7. For Overwrite flag, indicate how you want the new values to be generated. By default, Clip Values constructs a new column with the peak values clipped to the desired threshold. New values overwrite the original column.

    To keep the original column and add a new column with the clipped values, deselect this option.

  8. Run the experiment.

    Right-click the output of the Clip Values module and select Visualize to review the values and make sure the clipping operation met your expectations.

 

To understand how clipping by percentiles works, consider a dataset with 10 rows, which have one instance each of the values 1-10.

If you are using percentile as the upper threshold, at the value for the 90th percentile, 90 percent of all values in the dataset must be less than that value.

If you are using percentile as the lower threshold, at the value for the 10th percentile, 10 percent of all values in the dataset must be less than that value.

  1. For Set of thresholds, choose ClipPeaksAndSubPeaks.

  2. For Upper threshold, choose Percentile, and for Percentile number, type 90.

  3. For Upper substitute value, choose Missing Value.

  4. For Lower threshold, choose Percentile, and for Percentile number, type 10.

  5. For Lower substitute value, choose Missing Value.

  6. Deselect the option Overwrite flag, and select the option, Add indicator column.

Now try the same experiment using 60 as the upper percentile threshold and 30 as the lower percentile threshold, and use the threshold value as the replacement value. The following table compares these two results:

  1. Replace with missing; Upper threshold = 90; Lower threshold = 10

  2. Replace with threshold; Upper percentile = 60; Lower percentile = 30

Original data

Replace with missing

Replace with threshold

1

2

3

4

5

6

7

8

9

10

 

TRUE

 

TRUE

3

FALSE

4

FALSE

5

FALSE

6

FALSE

7

FALSE

8

FALSE

9

FALSE

 

TRUE

4

TRUE

4

TRUE

4

TRUE

4

TRUE

5

FALSE

6

FALSE

7

TRUE

7

TRUE

7

TRUE

7

TRUE

  • You can use Clip Values only on columns containing numbers or date/time values.

  • If you include columns that have text or categorical data, the columns will be skipped.

  • Missing values are ignored when the mean or median value is computed for a column.

  • Clip Values does not support ordinal data.

  • Missing values are not altered when they are propagated to the output dataset. The column indicating clipped values always contains FALSE for missing values.

Name

Type

Description

Dataset

Data Table

Input dataset

Name

Range

Type

Default

Description

Add indicator columns

TRUE/FALSE

Boolean

FALSE

Whether to add indicator for clipping of a value is done

Constant value for lower threshold

any

Float

-1

Value below which the subpeaks will be clipped

Constant value for upper threshold

any

Float

1

Value above which the peaks will be clipped

Constant value of lower threshold

any

Float

-1

Value below which the subpeaks are clipped

Constant value of upper threshold

>=1

Float

1

Value above which the peaks are clipped

List of columns

 

ColumnSelection

 

List of columns to clip

Lower substitute value

Threshold

Mean

Median

Missing

SubstituteValues

Threshold

The value used for clipping subpeaks

Lower threshold

Constant

Percentile

Threshold Mode

Constant

Value below which the subpeaks will be clipped mode

Overwrite flag

TRUE/FALSE

Boolean

TRUE

Whether clipped data column(s) must overwrite input data column(s)

Percentile number for lower threshold

[1;99]

Integer

1

Percentile number below which the subpeaks will be clipped

Percentile number for upper threshold

[1;99]

Integer

99

Percentile number above which the peaks will be clipped

Percentile number of lower threshold

[1;99]

Integer

1

Percentile number below which the subpeaks are clipped

Percentile number of upper threshold

[1;99]

Integer

99

Percentile number above which the peaks are clipped

Set of thresholds

ClipPeaks

ClipSubPeaks

ClipPeaksAndSubPeaks

Threshold Set

ClipPeaks

Specifies type of threshold to use

Substitute value for peaks

Threshold

Mean

Median

Missing

SubstituteValues

Threshold

The value used during clipping peaks

Substitute value for subpeaks

Threshold

Mean

Median

Missing

SubstituteValues

Threshold

The value used during clipping subpeaks

Threshold

Constant

Percentile

Threshold Mode

Constant

Value above and below which the peaks will be clipped mode

Upper substitute value

Threshold

Mean

Median

Missing

Threshold

Threshold

The value used for clipping peaks

Upper threshold

Constant

Percentile

Threshold Mode

Constant

Value above which the peaks will be clipped mode

Name

Type

Description

Results dataset

Data Table

Dataset with clipped columns

For a list of all module errors, see Machine Learning Module Error Codes.

Exception

Description

Error 0011

Exception occurs if passed column set argument does not apply to any of dataset columns.

Error 0017

Exception occurs if one or more specified columns have type unsupported by current module.

Show: