Clip Values
Updated: June 21, 2016
Detects outliers and clips or replaces their values
Category: Data Transformation / Scale and Reduce
You can use the Clip Values module to identify and optionally replace data values that are above or below a specified threshold. This is useful when you want to remove outliers or replace them with a mean, or threshold value.
There are multiple ways to identify values for clipping: you can set a value to use as the upper or lower boundary, or you can specify a percentile range. All values outside the range are replaced or removed using the method you specify.
Add the Clip Values module to your experiment and connect it to the dataset you want to modify.
If you want to modify only a few columns, you can use Select Columns in Dataset to choose a subset of columns.
In List of columns, use the Column Selector to choose the columns to which Clip Values will be applied.
Note that you can apply only one kind of transformation at a time. Therefore, you might need to use multiple instances of Clip Values to adjust different types of numeric data.
Indicate the method that you want to apply when identifying target values.
ClipPeaks
If you clip values by peaks, you specify only an upper boundary. Values greater than that boundary value are replaced or removed.
ClipSubpeaks
If you clip only by sub-peaks, you specify only a lower boundary. Values that are less than that boundary value are replaced or removed.
ClipPeaksAndSubpeaks
If you clip values by peaks and sub-peaks, you can specify both the upper and lower boundaries. Values that are outside that range are replaced or removed. Values that match the boundary values are not changed.
Specify the value used to define the threshold or boundary. Values can be defined as constants or as percentiles:
Constant . Choose this option to specify a single, hard-coded maximum or minimum value.
For example, you might know that 999 has been used as a placeholder value and clip that value, replacing it with a null.
Percentile . Choose this option to constrain the column values to a percentile range. Then, type the upper or lower percentile range values as required. See the section on percentiles for some examples of how these values work.
For example, you might want to keep only the values in the 10-80 percentile range, and replace all others with a mean.
Tip See the section on percentiles for some examples of how to use percentile ranges.
Depending on the clipping option you selected, there are multiple options for defining an upper boundary, lower boundary, or both.
Upper threshold. Select the type of value to use for upper boundary: a constant, or percentile. Then, set the value:
Constant value of upper threshold. Type the exact value for the upper boundary.
Percentile number of upper threshold. Type an integer value representing the upper percentile boundary.
Lower threshold. Select the type of value to use for lower boundary: a constant, or percentile. Then, set the value:
Constant value of lower threshold. Type the exact value for the upper boundary.
Percentile value for lower threshold. Type an integer value representing the lower percentile boundary.
Tip Values that exactly match the boundary value are considered in the range and will not be replaced or removed.
Specify a replacement value that will be used for values outside the specified range.
Substitute value for peaks. Provide the value to substitute for all column values that are greater than the specified threshold.
Substitute value for subpeaks. Provide the value to substitute for all column values that are less than the specified threshold.
If you use the ClipPeaksAndSubpeaks option, you can specify separate replacement values for the upper and lower clipped values.
Lower substitute value. Provide the value to substitute for all column values that are less than the specified threshold.
Upper substitute value. Provide the value to substitute for all column values that are greater than the specified threshold. You can use any of these options for the substitute value:
The following types of replacement value are supported:
Threshold. Replace clipped values with the specified threshold value.
Mean. Replace clipped values with the mean of the column values. The mean is computed before values are clipped.
Median. Replace clipped values with the median of the column values before clipping.
Missing. Replace clipped values with the missing (empty) value.
Select the option, Add indicator columns, if you want to generate an indicator column that tells you whether or not a clipping operation would be applied to the data in each row. This is particularly handy when you are testing a new set of clipping and substitution values.
For Overwrite flag, indicate how you want the new values to be generated. By default, Clip Values constructs a new column with the peak values clipped to the desired threshold. New values overwrite the original column.
To keep the original column and add a new column with the clipped values, deselect this option.
Run the experiment.
Right-click the output of the Clip Values module and select Visualize to review the values and make sure the clipping operation met your expectations.
To understand how clipping by percentiles works, consider a dataset with 10 rows, which have one instance each of the values 1-10.
If you are using percentile as the upper threshold, at the value for the 90th percentile, 90 percent of all values in the dataset must be less than that value.
If you are using percentile as the lower threshold, at the value for the 10th percentile, 10 percent of all values in the dataset must be less than that value.
For Set of thresholds, choose ClipPeaksAndSubPeaks.
For Upper threshold, choose Percentile, and for Percentile number, type 90.
For Upper substitute value, choose Missing Value.
For Lower threshold, choose Percentile, and for Percentile number, type 10.
For Lower substitute value, choose Missing Value.
Deselect the option Overwrite flag, and select the option, Add indicator column.
Now try the same experiment using 60 as the upper percentile threshold and 30 as the lower percentile threshold, and use the threshold value as the replacement value. The following table compares these two results:
Replace with missing; Upper threshold = 90; Lower threshold = 10
Replace with threshold; Upper percentile = 60; Lower percentile = 30
Original data | Replace with missing | Replace with threshold | ||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
You can use Clip Values only on columns containing numbers or date/time values.
If you include columns that have text or categorical data, the columns will be skipped.
Missing values are ignored when the mean or median value is computed for a column.
Clip Values does not support ordinal data.
Missing values are not altered when they are propagated to the output dataset. The column indicating clipped values always contains FALSE for missing values.
Name | Type | Description |
|---|---|---|
Dataset | Input dataset |
Name | Range | Type | Default | Description |
|---|---|---|---|---|
Add indicator columns | TRUE/FALSE | Boolean | FALSE | Whether to add indicator for clipping of a value is done |
Constant value for lower threshold | any | Float | -1 | Value below which the subpeaks will be clipped |
Constant value for upper threshold | any | Float | 1 | Value above which the peaks will be clipped |
Constant value of lower threshold | any | Float | -1 | Value below which the subpeaks are clipped |
Constant value of upper threshold | >=1 | Float | 1 | Value above which the peaks are clipped |
List of columns |
| ColumnSelection |
| List of columns to clip |
Lower substitute value | Threshold Mean Median Missing | SubstituteValues | Threshold | The value used for clipping subpeaks |
Lower threshold | Constant Percentile | Threshold Mode | Constant | Value below which the subpeaks will be clipped mode |
Overwrite flag | TRUE/FALSE | Boolean | TRUE | Whether clipped data column(s) must overwrite input data column(s) |
Percentile number for lower threshold | [1;99] | Integer | 1 | Percentile number below which the subpeaks will be clipped |
Percentile number for upper threshold | [1;99] | Integer | 99 | Percentile number above which the peaks will be clipped |
Percentile number of lower threshold | [1;99] | Integer | 1 | Percentile number below which the subpeaks are clipped |
Percentile number of upper threshold | [1;99] | Integer | 99 | Percentile number above which the peaks are clipped |
Set of thresholds | ClipPeaks ClipSubPeaks ClipPeaksAndSubPeaks | Threshold Set | ClipPeaks | Specifies type of threshold to use |
Substitute value for peaks | Threshold Mean Median Missing | SubstituteValues | Threshold | The value used during clipping peaks |
Substitute value for subpeaks | Threshold Mean Median Missing | SubstituteValues | Threshold | The value used during clipping subpeaks |
Threshold | Constant Percentile | Threshold Mode | Constant | Value above and below which the peaks will be clipped mode |
Upper substitute value | Threshold Mean Median Missing | Threshold | Threshold | The value used for clipping peaks |
Upper threshold | Constant Percentile | Threshold Mode | Constant | Value above which the peaks will be clipped mode |
Name | Type | Description |
|---|---|---|
Results dataset | Dataset with clipped columns |
For a list of all module errors, see Machine Learning Module Error Codes.
Exception | Description |
|---|---|
Exception occurs if passed column set argument does not apply to any of dataset columns. | |
Exception occurs if one or more specified columns have type unsupported by current module. |