Modify Count Table Parameters
Updated: August 25, 2016
Modifies the parameters used to create features from counts.
You can use the Modify Count Table Parameters module to change the way that features are generated from a count table.
In general, to create count-based features, you use Build Counting Transform to process a dataset and create a count table, and then generate a new set of features from that count table. However, if you have already created a count table, you can use the Modify Count Table Parameters module to change how the count data is processed. This lets you create a new set of count-based statistics from the existing counts, without having to re-analyze the dataset.
Locate the transformation you want to modify, in the Transforms group, and add it to your experiment.
You should have previously run an experiment that created a count transformation.
To modify a saved transform
Locate the transformation, in the Transforms group, and add it to your experiment.
To modify a count transformation created within the same experiment
If the transformation has not been saved, but is available as an output in the current experiment (for example, as the output of the Build Counting Transform module), you can use it directly by connecting the modules.
Add the Modify Count Table Parameters module and connect the transformation as an input.
In the Properties pane of the Modify Count Table Parameters module, type a value to use as the Garbage bin threshold.
This value specifies the minimum number of occurrences that must be found for each feature value, in order for counts to be used. If the frequency of the value is less than the garbage bin threshold, the value-label pair is not counted as a discrete item; instead, all items with counts lower than the threshold value are placed in a single "garbage bin".
Tip: If you are using a small dataset and you are counting and training on the same data, a good starting value is 1.
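The pooling rule for the garbage bin can be illustrated with a minimal Python sketch. It is not the module's implementation: the `apply_garbage_bin` function and the `__garbage_bin__` label are hypothetical names, and for simplicity the sketch counts values only rather than value-label pairs.

```python
from collections import Counter

# Hypothetical label for the pooled bucket; the module's internal name is not documented here.
GARBAGE_BIN = "__garbage_bin__"


def apply_garbage_bin(values, threshold):
    """Replace every value seen fewer than `threshold` times with one shared bin."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else GARBAGE_BIN for v in values]


cities = ["NYC", "NYC", "NYC", "Boston", "Boston", "Reno", "Yuma"]
binned = apply_garbage_bin(cities, threshold=2)
print(Counter(binned))
```

With a threshold of 2, the two cities that appear only once are pooled into the shared bin. With a threshold of 1, every value keeps its own count, which is consistent with the starting value suggested for small datasets.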
For Additional prior pseudo examples, type a number that indicates the number of additional pseudo examples to include.
You do not need to provide these examples; the pseudo examples are generated based on the prior distribution.
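One way to picture the effect of pseudo examples is additive smoothing toward a prior rate. The sketch below assumes that formula for illustration only; the source does not state the exact computation the module uses, and `smoothed_rate` and its parameter names are hypothetical.

```python
def smoothed_rate(positives, total, prior_rate, pseudo_examples):
    """Blend an observed rate with a prior, weighting the prior by `pseudo_examples`.

    Assumed formula (not confirmed by the module documentation):
        (positives + prior_rate * pseudo_examples) / (total + pseudo_examples)
    """
    denominator = total + pseudo_examples
    if denominator <= 0:
        raise ValueError("need at least one real or pseudo example")
    return (positives + prior_rate * pseudo_examples) / denominator


# A value seen once, with one positive label, and an overall positive rate of 10%.
print(smoothed_rate(positives=1, total=1, prior_rate=0.1, pseudo_examples=0))
print(smoothed_rate(positives=1, total=1, prior_rate=0.1, pseudo_examples=42))
```

Under this assumption, a larger number of pseudo examples pulls the estimate for rarely seen values closer to the prior, while values with many real observations are barely affected.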
For Laplacian noise scale, type a positive floating-point value that represents the scale used for introducing noise sampled from a Laplacian distribution. In other words, by setting a scale value, some acceptable level of noise is incorporated into the model, so the model is less likely to be affected by unseen values in data.
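As a rough illustration of what a noise scale means, the sketch below perturbs a small count table with noise drawn from a zero-centered Laplacian distribution. It uses only the Python standard library and relies on the fact that the difference of two independent exponential samples is Laplacian. The function name `add_laplace_noise` is hypothetical, and treating a scale of 0 as "no noise" is an assumption.

```python
import random


def add_laplace_noise(counts, scale, seed=None):
    """Return a copy of `counts` with Laplace(0, scale) noise added to each count."""
    if scale < 0:
        raise ValueError("scale must be non-negative")
    if scale == 0:
        # Assumption: a zero scale leaves the counts unchanged.
        return dict(counts)
    rng = random.Random(seed)
    rate = 1.0 / scale
    noisy = {}
    for key, count in counts.items():
        # Difference of two exponentials with the same rate is Laplace(0, 1/rate).
        noise = rng.expovariate(rate) - rng.expovariate(rate)
        noisy[key] = count + noise
    return noisy


table = {"NYC": 120, "Boston": 45}
print(add_laplace_noise(table, scale=1.0, seed=0))
```

A larger scale produces larger perturbations on average, so the downstream model depends less on the exact counts seen during counting.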
In Output features include, choose the method to use when creating count-based features for inclusion in the transformation.
CountsOnly. Create features using counts.
LogOddsOnly. Create features using the log of the odds ratio.
BothCountsAndLogOdds. Create features using both counts and log odds.
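The three options can be sketched for a two-class problem as follows. The option names come from the module; everything else is an assumption made for illustration, including the `count_features` function, the feature names, and the add-one smoothing used to keep the logarithm defined when a count is zero.

```python
import math

OUTPUT_OPTIONS = ("CountsOnly", "LogOddsOnly", "BothCountsAndLogOdds")


def count_features(positives, negatives, output="BothCountsAndLogOdds"):
    """Build count-based features for one categorical value in a two-class setting."""
    if output not in OUTPUT_OPTIONS:
        raise ValueError(f"unknown output option: {output}")
    features = {}
    if output in ("CountsOnly", "BothCountsAndLogOdds"):
        features["count_pos"] = positives
        features["count_neg"] = negatives
    if output in ("LogOddsOnly", "BothCountsAndLogOdds"):
        # Assumed add-one smoothing; the module's exact formula is not documented here.
        features["log_odds"] = math.log((positives + 1) / (negatives + 1))
    return features


for option in OUTPUT_OPTIONS:
    print(option, count_features(positives=3, negatives=1, output=option))
```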
Select the Ignore back off column option if you want to override the IsBackOff flag in the output when creating features.
When you select this option, count-based features will be created even if the column doesn’t have significant count values.
Run the experiment. You can then save the output of Modify Count Table Parameters as a new transformation, if desired.
You can see examples of how this module is used by exploring these sample experiments in the Model Gallery:
The Learning with Counts: Binary Classification sample demonstrates how to use the learning with counts modules to generate features from columns of categorical values for a binary classification model.
The Learning with Counts: Multiclass classification with NYC taxi data sample demonstrates how to use the learning with counts modules for performing multiclass classification on the publicly available NYC taxi dataset. The sample uses a multiclass logistic regression learner to model this problem.
The Learning with Counts: Binary classification with NYC taxi data sample demonstrates how to use the learning with counts modules for performing binary classification on the publicly available NYC taxi dataset. The sample uses a two-class logistic regression learner to model this problem.
It is statistically safe to count and train on the same data set if you set the Laplacian noise scale parameter.
| Name | Type | Description |
|---|---|---|
| Counting transform | | The counting transform to apply. |
| Name | ToHide | Type | Range | Optional | Default | Description |
|---|---|---|---|---|---|---|
| Garbage bin threshold | garbageBinThreshold | Float | >=0.0f | Required | 10.0f | The threshold under which a column value will be featurized against the garbage bin. |
| Additional prior pseudo examples | priorEx | Float | >=0.0f | Required | 42.0f | The additional pseudo examples following prior distributions to be included. |
| Laplacian noise scale | noiseScale | Float | >=0.0f | Required | 0.0f | The scale of the Laplacian distribution from which noise is sampled. |
| Output features include | outputFeatureInclude | OutputFeatureType | | Required | BothCountsAndLogOdds | The features to output. |
| Ignore back off column | ignoreBackOff | Boolean | | Required | false | Whether to ignore the IsBackOff column in the output. |
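For reference, the defaults and ranges listed for the module parameters can be captured in a small Python structure. This is only a sketch for keeping experiment settings in code: the `CountTableParameters` class and its field names are hypothetical and do not correspond to any published API.

```python
from dataclasses import dataclass

OUTPUT_FEATURE_TYPES = ("CountsOnly", "LogOddsOnly", "BothCountsAndLogOdds")


@dataclass(frozen=True)
class CountTableParameters:
    """Hypothetical container mirroring the documented defaults and ranges."""

    garbage_bin_threshold: float = 10.0
    prior_pseudo_examples: float = 42.0
    laplacian_noise_scale: float = 0.0
    output_features: str = "BothCountsAndLogOdds"
    ignore_back_off: bool = False

    def __post_init__(self):
        # All three numeric parameters are documented with the range >= 0.
        for name in ("garbage_bin_threshold", "prior_pseudo_examples", "laplacian_noise_scale"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be >= 0")
        if self.output_features not in OUTPUT_FEATURE_TYPES:
            raise ValueError(f"output_features must be one of {OUTPUT_FEATURE_TYPES}")


defaults = CountTableParameters()
small_data = CountTableParameters(garbage_bin_threshold=1.0, laplacian_noise_scale=1.0)
print(defaults)
print(small_data)
```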
| Name | Type | Description |
|---|---|---|
| Modified transform | | The modified transform. |
| Exception | Description |
|---|---|
| | Exception occurs if one or more inputs are null or empty. |
| | Exception occurs when a counting transform is invalid. |