Build Counting Transform

 

Updated: July 1, 2016

Creates a transformation that turns count tables into features, so that you can apply the transformation to multiple datasets.

You can use the Build Counting Transform module to analyze training data and build a count table as well as a set of count-based features that can be used in a predictive model.

A count table contains the joint distribution of all feature columns, given a specified label column. Such statistics are useful in determining which columns have the most information value. Count-based featurization is useful because such features are more compact than the original training data, yet capture the most useful information. You can use the module parameters to customize how the counts are transformed into the new set of count-based features.
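As an illustrative Python sketch (not the module's internal implementation, and with hypothetical helper names), a dictionary-style count table for one categorical column with a binary label, and the count-based features derived from it, might look like this:

```python
# Illustrative sketch (not the module's implementation): build a count
# table for one categorical column against a binary label, then derive
# count-based features from it. Helper names are hypothetical.
import math
from collections import defaultdict

def build_count_table(values, labels, num_classes=2):
    """Map each column value to its per-class label counts."""
    table = defaultdict(lambda: [0] * num_classes)
    for value, label in zip(values, labels):
        table[value][label] += 1
    return table

def featurize(value, table, num_classes=2, prior=1.0):
    """Replace a raw value with its class counts plus a smoothed log-odds."""
    counts = table.get(value, [0] * num_classes)
    total = sum(counts)
    # Laplace-smoothed log-odds of the positive class (binary case)
    log_odds = math.log((counts[1] + prior) / (total - counts[1] + prior))
    return counts + [log_odds]

values = ["NY", "NY", "SF", "NY", "SF", "LA"]
labels = [1, 0, 0, 1, 0, 0]
table = build_count_table(values, labels)
features = featurize("NY", table)   # per-class counts plus a log-odds value
```

High-cardinality columns benefit most from this kind of featurization: thousands of distinct values collapse into a handful of numeric features per column.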

After generating counts and transforming them into features, you can save the process as a transformation for re-use on related data. You can also modify the set of features without having to generate a new set of counts, or merge the counts and features with another set of counts and features.

The ability to re-use and re-apply count-based features is useful in scenarios such as these:

  • New data becomes available to improve the coverage or balance of your dataset.

  • Your original counts and features were based on a very large dataset that you don’t want to re-process. By merging the counts, you can update the feature set with new data without re-processing the original dataset.

  • You want to ensure that the same set of count-based features is applied to all datasets that you are using in your experiment.

You can create a count-based feature transformation directly from a dataset, and re-run it each time you run an experiment, or you can generate a set of counts, and then merge it with new data to create an updated count table.

  • Create count-based features from a dataset 

    Start here if you have not created counts before. You use the Build Counting Transform module to create count tables and automatically generate a set of features.

    This process creates a feature transformation that you can apply to a dataset, using the Apply Transformation module.

  • Merge counts and features from multiple datasets

    If you have already generated a count table from a previous dataset, generate counts on just the new data, or import an existing count table created in an earlier version of Azure Machine Learning. Then, merge the two sets of count tables.

    This process creates a new feature transformation that you can apply to a dataset, using the Apply Transformation module.

 

Create count-based features from a dataset

  1. Add the Build Counting Transform module to your experiment, and connect the dataset you want to use as the basis for your count-based features.

  2. Use the Number of classes option to specify the number of values in your label column.

    • For any binary classification problem, type 2.

    • For a classification problem with more than two possible outputs, you must specify in advance the exact number of classes to count. If you enter a number that is less than the actual number of classes, the module will return an error.

    • If your dataset contains multiple class values and the class label values are non-sequential, use Edit Metadata to specify that the column contains categorical values.

  3. For the option, The bits of hash function, indicate how many bits to use when hashing the values. It is generally safe to accept the default, unless you know that the selected columns contain many distinct values, in which case a higher bit count reduces hash collisions.

  4. In The seed of hash function, you can optionally specify a value to seed the hashing function. Specifying the seed ensures that hashing results are deterministic across runs of the same experiment.
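The bit count and seed together determine how raw column values are hashed into buckets before counting. The module's actual hash function is not documented here; the following Python sketch only illustrates the idea of a seeded hash with a range of 2^bits:

```python
# Illustrative seeded hashing into a table of size 2**bits.
# (Hypothetical sketch; the module's internal hash function may differ.)
import hashlib

def hash_value(value, bits=20, seed=1):
    """Deterministically map a column value to a bucket in [0, 2**bits)."""
    data = f"{seed}:{value}".encode("utf-8")
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)

# The same seed yields the same bucket across runs (deterministic),
# and more bits mean fewer collisions when there are many distinct values.
bucket = hash_value("category_A", bits=20, seed=1)
assert 0 <= bucket < 2**20
assert hash_value("category_A", bits=20, seed=1) == bucket
```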

  5. Use the Module type option to indicate the type of data that you will be counting, based on the storage mode:

    • Dataset    Choose this option if you will be counting data that is saved as a dataset in Azure Machine Learning Studio.

    • Blob   Choose this option if your source data used to build counts is stored as a block blob in Windows Azure storage.

    • MapReduce    Choose this option if you want to call Map/Reduce functions to process the data. To use this option, the new data must be provided as a blob in Windows Azure storage, and you must have access to a deployed HDInsight cluster. When you run the experiment, a Map/Reduce job will be launched in the cluster to perform the counting.

      For more information, see http://azure.microsoft.com/services/hdinsight/.

      Tip

      For very large datasets, we recommend that you use this option whenever possible. Although you might incur additional costs for using the HDInsight service, computation over large datasets might be faster in HDInsight.

  6. After specifying the data storage mode, provide any additional connection information for the data that is required:

    • If you are using data from Hadoop or blob storage, provide the cluster location and credentials.

    • If you previously used an Import Data module in the experiment to access the data, you must re-enter the account name and your credentials. The reason is that the Build Counting Transform module accesses the data storage separately in order to read the data and build the required tables.

  7. For Label column or index, select one column as the label column.

    A label column is required. Moreover, the column must already be marked as a label or an error will be raised.

  8. Use the option, Select columns to count, to select the columns for which to generate counts.

    In general, the best candidates are high-dimensional columns and any other columns that are correlated with those columns.

  9. Use the Count table type option to specify the format used for storing the count table.

    • Dictionary.    Creates a dictionary count table. All column values in the selected columns are treated as strings, and are hashed using a bit array of up to 31 bits in size. Therefore, all column values are represented by a non-negative 32-bit integer.

      After selecting this option, configure the number of bits used by the hashing function, and set a seed for initializing the hash function.

    • CMSketch.     Creates a count-min sketch table. With this option, multiple independent hash functions with a smaller range are used to improve memory efficiency and reduce the chance of hash collisions.

      The parameters for hashing bit size and hashing seed have no effect on this option.

    Tip

    In general, you should use the Dictionary option for smaller data sets (<1GB), and use the CMSketch option for larger datasets.

  10. Run the experiment.

    The module creates a featurization transform that you can use as input to the Apply Transformation module. The output of the Apply Transformation module is a transformed dataset that can be used to train a model.

    Optionally, you can save the transform if you want to merge the set of count-based features with another set of count-based features. For more information, see Merge Count Transform.
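The two count table formats chosen in step 9 differ mainly in how counts are stored. As an illustrative sketch (not the module's actual implementation), a count-min sketch keeps several small hashed count arrays instead of one exact dictionary:

```python
# Minimal count-min sketch (illustrative sketch, not the module's code):
# `depth` independent hash rows, each 2**width_bits wide; a query takes
# the minimum across rows, which upper-bounds the true count.
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width_bits=10):
        self.depth = depth
        self.width = 1 << width_bits
        self.rows = [[0] * self.width for _ in range(depth)]

    def _bucket(self, item, row):
        data = f"{row}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._bucket(item, r)] += count

    def query(self, item):
        return min(self.rows[r][self._bucket(item, r)] for r in range(self.depth))

cms = CountMinSketch(depth=4, width_bits=10)
for _ in range(5):
    cms.add("NY")
cms.add("SF", count=2)
# A query never undercounts; collisions can only inflate the estimate.
assert cms.query("NY") >= 5
assert cms.query("SF") >= 2
```

Compared with the Dictionary option, the sketch uses bounded memory regardless of how many distinct values appear, at the cost of occasional overestimates; this is why CMSketch is suggested for larger datasets.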

Merge counts and features from multiple datasets

  1. Add the Build Counting Transform module to your experiment, and connect the dataset that contains the new data you want to add.

  2. Use the Module type option to indicate the source of the new data. You can merge data from different sources.

    • Dataset    Choose this option if the new data is provided as a dataset in Azure Machine Learning Studio.

    • Blob   Choose this option if the new data is provided as a block blob in Windows Azure storage.

    • MapReduce    Choose this option if you want to call Map/Reduce functions to process the data. To use this option, the new data must be provided as a blob in Windows Azure storage, and you must have access to a deployed HDInsight cluster. When you run the experiment, a Map/Reduce job will be launched in the cluster to perform the counting.

      For more information, see http://azure.microsoft.com/services/hdinsight/.

  3. After specifying the data storage mode, provide any additional connection information that is required for the new data:

    • If you are using data from Hadoop or blob storage, provide the cluster location and credentials.

    • If you previously used an Import Data module in the experiment to access the data, you must re-enter the account name and your credentials. The reason is that the Build Counting Transform module accesses the data storage separately in order to read the data and build the required tables.

  4. When merging counts, the following options must be exactly the same in both count tables:

    • Number of classes

    • The bits of hash function

    • The seed of hash function

    • Select columns to count

    The label column can be different, as long as it contains the same number of classes.

  5. Use the Count table type option to specify the format and destination for the updated count table.

    Tip

    The format of the two count tables that you intend to merge must be the same. For example, if you saved an earlier count table using the Dictionary format, you cannot merge it with counts saved using the CMSketch format.

  6. Run the experiment.

    The module creates a featurization transform that you can use as input to the Apply Transformation module. The output of the Apply Transformation module is a transformed dataset that can be used to train a model.

  7. To merge this set of counts with an existing set of count-based features, see Merge Count Transform.
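Conceptually, merging two dictionary-style count tables amounts to adding their per-class counts value by value, after verifying that the build settings match. The following Python sketch is hypothetical (the helper and metadata field names are not part of the module):

```python
# Hypothetical sketch of merging two dictionary-style count tables.
# Merging is only meaningful when both tables were built with identical
# settings: number of classes, hash bits, hash seed, and counted columns.
def merge_count_tables(a, b, a_meta, b_meta):
    for key in ("num_classes", "hash_bits", "hash_seed", "columns"):
        if a_meta[key] != b_meta[key]:
            raise ValueError(f"Count tables differ on '{key}'; cannot merge")
    merged = {value: list(counts) for value, counts in a.items()}
    for value, counts in b.items():
        if value in merged:
            merged[value] = [x + y for x, y in zip(merged[value], counts)]
        else:
            merged[value] = list(counts)
    return merged

meta = {"num_classes": 2, "hash_bits": 20, "hash_seed": 1, "columns": ("city",)}
old_counts = {"NY": [10, 4], "SF": [3, 1]}   # per-class counts per value
new_counts = {"NY": [2, 1], "LA": [5, 0]}
merged = merge_count_tables(old_counts, new_counts, meta, meta)
# "NY" becomes [12, 5]; "SF" and "LA" are carried over unchanged
```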

See these articles for more information about the counts algorithm and the efficacy of count-based modeling compared to other methods.

The following experiments in the Cortana Intelligence Gallery demonstrate how to use count-based learning to build various predictive models:

 

Module parameters

| Name | Internal name | Type | Range | Required / visible when | Default | Description |
| --- | --- | --- | --- | --- | --- | --- |
| Number of classes | commonNumClasses | Integer | >=2 | Required | 2 | The number of classes for the label. |
| The bits of hash function | commonHashBits | Integer | [12;31] | Required | 20 | The number of bits of the range of the hash function. |
| The seed of hash function | commonHashSeed | Integer | | Required | 1 | The seed for the hash function. |
| Module type | countingType | CountingType | | Required | Dataset | The type of build counting module. |
| Blob name | azureBlobName | String | | countingType:Blob | | The name of the input blob. Do not include the container name. |
| Account name | azureStorageAccountName | String | | countingType:Blob | | The name of the storage account. |
| Account key | azureStorageAccountKey | SecureString | | countingType:Blob | | The key of the storage account. |
| Container name | azureContainerName | String | | countingType:Blob | | The Azure blob container that contains the input blob. |
| Count columns | azureCols | String | | countingType:Blob | | The one-based indexes of groups of columns to perform counting on. |
| Label column | azureLabelCol | Integer | >=1 | countingType:Blob | 1 | The one-based index of the label column. |
| Blob format | azureBlobFormat | DraculaBlobFormat | | countingType:Blob | CSV | The blob text file format. |
| Count table type | azureCountTableType | CountTableType | | countingType:Blob | Dictionary | The type of the count table. |
| Depth of CM sketch table | azureDepth | Integer | >=1 | azureCountTableType:CMSketch | 4 | The depth of the CM sketch table, which equals the number of hash functions. |
| Width of CM sketch table | azureWidth | Integer | [1;31] | azureCountTableType:CMSketch | 20 | The width of the CM sketch table, which is the number of bits of the range of the hash function. |
| Label column index or name | dataLabelCol | ColumnSelection | | countingType:Dataset | | Selects the label column. |
| Select columns to count | dataColSelection | ColumnSelection | | countingType:Dataset | | Selects the columns for counting. These columns are treated as categorical features. |
| Count table type | dataCountTableType | CountTableType | | countingType:Dataset | Dictionary | Specifies the type of the count table. |
| Depth of CM sketch table | dataDepth | Integer | >=1 | dataCountTableType:CMSketch | 4 | The depth of the CM sketch table, which equals the number of hash functions. |
| Width of CM sketch table | dataWidth | Integer | [1;31] | dataCountTableType:CMSketch | 20 | The width of the CM sketch table, which is the number of bits of the range of the hash function. |
| Default storage account name | hadoopStorageAccountName | String | | countingType:MapReduce | | The name of the storage account containing the input blob. |
| Default storage account key | hadoopStorageAccountKey | SecureString | | countingType:MapReduce | | The key of the storage account containing the input blob. |
| Default container name | hadoopContainerName | String | | countingType:MapReduce | | The name of the blob container to write the count table to. |
| Cluster URI | hadoopHdInsightClusterUri | String | | countingType:MapReduce | | The URI of the HDInsight Hadoop cluster. |
| Username | hadoopClusterUsername | String | | countingType:MapReduce | | The username used to log in to the HDInsight Hadoop cluster. |
| Password | hadoopClusterPassword | SecureString | | countingType:MapReduce | | The password used to log in to the HDInsight Hadoop cluster. |
| Number of reducers | hadoopNumberOfReducers | Integer | >=1 | countingType:MapReduce | 10 | The number of reducers to deploy. |
| Input data container | hadoopInputContainer | HadoopInputContainer | | countingType:MapReduce | DefaultContainer | The container of the input blob. |
| The URI to the input blob container | hadoopInputContainerUri | String | | hadoopInputContainer:ExternalContainer | | The URI of the input blob container. The container should have public read access. |
| Input blob name | hadoopInput | String | | countingType:MapReduce | | The input blob name for MapReduce counting. |
| Output blob path | hadoopOutput | String | | countingType:MapReduce | | The output blob path for the count table. |
| Count columns | hadoopCols | String | | countingType:MapReduce | | The one-based indexes of groups of columns to count. |
| Label column | hadoopLabelCol | Integer | >=1 | countingType:MapReduce | 1 | The one-based index of the label column. |
| Blob format | hadoopBlobFormat | DraculaBlobFormat | | countingType:MapReduce | CSV | The blob file format. |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Counting transform | ITransform interface | The counting transform. |

Exceptions

| Exception | Description |
| --- | --- |
| Error 0003 | Exception occurs if one or more inputs are null or empty. |
| Error 0004 | Exception occurs if a parameter is less than or equal to a specific value. |
| Error 0005 | Exception occurs if a parameter is less than a specific value. |
| Error 0007 | Exception occurs if a parameter is greater than a specific value. |
| Error 0009 | Exception occurs if the Azure storage account name or container name is specified incorrectly. |
| Error 0065 | Exception occurs if the Azure blob name is specified incorrectly. |
| Error 0011 | Exception occurs if the passed column set argument does not apply to any of the dataset columns. |
| Error 0049 | Exception occurs when it is not possible to parse a file. |
| Error 1000 | Internal library exception. |
| Error 0059 | Exception occurs if a column index specified in a column picker cannot be parsed. |
| Error 0060 | Exception occurs when an out-of-range column range is specified in a column picker. |
| Error 0089 | Exception occurs when the specified number of classes is less than the actual number of classes in a dataset used for counting. |
