Feature Hashing

 

Updated: June 12, 2017

Converts text data to integer-encoded features by using the Vowpal Wabbit library

Category: Text Analytics

You can use the Feature Hashing module to transform a stream of English text into a set of features represented as integers. You can then pass this hashed feature set to a machine learning algorithm to train a text analysis model.

The feature hashing functionality provided in this module is based on the Vowpal Wabbit framework. For more information, see Train Vowpal Wabbit 7-4 Model or Train Vowpal Wabbit 7-10 Model.

How Feature Hashing Works

For example, suppose you have a set of sentences like these, each with some text and a sentiment score that you want to use in building a model.

USERTEXT               SENTIMENT
I loved this book      3
I hated this book      1
This book was great    3
I love books           2

Internally, the Feature Hashing module creates a dictionary of n-grams. You can control the size of the n-grams by using the N-grams property. For example, the list of bigrams for this dataset would be something like this:

TERM (bigrams)    FREQUENCY
This book         3
I loved           1
I hated           1
I love            1

Note that if you choose bigrams, unigrams are also computed. Thus, if you do not perform any lexical analysis (such as stemming or truncation), the dictionary would also include single terms like these:

TERM (unigrams)   FREQUENCY
book              3
I                 3
books             1
was               1
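
The following sketch (plain Python, not the module's internal implementation) shows how such a combined unigram and bigram dictionary could be counted from the sample sentences; the whitespace tokenization and case normalization here are simplified assumptions.

    from collections import Counter

    sentences = [
        "I loved this book",
        "I hated this book",
        "This book was great",
        "I love books",
    ]

    def ngram_counts(texts, max_n=2):
        """Count all n-grams from length 1 up to max_n across the texts."""
        counts = Counter()
        for text in texts:
            tokens = text.lower().split()   # simple whitespace tokenization, case-normalized
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    counts[" ".join(tokens[i:i + n])] += 1
        return counts

    print(ngram_counts(sentences).most_common())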

After the dictionary has been built, the Feature Hashing module converts all of the dictionary terms into hash values, and computes whether a feature was used in each case. For each row of text data you provide as input, the module outputs a set of columns, one column for each hashed feature.

For example, after hashing, the feature columns might look something like this:

Rating    Hashing feature 1    Hashing feature 2    Hashing feature 3
4         1                    1                    0
5         0                    0                    0

  • If the value in the column is 0, the row did not contain the hashed feature.
  • If the value is 1, the row did contain the feature.

In contrast, if you tried to use the text column for training as is, it would be treated as a categorical feature column, with many, many distinct values.

The advantage of using feature hashing is that you can represent variable-length text documents as equal-length numeric feature vectors, and achieve dimensionality reduction. Having numeric outputs also makes it possible to use many different machine learning methods with the data, including classification, clustering, and information retrieval. Because it replaces string comparison operations with hash lookups, feature hashing also makes the lookup of feature weights much faster.
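
A minimal sketch of that conversion follows; it uses Python's built-in hash function and a tiny 4-bit table as stand-ins for the module's MurmurHash3-based hashing and configurable bit size, so the indexes are illustrative only. The point is that every row, however long its text, becomes a 0/1 vector of the same fixed length.

    def hash_features(text, num_bits=4, max_n=2):
        """Map one row of text to a fixed-length 0/1 vector of hashed n-gram features."""
        size = 2 ** num_bits                 # number of hashed feature columns
        vector = [0] * size
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])
                index = hash(ngram) % size   # built-in hash() as a stand-in for MurmurHash3
                vector[index] = 1            # 1 = this row contains the hashed feature
        return vector

    print(hash_features("I loved this book"))
    print(hash_features("This book was great"))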

In addition to using feature hashing, you might want to use other methods to extract features from text. For example:

  • Use the Preprocess Text module to remove artifacts such as spelling errors, or to simplify the text before hashing.
  • Use Extract Key Phrases to use natural language processing to extract phrases.
  • Use Named Entity Recognition to identify important entities.

Azure Machine Learning Studio provides a Text Classification template that guides you through using the Feature Hashing module for feature extraction.

  1. Add the Feature Hashing module to your experiment and connect the dataset that contains the text you want to analyze.

  2. For Target columns, select those text columns that you want to convert to hashed features.

    • The columns must be of the string data type and must be marked as feature columns.

    • Choosing multiple text columns as inputs can have a huge effect on feature dimensionality. For example, if a 10-bit hash is used for a single text column, the output contains 1024 columns; if a 10-bit hash is used for two text columns, the output contains 2048 columns. (A quick calculation appears after these steps.)

    Note

    By default, Studio will mark most text columns as features, so if you select all text columns, you might get too many columns, including many that are not actually free text. You can use the Clear feature option in Edit Metadata to keep other text columns from being hashed.

  3. Use Hashing bitsize to specify the number of bits to use when creating the hash table.

    The default bit size is 10. For many problems, this value is more than adequate, but whether it is enough for your data depends on the size of the n-gram vocabulary in the training text. With a large vocabulary, more space might be needed to avoid collisions.

    We recommend that you try using a different number of bits for this parameter, and evaluate the performance of the machine learning solution.

  4. For N-grams, type a number that defines the maximum length of the n-grams to add to the training dictionary. An n-gram is a sequence of n words, treated as a unique unit.

    • N-grams = 1: Unigrams, or single words, only.

    • N-grams = 2: Bigrams and unigrams are created.

    • N-grams = 3: Trigrams, bigrams, and unigrams are created.

  5. Run the experiment.
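
The dimensionality noted in step 2 follows directly from the bit size: each hashed text column expands to 2^k feature columns, where k is the Hashing bitsize. A quick back-of-the-envelope check:

    def hashed_output_columns(hashing_bitsize, num_text_columns):
        """Each hashed text column expands to 2**hashing_bitsize feature columns."""
        return num_text_columns * 2 ** hashing_bitsize

    print(hashed_output_columns(10, 1))   # 1024 columns for one text column
    print(hashed_output_columns(10, 2))   # 2048 columns for two text columns
    print(hashed_output_columns(14, 1))   # 16384 columns for one text column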

Results

The output is a transformed dataset in which the original text column has been converted to multiple columns, each representing a feature in the text. Depending on how big the dictionary is, the resulting dataset can be extremely large:

Column name                        Column type
USERTEXT                           Original data column
SENTIMENT                          Original data column
USERTEXT - Hashing feature 1       Hashed feature column
USERTEXT - Hashing feature 2       Hashed feature column
USERTEXT - Hashing feature n       Hashed feature column
USERTEXT - Hashing feature 1024    Hashed feature column

After you have created the transformed dataset, you can use it as the input to the Train Model module, together with a good classification model, such as Two-Class Support Vector Machine.

Best Practices

Some best practices that you can use while modeling text data are demonstrated in the following diagram, which represents an experiment:

[Diagram: experiment workflow for feature hashing]

  • You might need to add an Execute R Script module before using Feature Hashing, in order to preprocess the input text. With R script, you also have the flexibility to use custom vocabularies or custom transformations.

  • You should add a Select Columns in Dataset module after the Feature Hashing module to remove the text columns from the output data set. You do not need the text columns after the hashing features have been generated.

    Alternatively, you can use the Edit Metadata module to clear the feature attribute from the text column.

  • Text processing options that you should consider using are word breaking, stop word removal, case normalization, removal of special characters, and stemming. Other types of text preprocessing might also improve accuracy.

    However, the optimal set of preprocessing methods to apply in any individual solution depends on domain, vocabulary, and business need. We recommend that you experiment with your data to see which custom text processing methods are most effective.

For examples of how feature hashing is used for text analytics, see these samples in the Model Gallery:

  • The News Categorization sample uses feature hashing to classify articles into a predefined list of categories.

  • The Similar Companies sample uses the text of Wikipedia articles to categorize companies.

  • In the five-part Text Classification template, text from Twitter messages is used to perform sentiment analysis.

The Feature Hashing module uses a fast machine learning framework called Vowpal Wabbit that hashes feature words into in-memory indexes, using a popular open-source hash function called murmurhash3. This hash function is a non-cryptographic hashing algorithm that maps text inputs to integers, and is popular because it produces a good random distribution of keys. Unlike cryptographic hash functions, it can be easily reversed by an adversary, so it is unsuitable for cryptographic purposes.

The purpose of hashing is to convert variable-length text documents into equal-length numeric feature vectors, to support dimensionality reduction and make the lookup of feature weights faster.

Each hashing feature represents one or more n-gram text features (unigrams or individual words, bigrams, trigrams, and so on), depending on the number of bits (represented as k) and on the number of n-grams specified as parameters. Feature names are projected to the machine's unsigned word size by using the murmurhash v3 (32-bit only) algorithm, and the result is then AND-ed with (2^k)-1. That is, the hashed value is projected down to the k lower-order bits, and the remaining bits are zeroed out. If the specified number of bits is 14, the hashed indexes range from 0 to 2^14 - 1 (16,383), so the hash table can hold 16,384 entries.
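
The index computation described above can be sketched as follows, assuming the third-party mmh3 package as a stand-in for the 32-bit murmurhash v3 implementation; the seed and any feature-name preprocessing that the module applies are not documented here, so the resulting indexes are illustrative only.

    import mmh3  # third-party MurmurHash3 bindings: pip install mmh3

    def feature_index(feature_name, num_bits=14, seed=0):
        """Hash a feature name with 32-bit MurmurHash3 and keep the k lower-order bits."""
        h = mmh3.hash(feature_name, seed, signed=False)  # unsigned 32-bit hash value
        return h & ((1 << num_bits) - 1)                 # AND with 2**k - 1 keeps bits 0..k-1

    print(feature_index("this book"))   # an index between 0 and 16383
    print(feature_index("i loved"))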

For many problems, the default hash table (bitsize = 10) is more than adequate; however, depending on the size of the n-grams vocabulary in the training text, more space might be needed to avoid collisions. We recommend that you try using a different number of bits for the Hashing bitsize parameter, and evaluate the performance of the machine learning solution.

Expected Inputs

Name       Type          Description
Dataset    Data Table    Input dataset

Module Parameters

Name               Range     Type               Default          Description
Target columns     Any       ColumnSelection    StringFeature    Choose the columns to which hashing will be applied.
Hashing bitsize    [1;31]    Integer            10               Type the number of bits to use when hashing the selected columns.
N-grams            [0;10]    Integer            2                Specify the number of N-grams generated during hashing. By default, both unigrams and bigrams are extracted.

Outputs

Name                   Type          Description
Transformed dataset    Data Table    Output dataset with hashed columns

Exceptions

Exception     Description
Error 0001    Exception occurs if one or more specified columns of the data set couldn't be found.
Error 0003    Exception occurs if one or more of the inputs are null or empty.
Error 0004    Exception occurs if a parameter is less than or equal to a specific value.
Error 0017    Exception occurs if one or more specified columns have a type unsupported by the current module.

See Also

Text Analytics

A-Z Module List
