Latent Dirichlet Allocation

Use the Vowpal Wabbit library to classify text using latent Dirichlet allocation.

Category: Data Transformation / Filter

Module Overview

You can use the Latent Dirichlet Allocation module to group text into a number of categories.

Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to group previously unclassified sets of observations by similarities. LDA is a generative model, not a classification model, meaning that you don’t start with known labels and then infer the patterns that create the group labels, but rather generate a probabilistic topic model that you can use to classify both existing and new instances. A generative model is useful because it avoids making any strong assumptions about the relationship between the text and categories. Instead, it uses a distribution of words to mathematically model topics.

To use this module, you pass in a dataset that contains a column of text, either raw or preprocessed, and indicate how many categories you want to extract from the text. You can also set options for how you want punctuation handled, how large the terms are that you are extracting, and so forth.

LDA then uses Bayes' theorem to determine which topics might be associated with individual words. Words are not exclusively associated with groups; instead, each n-gram has a learned probability of being associated with each of the discovered classes.
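
To make the generative view concrete, the following sketch samples a tiny corpus from the LDA generative process using NumPy. This is an illustration of the model's assumptions, not this module's implementation; the vocabulary, topic count, and Dirichlet priors are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["movie", "great", "awful", "price", "ship"]   # hypothetical vocabulary
    n_topics, alpha, rho = 2, 0.5, 0.5                     # invented priors for illustration

    # Each topic is a distribution over words (drawn from a Dirichlet prior).
    topic_word = rng.dirichlet([rho] * len(vocab), size=n_topics)

    for d in range(3):                      # generate 3 tiny "documents"
        # Each document is a mixture of corpus-wide topics.
        doc_topics = rng.dirichlet([alpha] * n_topics)
        words = []
        for _ in range(6):                  # each word is drawn from one of the topics
            z = rng.choice(n_topics, p=doc_topics)
            words.append(rng.choice(vocab, p=topic_word[z]))
        print(f"doc {d}: mixture={np.round(doc_topics, 2)} words={words}")

Inference runs this process in reverse: given only the words, it recovers plausible topic-word and document-topic distributions.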

The module outputs:

  • The source text with a score for each category

  • A feature matrix containing extracted terms and coefficients for each category

  • A transformation that you can save and reapply to new text used as input

This particular implementation of LDA uses the Vowpal Wabbit library and therefore is very fast. For more information about Vowpal Wabbit, see Vowpal Wabbit Train.

How to Configure LDA

To use this module, you must provide a dataset containing one or more text columns.

You can configure the behavior of the Vowpal Wabbit implementation of LDA by using these parameters:

  • Target columns
Use the Column Selector to choose the columns for analysis. You can choose multiple columns, but they must be of the string data type.
  • Number of topics to model
    Type the number of categories or topics that you want to derive from the input text.
  • Rho parameter
    Specify a prior probability for the sparsity of topic distributions.
  • Alpha parameter
    Specify a prior probability for the sparsity of per-document topic weights.
  • Size of the batch
    Type an integer value that indicates the number of rows to pass in each batch of text sent to Vowpal Wabbit
  • N-grams
    Type a number that specifies the maximum length of N-grams generated during hashing.

    By default, bigrams and unigrams are generated.

  • Number of passes over the data
    Specify the number of times the algorithm will cycle over the data (epochs).
  • Delimiters: basic punctuation
    Select this option if you want to discard basic punctuation; that is, not treat punctuation as meaningful tokens in text.

    Basic punctuation includes these characters:

  • Delimiters: white space characters
    Select this option if you want to discard white space characters; that is, not treat white space as meaningful tokens in text.

    White space characters include these characters:

  • Delimiters: controls
    Select this option if you want to discard control characters; that is, not treat control sequences as meaningful tokens in text.

    Control sequences include these characters:

  • Delimiters: basic brackets
    Select this option if you want to discard brackets; that is, brackets should not be treated as meaningful tokens in text.

    Basic brackets include these characters:

  • Custom delimiters
    Type a list of other characters to use as delimiters.

    Spaces between characters are optional and will be ignored.

  • Estimated number of documents
    Provide an estimate of the number of documents (rows) that will be processed.

    Corresponds to the lda_D parameter in Vowpal Wabbit.

  • Initial value of iteration count
    Specify the initial number of iterations to use when updating the learning rate on a schedule.

    Corresponds to the initial_t parameter in Vowpal Wabbit.

  • Power applied to the iteration during updates
    Specify the level of power applied to the iteration count during online updates.

    Corresponds to the power_t parameter in Vowpal Wabbit.

Some additional parameters are optional or deprecated. For details, see the documentation for LDA in the Vowpal Wabbit repository.
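
As a rough guide to how these options line up with the underlying library, the sketch below assembles a Vowpal Wabbit command line from the parameters described above. The lda_D, initial_t, and power_t correspondences are stated in this article; the remaining flag names (--lda, --lda_alpha, --lda_rho, --minibatch, --passes, --ngram) are standard Vowpal Wabbit options, but verify them against the VW documentation for your version. The values are placeholders.

    # Hypothetical mapping from this module's parameters to VW command-line flags.
    params = {
        "--lda": 5,           # number of topics to model
        "--lda_alpha": 0.1,   # alpha parameter (per-document topic sparsity)
        "--lda_rho": 0.1,     # rho parameter (topic distribution sparsity)
        "--lda_D": 1000,      # estimated number of documents
        "--minibatch": 256,   # size of the batch
        "--passes": 2,        # number of passes over the data
        "--ngram": 2,         # maximum n-gram length
        "--initial_t": 1,     # initial value of iteration count
        "--power_t": 0.5,     # power applied to the iteration during updates
    }
    cmd = ["vw"] + [str(tok) for kv in params.items() for tok in kv]
    print(" ".join(cmd))
    # Note: in a real VW run, multiple passes also require a cache file (-c).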

Examples

For examples of how to use text analytics, see these experiments in the Model Gallery:

  • The Execute Python Script sample uses natural language processing in Python to clean and transform text.

Technical Notes

Introduction to LDA

LDA is an algorithm that is commonly used for content-based topic modeling, a method used for learning categories from unclassified text. In content-based topic modeling:

  • Each topic is a distribution over words.

    For example, a topic learned from customers' reviews of a product contains many terms, and you can measure each term's probability of occurring under that topic.

  • Each document is a mixture of corpus-wide topics.

    That is, in our product review example, terms are typically not exclusive to one product, but can refer to other products, or be general terms that apply to everything (“great”, “awful”) or be noise words.

  • Each word is drawn from one of these topics.

    The method does not attempt to capture all words in the universe, only those in the target domain.

  • A distance-based similarity measure is used to determine whether two pieces of text are like each other.

    For example, you might find that the product has multiple names which are strongly correlated. Or, you might find that strongly negative terms are usually associated with a particular product. You can use the similarity measure both to identify related terms and to create recommendations.
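
The similarity computation itself is not exposed by the module, but the idea can be sketched directly. The following minimal example assumes documents are already represented as per-document topic mixtures (the vectors below are invented) and uses the Hellinger distance, a common choice for comparing probability distributions:

    import numpy as np

    def hellinger(p, q):
        """Hellinger distance between two discrete probability distributions."""
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    # Invented per-document topic mixtures (each sums to 1).
    review_a = np.array([0.70, 0.20, 0.10])
    review_b = np.array([0.65, 0.25, 0.10])
    review_c = np.array([0.05, 0.15, 0.80])

    print(hellinger(review_a, review_b))  # small: the reviews discuss similar topics
    print(hellinger(review_a, review_c))  # large: the reviews diverge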

Understanding LDA Results

The module returns multiple results. To illustrate the results, you can apply LDA to a simple list of movie titles, like the one shown in the following table.

Movie name
----------
The Cheat (1915)
The Fireman (1916)
The Floorwalker (1916)
The Rink (1916)
Easy Street (1917)
The Immigrant (1917)

The module automatically tokenizes the text and strips punctuation, based on parameters you supply, and generates the following results:

  • Transformed dataset.   Contains the input text, and a specified number of discovered categories, together with the scores for each text example for each category.

    For example, if you use default settings, LDA creates 5 categories and assigns a score to each movie title for each topic:

    Movie name              Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
    The Cheat (1915)        0.018182  0.018182  0.018182  0.018182  0.927272
    The Fireman (1916)      0.192446  0.018182  0.018182  0.018182  0.753007
    The Floorwalker (1916)  0.013334  0.013334  0.821866  0.013334  0.138133
    The Rink (1916)         0.013333  0.013333  0.013333  0.013333  0.946666
    Easy Street (1917)      0.028572  0.028572  0.028572  0.028572  0.885713
    The Immigrant (1917)    0.018182  0.018182  0.018182  0.018182  0.927272

  • Feature topic matrix

    In this output, the features are the tokenized words, in Col1. The remaining columns contain the categories that you specified. (Note the shift in column index values; you might want to use Metadata Editor to rename the columns to avoid confusion.)

    Each word is accompanied by a score that indicates its coefficient for that particular category.

    Col1       Col2       Col3       Col4       Col5       Col6
    1917       1.22412    1.208375   0.060055   1.209502   3.431797
    The        120.8806   0.010696   555.0803   0.012517   447.5839
    Cheat      11.76316   16.06443   16.31221   7.404633   167.5966
    Fireman    7.410565   11.40886   170.7509   16.94104   7.884609
    Floorwalk  6.601284   9.910877   7.362141   28.30476   173.0255

  • LDA transformation

    The module also outputs the transformation that applies LDA to the dataset, as an ITransform interface.

    You can save this transformation and re-use it for other datasets.

    This is useful if you have trained on a large corpus and want to reuse the coefficients.
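
The Vowpal Wabbit transformation itself is a black box, but the fit-once, reapply-later pattern can be sketched with scikit-learn's LatentDirichletAllocation. This is a different implementation from this module, used here purely for illustration:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    titles = ["The Cheat", "The Fireman", "The Floorwalker",
              "The Rink", "Easy Street", "The Immigrant"]

    # Tokenize and count terms, then learn 5 topics (mirroring the default above).
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(titles)
    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)

    doc_topics = lda.transform(counts)      # per-title topic scores (transformed dataset)
    word_topics = lda.components_           # term-by-topic weights (feature topic matrix)

    # The fitted objects play the role of the saved transformation:
    # reapply them to new text without retraining.
    new_scores = lda.transform(vectorizer.transform(["The Kid"]))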

Refining the LDA Model

Because each task has unique requirements, and each corpus differs in how its terms are distributed, you typically cannot create a single LDA model that meets all needs. Instead, you must tune the model parameters, gather customer feedback, and use visualization to understand the results.

You might also use qualitative measures to assess the models, such as:

  1. Accuracy.   Are similar items really similar?

  2. Diversity.   Can the model discriminate between similar items when required for the business problem?

  3. Scalability.   Does it work on a wide range of text categories or only on a narrow target domain?

Performance of models based on LDA might be improved by using natural language processing to simplify, clean, or categorize text. For example, the following techniques (sketched in code after this list) might be used to improve classification accuracy:

  • Stop word removal

  • Case normalization

  • Lemmatization or stemming

  • Named entity recognition
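
A minimal preprocessing sketch along these lines, assuming NLTK is available; it covers case normalization, stop word removal, and stemming (named entity recognition would need a heavier tool):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)

    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        # Case normalization, stop word removal, and stemming,
        # applied before the text reaches LDA.
        tokens = text.lower().split()
        return [stemmer.stem(t) for t in tokens if t not in stop]

    print(preprocess("The Floorwalker delivers great and awful moments"))
    # Stop words ("the", "and") are removed; remaining tokens are
    # lowercased and reduced to their stems.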

See the Model Gallery for examples of experiments that use natural language processing in Python, feature hashing, and other text processing techniques.

See Also

Concepts

Text Analytics
Feature Hashing
Named Entity Recognition
Vowpal Wabbit Score
Vowpal Wabbit Train