Latent Dirichlet Allocation
Use the Vowpal Wabbit library to classify text using latent Dirichlet allocation
Category: Data Transformation / Filter
Module Overview
You can use the Latent Dirichlet Allocation module to group text into a number of categories.
Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to group previously unclassified sets of observations by similarities. LDA is a generative model, not a classification model: rather than starting with known labels and inferring the patterns that produce them, you generate a probabilistic topic model that you can use to classify both existing and new instances. A generative model is useful because it avoids making any strong assumptions about the relationship between the text and categories. Instead, it uses a distribution of words to mathematically model topics.
To use this module, you pass in a dataset that contains a column of text, either raw or preprocessed, and indicate how many categories you want to extract from the text. You can also set options for how you want punctuation handled, how large the terms are that you are extracting, and so forth.
LDA then uses Bayes theorem to determine what topics might be associated with individual words. The words are not exclusively associated with groups; instead, each n-gram has a learned probability of being associated with any of the discovered classes.
The module outputs:
- The source text with a score for each category
- A feature matrix containing extracted terms and coefficients for each category
- A transformation that you can save and reapply to new text used as input
This particular implementation of LDA uses the Vowpal Wabbit library and therefore is very fast. For more information about Vowpal Wabbit, see Vowpal Wabbit Train.
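The module's three outputs can be sketched with scikit-learn's `LatentDirichletAllocation`, used here as a hypothetical stand-in for the Vowpal Wabbit implementation (the corpus and settings are illustrative, not the module's defaults):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the fireman rescued the cat",
    "the cat sat on the rink",
    "easy street was easy to find",
]

# Tokenize and count terms (unigrams only, for brevity).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# Fit a 2-topic model; n_components plays the role of "number of topics to model".
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # output 1: a score per document per topic
topic_term = lda.components_           # output 2: term coefficients per topic
# output 3: the fitted (vectorizer, lda) pair is the reusable transformation

print(doc_topic.shape)   # (3 documents, 2 topics)
print(topic_term.shape)  # (2 topics, vocabulary size)
```

Each row of `doc_topic` sums to 1, giving the per-category scores that appear in the transformed dataset.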
How to Configure LDA
To use this module, you must provide a dataset containing one or more text columns.
You can configure the behavior of the Vowpal Wabbit implementation of LDA by using these parameters:
- Target columns
Use the Column Selector to choose the columns for analysis. You can choose multiple columns but they must be of the string data type.
- Number of topics to model
Type the number of categories or topics that you want to derive from the input text.
- Rho parameter
Specify a prior probability for the sparsity of topic distributions
- Alpha parameter
Specify a prior probability for the sparsity of per-document topic weights
- Size of the batch
Type an integer value that indicates the number of rows to pass in each batch of text sent to Vowpal Wabbit.
- N-grams
Type a number that specifies the maximum length of n-grams generated during hashing. By default, bigrams and unigrams are generated.
- Number of passes over the data
Specify the number of times the algorithm will cycle over the data (epochs).
- Delimiters: basic punctuation
Select this option if you want to discard basic punctuation; that is, not treat punctuation as meaningful tokens in text. Basic punctuation includes these characters:
- Delimiters: white space characters
Select this option if you want to discard white space characters; that is, not treat white space as meaningful tokens in text. White space includes these characters:
- Delimiters: controls
Select this option if you want to discard control characters; that is, not treat control sequences as meaningful tokens in text. Control sequences include these characters:
- Delimiters: basic brackets
Select this option if you want to discard brackets; that is, not treat brackets as meaningful tokens in text. Basic brackets include these characters:
- Custom delimiters
Type a list of other characters to use as delimiters. Spaces between characters are optional and will be ignored.
- Estimated number of documents
Provide an estimate of the number of documents (rows) that will be processed. Corresponds to the lda_D parameter in Vowpal Wabbit.
- Initial value of iteration count
Specify the initial number of iterations to use when updating the learning rate on a schedule. Corresponds to the initial_t parameter in Vowpal Wabbit.
- Power applied to the iteration during updates
Specify the power applied to the iteration count during online updates. Corresponds to the power_t parameter in Vowpal Wabbit.
The following parameters are optional or deprecated.
For additional information, see the documentation for LDA in the Vowpal Wabbit repository.
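As a rough guide to how these parameters interact, here is a hypothetical mapping onto scikit-learn's online LDA estimator. The names on the right are sklearn's, not Vowpal Wabbit's, and the values are illustrative only:

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=5,         # "Number of topics to model"
    doc_topic_prior=0.1,    # "Alpha parameter": prior on per-document topic weights
    topic_word_prior=0.01,  # "Rho parameter": prior on per-topic word distributions
    batch_size=128,         # "Size of the batch"
    max_iter=10,            # "Number of passes over the data"
    learning_offset=10.0,   # analogous to initial_t (learning-rate schedule offset)
    learning_decay=0.7,     # analogous to power_t (learning-rate decay power)
    learning_method="online",
    total_samples=1000,     # analogous to lda_D: estimated number of documents
)
```

The correspondence is approximate; consult the Vowpal Wabbit documentation for the exact semantics of each parameter.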
Examples
For examples of how to use text analytics, see these experiments in the Model Gallery:
- The Execute Python Script sample uses natural language processing in Python to clean and transform text.
Technical Notes
Introduction to LDA
LDA is an algorithm that is commonly used for content-based topic modeling, a method used for learning categories from unclassified text. In content-based topic modeling:
Each topic is a distribution over words.
For example, a topic drawn from customer reviews of a product contains many terms, over which you can measure a probability distribution.
Each document is a mixture of corpus-wide topics.
That is, in our product review example, terms are typically not exclusive to one product, but can refer to other products, or be general terms that apply to everything (“great”, “awful”) or be noise words.
Each word is drawn from one of these topics.
The method does not attempt to capture all words in the universe, only those in the target domain.
A distance-based similarity measure is used to determine whether two pieces of text are like each other.
For example, you might find that the product has multiple names which are strongly correlated. Or, you might find that strongly negative terms are usually associated with a particular product. You can use the similarity measure both to identify related terms and to create recommendations.
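The distance-based comparison above can be sketched as a cosine similarity over documents' topic-score vectors (the vectors here are invented for illustration, in the style of the module's output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two topic-score vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = [0.02, 0.02, 0.02, 0.02, 0.92]  # mostly topic 5
doc_b = [0.19, 0.02, 0.02, 0.02, 0.75]  # also mostly topic 5
doc_c = [0.01, 0.01, 0.82, 0.01, 0.14]  # mostly topic 3

# Documents dominated by the same topic score as more similar.
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # True
```

The same similarity scores can drive related-term discovery or recommendations.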
Understanding LDA Results
The module returns multiple results. To illustrate the results, you can apply LDA to a simple list of movie titles, like the one shown in the following table.
| Movie name |
| --- |
| The Cheat (1915) |
| The Fireman (1916) |
| The Floorwalker (1916) |
| The Rink (1916) |
| Easy Street (1917) |
| The Immigrant (1917) |
The module automatically tokenizes the text and strips punctuation, based on the parameters you supply, and generates the following results:
Transformed dataset. Contains the input text, and a specified number of discovered categories, together with the scores for each text example for each category.
For example, if you use default settings, LDA creates 5 categories, and assigns a distance score to each movie title for each topic:
| Movie name | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
| --- | --- | --- | --- | --- | --- |
| The Cheat (1915) | 0.018182 | 0.018182 | 0.018182 | 0.018182 | 0.927272 |
| The Fireman (1916) | 0.192446 | 0.018182 | 0.018182 | 0.018182 | 0.753007 |
| The Floorwalker (1916) | 0.013334 | 0.013334 | 0.821866 | 0.013334 | 0.138133 |
| The Rink (1916) | 0.013333 | 0.013333 | 0.013333 | 0.013333 | 0.946666 |
| Easy Street (1917) | 0.028572 | 0.028572 | 0.028572 | 0.028572 | 0.885713 |
| The Immigrant (1917) | 0.018182 | 0.018182 | 0.018182 | 0.018182 | 0.927272 |
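A common next step is to assign each document to its highest-scoring topic. This hypothetical sketch does so for two of the score rows above:

```python
# Score rows copied from the transformed dataset above.
scores = {
    "The Cheat (1915)":       [0.018182, 0.018182, 0.018182, 0.018182, 0.927272],
    "The Floorwalker (1916)": [0.013334, 0.013334, 0.821866, 0.013334, 0.138133],
}

# Pick the index of the largest score; topics are numbered from 1.
assignments = {title: row.index(max(row)) + 1 for title, row in scores.items()}
print(assignments)  # {'The Cheat (1915)': 5, 'The Floorwalker (1916)': 3}
```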
Feature topic matrix
In this output, the features are the tokenized words, in Col1. The remaining columns contain the categories that you specified. (Note the shift in column index values; you might want to use Metadata Editor to rename the columns to avoid confusion.)
Each word is accompanied by a score that indicates its coefficient for that particular category.
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
| --- | --- | --- | --- | --- | --- |
| 1917 | 1.22412 | 1.208375 | 0.060055 | 1.209502 | 3.431797 |
| The | 120.8806 | 0.010696 | 555.0803 | 0.012517 | 447.5839 |
| Cheat | 11.76316 | 16.06443 | 16.31221 | 7.404633 | 167.5966 |
| Fireman | 7.410565 | 11.40886 | 170.7509 | 16.94104 | 7.884609 |
| Floorwalk | 6.601284 | 9.910877 | 7.362141 | 28.30476 | 173.0255 |
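One way to read a feature-topic matrix like this is to rank terms by their coefficient within each topic. The following sketch uses the illustrative values from the table above:

```python
# Coefficients copied from the feature-topic matrix above: one row per term,
# one column per topic.
coefficients = {
    "1917":      [1.22412, 1.208375, 0.060055, 1.209502, 3.431797],
    "the":       [120.8806, 0.010696, 555.0803, 0.012517, 447.5839],
    "cheat":     [11.76316, 16.06443, 16.31221, 7.404633, 167.5966],
    "fireman":   [7.410565, 11.40886, 170.7509, 16.94104, 7.884609],
    "floorwalk": [6.601284, 9.910877, 7.362141, 28.30476, 173.0255],
}

# For each topic, keep the two terms with the largest coefficients.
n_topics = 5
top_terms = {}
for t in range(n_topics):
    ranked = sorted(coefficients, key=lambda w: coefficients[w][t], reverse=True)
    top_terms[t + 1] = ranked[:2]

print(top_terms[3])  # ['the', 'fireman']
```

Stop words such as "the" often dominate these rankings, which is one reason to remove them during preprocessing.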
LDA transformation
The module also outputs the transformation that applies LDA to the dataset, as an ITransform interface.
You can save this transformation and re-use it for other datasets.
This is useful if you have trained on a large corpus and want to reuse the coefficients.
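In code, the save-and-reuse pattern might look like the following sketch, with a scikit-learn pipeline standing in for the module's ITransform output (names and corpus are illustrative):

```python
import pickle
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

train = ["the fireman and the rink", "easy street and the immigrant"]

# The fitted pipeline is the reusable transformation.
transform = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)
transform.fit(train)

blob = pickle.dumps(transform)   # save once...
restored = pickle.loads(blob)    # ...reuse later without refitting
scores = restored.transform(["the floorwalker"])
print(scores.shape)  # (1, 2): one new document scored against 2 topics
```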
Refining the LDA Model
Because each task has unique requirements and each corpus has a different distribution of terms, you typically cannot create a single LDA model that meets all needs. Instead, you must tune the model parameters, get customer feedback, and use visualization to understand the results.
You might also use qualitative measures to assess the models, such as:
Accuracy. Are similar items really similar?
Diversity. Can the model discriminate between similar items when required for the business problem?
Scalability. Does it work on a wide range of text categories or only on a narrow target domain?
Performance of models based on LDA might be improved by using natural language processing to simplify, clean, or categorize text. For example, the following techniques might be used to improve classification accuracy:
Stop word removal
Case normalization
Lemmatization or stemming
Named entity recognition
See the Model Gallery for examples of experiments that use natural language processing in Python, feature hashing, and other text processing techniques.
See Also
Concepts
Text Analytics
Feature Hashing
Named Entity Recognition
Vowpal Wabbit Score
Vowpal Wabbit Train