Latent Dirichlet Allocation

Use the Vowpal Wabbit library to classify text using latent Dirichlet allocation.

Category: Data Transformation / Filter

Module Overview

You can use the Latent Dirichlet Allocation module to group text into a number of categories.

Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to group previously unclassified sets of observations by similarities. LDA is a generative model, not a classification model, meaning that you don’t start with known labels and then infer the patterns that create the group labels, but rather generate a probabilistic topic model that you can use to classify both existing and new instances. A generative model is useful because it avoids making any strong assumptions about the relationship between the text and categories. Instead, it uses a distribution of words to mathematically model topics.

To use this module, you pass in a dataset that contains a column of text, either raw or preprocessed, and indicate how many categories you want to extract from the text. You can also set options for how you want punctuation handled, how large the terms are that you are extracting, and so forth.

LDA then uses Bayes' theorem to determine which topics might be associated with individual words. Words are not exclusively associated with groups; instead, each n-gram has a learned probability of being associated with each of the discovered classes.
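
To make the generative view concrete, the following sketch samples a tiny corpus from the LDA generative process using NumPy. This is an illustration of the model's assumptions, not this module's implementation; the vocabulary, topic count, and Dirichlet priors are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["movie", "great", "awful", "price", "ship"]   # hypothetical vocabulary
    n_topics, alpha, rho = 2, 0.5, 0.5                     # invented priors for illustration

    # Each topic is a distribution over words (drawn from a Dirichlet prior).
    topic_word = rng.dirichlet([rho] * len(vocab), size=n_topics)

    for d in range(3):                      # generate 3 tiny "documents"
        # Each document is a mixture of corpus-wide topics.
        doc_topics = rng.dirichlet([alpha] * n_topics)
        words = []
        for _ in range(6):                  # each word is drawn from one of the topics
            z = rng.choice(n_topics, p=doc_topics)
            words.append(rng.choice(vocab, p=topic_word[z]))
        print(f"doc {d}: mixture={np.round(doc_topics, 2)} words={words}")

Inference runs this process in reverse: given only the words, it recovers plausible topic-word and document-topic distributions.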

The module outputs:

  • The source text with a score for each category

  • A feature matrix containing extracted terms and coefficients for each category

  • A transformation that you can save and reapply to new text used as input

This particular implementation of LDA uses the Vowpal Wabbit library and therefore is very fast. For more information about Vowpal Wabbit, see Vowpal Wabbit Train.

How to Configure LDA

To use this module, you must provide a dataset containing one or more text columns.

You can configure the behavior of the Vowpal Wabbit implementation of LDA by using these parameters:

  • Target columns
Use the Column Selector to choose the columns for analysis. You can choose multiple columns, but they must be of the string data type.
  • Number of topics to model
    Type the number of categories or topics that you want to derive from the input text.
  • Rho parameter
    Specify a prior probability for the sparsity of topic distributions.
  • Alpha parameter
    Specify a prior probability for the sparsity of per-document topic weights.
  • Size of the batch
    Type an integer value that indicates the number of rows to pass in each batch of text sent to Vowpal Wabbit
  • N-grams
    Type a number that specifies the maximum length of N-grams generated during hashing.

    By default, bigrams and unigrams are generated.

  • Number of passes over the data
    Specify the number of times the algorithm will cycle over the data (epochs).
  • Delimiters: basic punctuation
    Select this option if you want to discard basic punctuation; that is, not treat punctuation as meaningful tokens in text.

    Basic punctuation includes these characters:

  • Delimiters: white space characters
    Select this option if you want to discard white space characters; that is, not treat white space as meaningful tokens in text.

    White space characters include these characters:

  • Delimiters: controls
    Select this option if you want to discard control characters; that is, not treat control sequences as meaningful tokens in text.

    Control sequences include these characters:

  • Delimiters: basic brackets
    Select this option if you want to discard brackets; that is, brackets should not be treated as meaningful tokens in text.

    Basic brackets include these characters:

  • Custom delimiters
    Type a list of other characters to use as delimiters.

    Spaces between characters are optional and will be ignored.

  • Estimated number of documents
    Provide an estimate of the number of documents (rows) that will be processed.

    Corresponds to the lda_D parameter in Vowpal Wabbit.

  • Initial value of iteration count
    Specify the initial number of iterations to use when updating the learning rate on a schedule.

    Corresponds to the initial_t parameter in Vowpal Wabbit.

  • Power applied to the iteration during updates
    Specify the level of power applied to the iteration count during online updates.

    Corresponds to the power_t parameter in Vowpal Wabbit.

Some additional parameters are optional or deprecated. For details, see the documentation for LDA in the Vowpal Wabbit repository.
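
As a rough guide to how these options line up with the underlying library, the sketch below assembles a Vowpal Wabbit command line from the parameters described above. The lda_D, initial_t, and power_t correspondences are stated in this article; the remaining flag names (--lda, --lda_alpha, --lda_rho, --minibatch, --passes, --ngram) are standard Vowpal Wabbit options, but verify them against the VW documentation for your version. The values are placeholders.

    # Hypothetical mapping from this module's parameters to VW command-line flags.
    params = {
        "--lda": 5,           # number of topics to model
        "--lda_alpha": 0.1,   # alpha parameter (per-document topic sparsity)
        "--lda_rho": 0.1,     # rho parameter (topic distribution sparsity)
        "--lda_D": 1000,      # estimated number of documents
        "--minibatch": 256,   # size of the batch
        "--passes": 2,        # number of passes over the data
        "--ngram": 2,         # maximum n-gram length
        "--initial_t": 1,     # initial value of iteration count
        "--power_t": 0.5,     # power applied to the iteration during updates
    }
    cmd = ["vw"] + [str(tok) for kv in params.items() for tok in kv]
    print(" ".join(cmd))
    # Note: in a real VW run, multiple passes also require a cache file (-c).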

Examples

For examples of how to use text analytics, see these experiments in the Model Gallery:

  • The Execute Python Script sample uses natural language processing in Python to clean and transform text.

Technical Notes

Introduction to LDA

LDA is an algorithm that is commonly used for content-based topic modeling, a method used for learning categories from unclassified text. In content-based topic modeling:

  • Each topic is a distribution over words.

    For example, a topic learned from customers' reviews of a product contains many terms, and you can measure each term's probability of occurring under that topic.

  • Each document is a mixture of corpus-wide topics.

    That is, in our product review example, terms are typically not exclusive to one product, but can refer to other products, or be general terms that apply to everything (“great”, “awful”) or be noise words.

  • Each word is drawn from one of these topics.

    The method does not attempt to capture all words in the universe, only those in the target domain.

  • A distance-based similarity measure is used to determine whether two pieces of text are like each other.

    For example, you might find that the product has multiple names which are strongly correlated. Or, you might find that strongly negative terms are usually associated with a particular product. You can use the similarity measure both to identify related terms and to create recommendations.
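
The similarity computation itself is not exposed by the module, but the idea can be sketched directly. The following minimal example assumes documents are already represented as per-document topic mixtures (the vectors below are invented) and uses the Hellinger distance, a common choice for comparing probability distributions:

    import numpy as np

    def hellinger(p, q):
        """Hellinger distance between two discrete probability distributions."""
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    # Invented per-document topic mixtures (each sums to 1).
    review_a = np.array([0.70, 0.20, 0.10])
    review_b = np.array([0.65, 0.25, 0.10])
    review_c = np.array([0.05, 0.15, 0.80])

    print(hellinger(review_a, review_b))  # small: the reviews discuss similar topics
    print(hellinger(review_a, review_c))  # large: the reviews diverge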

Understanding LDA Results

The module returns multiple results. To illustrate the results, you can apply LDA to a simple list of movie titles, like the one shown in the following table.

Movie name
----------
The Cheat (1915)
The Fireman (1916)
The Floorwalker (1916)
The Rink (1916)
Easy Street (1917)
The Immigrant (1917)

The module automatically tokenizes the text and strips punctuation, based on parameters you supply, and generates the following results:

  • Transformed dataset.   Contains the input text, and a specified number of discovered categories, together with the scores for each text example for each category.

    For example, if you use default settings, LDA creates 5 categories and assigns a score to each movie title for each topic:

    Movie name              Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
    The Cheat (1915)        0.018182  0.018182  0.018182  0.018182  0.927272
    The Fireman (1916)      0.192446  0.018182  0.018182  0.018182  0.753007
    The Floorwalker (1916)  0.013334  0.013334  0.821866  0.013334  0.138133
    The Rink (1916)         0.013333  0.013333  0.013333  0.013333  0.946666
    Easy Street (1917)      0.028572  0.028572  0.028572  0.028572  0.885713
    The Immigrant (1917)    0.018182  0.018182  0.018182  0.018182  0.927272

  • Feature topic matrix

    In this output, the features are the tokenized words, in Col1. The remaining columns contain the categories that you specified. (Note the shift in column index values; you might want to use Metadata Editor to rename the columns to avoid confusion.)

    Each word is accompanied by a score that indicates its coefficient for that particular category.

    Col1       Col2       Col3       Col4       Col5       Col6
    1917       1.22412    1.208375   0.060055   1.209502   3.431797
    The        120.8806   0.010696   555.0803   0.012517   447.5839
    Cheat      11.76316   16.06443   16.31221   7.404633   167.5966
    Fireman    7.410565   11.40886   170.7509   16.94104   7.884609
    Floorwalk  6.601284   9.910877   7.362141   28.30476   173.0255

  • LDA transformation

    The module also outputs the transformation that applies LDA to the dataset, as an ITransform interface.

    You can save this transformation and re-use it for other datasets.

    This is useful if you have trained on a large corpus and want to reuse the coefficients.
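
The Vowpal Wabbit transformation itself is a black box, but the fit-once, reapply-later pattern can be sketched with scikit-learn's LatentDirichletAllocation. This is a different implementation from this module, used here purely for illustration:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    titles = ["The Cheat", "The Fireman", "The Floorwalker",
              "The Rink", "Easy Street", "The Immigrant"]

    # Tokenize and count terms, then learn 5 topics (mirroring the default above).
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(titles)
    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)

    doc_topics = lda.transform(counts)      # per-title topic scores (transformed dataset)
    word_topics = lda.components_           # term-by-topic weights (feature topic matrix)

    # The fitted objects play the role of the saved transformation:
    # reapply them to new text without retraining.
    new_scores = lda.transform(vectorizer.transform(["The Kid"]))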

Refining the LDA Model

Because each task has unique requirements, and each corpus differs in how its terms are distributed, you typically cannot create a single LDA model that meets all needs. Instead, you must tune the model parameters, gather customer feedback, and use visualization to understand the results.

You might also use qualitative measures to assess the models, such as:

  1. Accuracy.   Are similar items really similar?

  2. Diversity.   Can the model discriminate between similar items when required for the business problem?

  3. Scalability.   Does it work on a wide range of text categories or only on a narrow target domain?

Performance of models based on LDA might be improved by using natural language processing to simplify, clean, or categorize text. For example, the following techniques (sketched in code after this list) might be used to improve classification accuracy:

  • Stop word removal

  • Case normalization

  • Lemmatization or stemming

  • Named entity recognition
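
A minimal preprocessing sketch along these lines, assuming NLTK is available; it covers case normalization, stop word removal, and stemming (named entity recognition would need a heavier tool):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)

    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        # Case normalization, stop word removal, and stemming,
        # applied before the text reaches LDA.
        tokens = text.lower().split()
        return [stemmer.stem(t) for t in tokens if t not in stop]

    print(preprocess("The Floorwalker delivers great and awful moments"))
    # Stop words ("the", "and") are removed; remaining tokens are
    # lowercased and reduced to their stems.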

See the Model Gallery for examples of experiments that use natural language processing in Python, feature hashing, and other text processing techniques.

See Also

Concepts

Text Analytics
Feature Hashing
Named Entity Recognition
Vowpal Wabbit Score
Vowpal Wabbit Train