Preprocess Text

 

Updated: May 31, 2017

Performs cleaning operations on text

Category: Text Analytics

You can use the Preprocess Text module to clean and simplify text. By preprocessing the text, you can more easily create meaningful features from text. For example, the Preprocess Text module supports these common operations on text:

  • Removal of stop-words

  • Definition of regular expressions, to let you search for and replace specific target strings

  • Lemmatization, which converts multiple related words to a single canonical form

  • Filtering on specific parts of speech

  • Case normalization

  • Removal of certain classes of characters, such as numbers, special characters, and sequences of repeated characters such as "aaaa"

  • Identification and removal of emails and URLs

You can choose which cleaning options to use, and optionally specify a custom list of stop-words.

The module currently supports six languages: English, Spanish, French, Dutch, German and Italian.

  1. Add the Preprocess Text module to your experiment, and connect a dataset that has at least one column containing text.

  2. If the text you are preprocessing is all in the same language, select the language from the Language dropdown list. The text is then preprocessed using linguistic rules specific to the selected language.

    • Dutch
    • English
    • French
    • German
    • Italian
    • Spanish
  3. If you need to preprocess text in multiple languages, choose the Column contains language option.

    You can then use the Culture-language column field to pick a column that specifies the language to use, based on an identifier such as "English" or "en". The module will then check the language identifier for each row in the dataset, and use the appropriate linguistic resources to process the text.

    If the dataset does not contain such a column, use the Detect Language module to analyze the language and generate an identifier.

    Tip

    An error will be raised if an unsupported language is included. See the Technical Notes section for more information.

  4. Select the Remove by part of speech option if you want to use part-of-speech analysis to identify and remove certain classes of words. If you set this option to True, you can choose which word classes to remove.

    • Remove nouns. Select this option to remove nouns.

    • Remove adjectives. Select this option to remove adjectives.

    • Remove verbs. Select this option to remove verbs.

    For more information about the part-of-speech identification method used, see Technical Notes.

  5. In Text column to clean, choose the column or columns that you want to preprocess.

  6. Select the Remove stop words option if you want to apply a predefined stopword list to the text column before any other processes are performed.

    Stopword lists are language dependent and customizable; for more information, see Technical Notes.

  7. Select the Lemmatization option if you want words to be represented in their canonical form. This option is useful for reducing the number of unique occurrences of otherwise similar text tokens.

    The lemmatization process is language-dependent; see the Technical Notes section for details.

  8. Select the Detect sentences option to mark sentence boundaries with a specific series of characters.

    This module uses three pipe characters ( ||| ) to represent the sentence terminator.

  9. To perform custom find-and-replace operations, you can define a target string and its replacement string, using regular expressions.

    • Use the Custom regular expression field to define the text you are searching for.

    • Use the Custom replacement string field to define a single replacement to use for all found instances.

  10. Select the option Normalize case to lowercase if you want to convert ASCII uppercase characters to their lowercase forms.

    If characters are not normalized, the same word in uppercase and lowercase letters would be considered two different words.

  11. Optionally, you can specify types of characters or character sequences to remove from the processed text.

    • Remove numbers. If you select this option, all numeric characters for the specified language are removed.

      Note that the identification of what constitutes a number is domain dependent and language dependent. If numeric characters are an integral part of a known word, the number might not be removed.

    • Remove special characters. If you select this option, any non-alphanumeric special characters will be replaced with the pipe (|) character.

      The list of special characters is defined in the Technical Notes section.

    • Remove duplicate characters. If you select this option, any sequence of repeated characters is removed. For example, "aaaaa" would be removed.

    • Remove email addresses. If you select this option, any sequence of the format <string>@<string> will be removed.

    • Remove URLs. If you select this option, any sequence including the following formats will be removed:

      • http and https

      • ftp

      • www prefix

  12. The option Expand verb contractions is applicable only to languages that use verb contractions; currently, English only. For example, by selecting this option, you could replace the phrase "wouldn't stay there" with "would not stay there".

  13. Use the option Normalize backslashes to slashes to map all instances of "\" to "/".

  14. Select the option Split tokens on special characters if you want to break words on characters such as &, -, and so forth. For example, if you select this option, MS-WORD would be separated into two tokens, MS and WORD.
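The character-level options in steps 9 through 14 can be approximated outside the module with regular expressions. The following Python sketch is only an illustration: the function names, the patterns, and the three-character threshold for repeated characters are assumptions, not the rules that the module actually applies.

```python
import re

# Hypothetical helpers that mimic some of the cleaning options described above.
# The patterns are simplified assumptions, not the module's actual rules.

def apply_custom_regex(text, pattern, replacement):
    """Replace every match of a custom regular expression with one string."""
    return re.sub(pattern, replacement, text)

def normalize_backslashes(text):
    """Map every backslash to a forward slash."""
    return text.replace("\\", "/")

def remove_urls(text):
    """Remove sequences that start with http, https, ftp, or a www prefix."""
    return re.sub(r"(?:https?://|ftp://|www\.)\S+", "", text, flags=re.IGNORECASE)

def remove_emails(text):
    """Remove sequences of the form <string>@<string>."""
    return re.sub(r"\S+@\S+", "", text)

def remove_numbers(text):
    """Remove standalone digit runs; digits inside a word are left alone."""
    return re.sub(r"\b\d+\b", "", text)

def remove_duplicate_characters(text, min_run=3):
    """Remove runs of the same word character (assumed threshold: 3 or more)."""
    return re.sub(r"(\w)\1{%d,}" % (min_run - 1), "", text)

def replace_special_characters(text):
    """Replace each character that is neither a word character nor whitespace with a pipe."""
    return re.sub(r"[^\w\s]", "|", text)

def split_on_special_characters(token):
    """Break a token on characters such as & and -."""
    return [part for part in re.split(r"[^\w\s]+", token) if part]

print(split_on_special_characters("MS-WORD"))         # ['MS', 'WORD']
print(normalize_backslashes("C:\\temp"))              # C:/temp
print(replace_special_characters("wait: what?"))      # wait| what|
```

Each helper is independent, so a caller can chain them in any order. The module itself applies its operations in a fixed order.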

The following examples in the Cortana Intelligence Gallery illustrate the use of the Preprocess Text module:

This section provides more information about the underlying text pre-processing technology, and how to specify custom text resources.

Supported Languages

Currently Azure Machine Learning supports text preprocessing in these languages:

  • Dutch

  • English

  • French

  • German

  • Italian

  • Spanish

Additional languages are planned. See the Microsoft Machine Learning blog for announcements.

Lemmatization

Lemmatization is the process of identifying a single canonical form to represent multiple word tokens.

The natural language processing libraries included in Azure Machine Learning combine the following linguistic operations to provide lemmatization:

  • Sentence separation. In free text used for sentiment analysis and other text analytics, sentences are frequently run-on or punctuation might be missing. Input texts might constitute an arbitrarily long chunk of text, ranging from a tweet or fragment to a complete paragraph, or even document.

    The natural language tools used by Azure ML perform sentence separation as part of the underlying lexical analysis. However, sentences are not separated in the output. Optionally, you can specify that a sentence boundary be marked to aid in other text processing and analysis.

  • Tokenization. The rules that determine the boundaries of words are language-dependent and can be complex even in languages that use spaces between words. Some languages (such as Chinese or Japanese) do not use any white space between words, and separation of words requires morphological analysis. Therefore, the tokenization methods and rules used in this module will provide different results from language to language. These tokenization rules are determined by text analysis libraries provided by Microsoft Research for each supported language, and cannot be customized.

  • Part-of-speech identification. Given a sequence of words, many words will be ambiguous and will have multiple options for part of speech. Parts of speech are also very different depending on the morphology of different languages.

    In Azure Machine Learning, a disambiguation model is used to choose the single most likely part of speech, given the current sentence context. The part-of-speech information is used to help filter words used as features and aid in key-phrase extraction. However, the output of this module does not explicitly include POS tags and therefore cannot be used to generate POS-tagged text.

  • Generating dictionary form. A word may have multiple lemmas, or dictionary forms, each coming from a different analysis. For instance, the English word building has two possible lemmas: building if the word is a noun ("the tall building"), or build if the word is a verb ("they are building a house"). In Azure Machine Learning, only the single most probable dictionary form is generated.
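To illustrate why the part of speech matters when generating a dictionary form, here is a toy Python sketch. It assumes the tokens have already been tagged, and it uses a tiny hand-written lexicon. The tag names, the lexicon entries, and the `lemmatize` function are hypothetical; the module's language models are far more elaborate.

```python
# Toy lexicon keyed on (lowercased token, part of speech).
# The entries and the tag names are illustrative assumptions only.
LEMMAS = {
    ("building", "NOUN"): "building",
    ("building", "VERB"): "build",
    ("buildings", "NOUN"): "building",
    ("are", "AUX"): "be",
}

def lemmatize(tagged_tokens):
    """Map (token, tag) pairs to lemmas; unknown pairs fall back to the lowercased token."""
    return [LEMMAS.get((token.lower(), tag), token.lower())
            for token, tag in tagged_tokens]

as_noun = [("the", "DET"), ("tall", "ADJ"), ("building", "NOUN")]
as_verb = [("They", "PRON"), ("are", "AUX"), ("building", "VERB")]

print(lemmatize(as_noun))  # ['the', 'tall', 'building']
print(lemmatize(as_verb))  # ['they', 'be', 'build']
```

The same surface form, building, yields a different lemma depending on the tag assigned to it, which is why tagging errors propagate into the lemmatized output.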

Example

| Source | Lemmatized with case conversion |
|---|---|
| He is swimming | he i swim |
| He is going for a swim | he i go for a swim |
| Swimming is good for building muscle | swim be good for build muscle |
| He is building a building | he i build a build |
| We are all building buildings | we be all build building |

Note

The language models used to generate dictionary form have been trained and tested against a variety of general purpose and technical texts, and are used in many other Microsoft products that require natural language APIs. However, natural language is inherently ambiguous and 100% accuracy on all vocabulary is not feasible. For example, lemmatization can be affected by other parts of speech, or by the way that the sentence is parsed.

If you need to perform additional pre-processing, or perform linguistic analysis using a specialized or domain-dependent vocabulary, we recommend that you use customizable NLP tools, such as those available in Python and R.

Special Characters

Special characters are defined as single characters that cannot be identified as any other part of speech, and can include punctuation: colons, semi-colons, and so forth.

Stopwords

A stopword is a word that is often removed from indexes because it is common and provides little value for information retrieval, even though it might be linguistically meaningful. For example, many languages make a semantic distinction between definite and indefinite articles ("the building" vs "a building"), but for machine learning and information retrieval, the information is often unreliable, not available, or not relevant. Hence it is common practice to discard these words.

The Azure Machine Learning environment includes lists of the most common stopwords for each of the supported languages.

| Language | Number of stopwords | Examples |
|---|---|---|
| Dutch | 49 | aan, af, al |
| English | 312 | a, about, above |
| French | 154 | de, des, d', la |
| German | 602 | a, ab, aber |
| Italian | 135 | a, adesso, ai |
| Spanish | 368 | ésa, ésta, éste |

For your convenience, a zipped file containing the default stopwords for all current languages has been made available on Azure: Stopwords.zip.

How to Modify the Stopword List

We expect that many users will want to create their own stopword lists, or change the terms included in the default list. The following experiment in the Cortana Intelligence Gallery demonstrates how you can customize a stop word list.

If you modify the list, or create your own stop word list, observe these requirements:

  • The file must contain a single text column.

    You might get the following error if an additional column is present:

    Preprocess Text Error Column selection pattern "Text column to clean" is expected to provide 1 column(s) selected in input dataset, but 2 column(s) is/are actually provided. ( Error 0022 )

    This can happen as a result of spaces, tabs, or hidden columns present in the file from which the stopword list was originally imported. Depending on how the file was prepared, tabs or commas included in text can also cause multiple columns to be created. If you get this error, you can review the source file, or use the Select Columns in Dataset module to choose a single column to pass to the Preprocess Text module.

  • Each row can contain only one word. For the purposes of parsing the file, words are delimited by spaces.

  • The stopword list cannot be empty.
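Before uploading a custom stopword list, you can check it against the requirements above. The following Python sketch is a hypothetical pre-flight check and not part of the module. It assumes that tabs and commas would be read as column separators and that blank lines can be skipped.

```python
def validate_stopwords(lines):
    """Check a one-word-per-row stopword list and return the cleaned words.

    `lines` is any iterable of strings, for example an open text file.
    Treating tabs and commas as column separators is an assumption.
    """
    words = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank rows are ignored
        if "\t" in line or "," in line:
            raise ValueError(f"row {line_number}: more than one column")
        if len(line.split()) != 1:
            raise ValueError(f"row {line_number}: more than one word")
        words.append(line.strip())
    if not words:
        raise ValueError("the stopword list is empty")
    return words

print(validate_stopwords(["a\n", "about\n", "above\n"]))  # ['a', 'about', 'above']
```

To check a file on disk, pass the open file object, for example `validate_stopwords(open("stopwords.txt", encoding="utf-8"))`, where the file name is a placeholder.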

In this module, you can apply multiple operations to text. However, the order in which these operations are applied cannot be changed. This can affect the expected results.

For example, if you apply lemmatization to text, and also use stopword removal, all the words are converted to their lemma forms before the stopword list is applied. Therefore, if your text includes a word that is not in the stopword list, but its lemma is in the stopword list, the word would be removed.

Be sure to test target terms in advance to guarantee the correct results.
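The interaction between lemmatization and stopword removal can be reproduced with a small Python sketch. The lemma table, the stopword set, and the `preprocess` function are toy assumptions chosen to show the effect; they are not the module's resources.

```python
# Toy resources, chosen only to demonstrate the ordering effect.
TOY_LEMMAS = {"was": "be", "is": "be", "houses": "house"}
TOY_STOPWORDS = {"be", "the"}

def preprocess(tokens, lemmatize=True, remove_stopwords=True):
    """Lemmatize first, then filter stopwords, mirroring the order described above."""
    if lemmatize:
        tokens = [TOY_LEMMAS.get(token, token) for token in tokens]
    if remove_stopwords:
        tokens = [token for token in tokens if token not in TOY_STOPWORDS]
    return tokens

tokens = ["the", "houses", "was", "red"]
print(preprocess(tokens))                   # ['house', 'red']
print(preprocess(tokens, lemmatize=False))  # ['houses', 'was', 'red']
```

The word "was" is not in the toy stopword list, but its lemma "be" is, so it disappears only when lemmatization is enabled.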

If your text column includes languages not supported by Azure Machine Learning, we recommend that you use only those options that do not require language-dependent processing. This can help avoid unexpected results.

Also, if you use the option Column contains language, you must ensure that no unsupported languages are included in the text that is processed. If an unsupported language or its identifier is present in the dataset, the following run-time error will be generated:

Preprocess Text Error (0039): Please specify a supported language.

To avoid failing the entire experiment because an unsupported language was detected, you can use the Split Data module, and specify a regular expression to divide the dataset into supported and unsupported languages. For example, the following regular expression splits the dataset based on the detected language for the column Sentence:

\"Sentence Language" Dutch|English|French|German|Italian|Spanish

If you have a column that contains the language identifier, or if you have generated such a column, you can use a regular expression such as the following, to filter on the identifier column:

\"Sentence Iso6391 Name" nl|en|fr|de|it|es

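The same partitioning idea can be sketched in plain Python, for example to check a dataset before you upload it. This is an illustration only: the `split_by_language` function and the `language` key are hypothetical, the identifiers are the ISO 639-1 codes for the six supported languages listed in this topic, and whole-string matching is an assumption rather than documented Split Data behavior.

```python
import re

# ISO 639-1 codes for the six supported languages listed in this topic.
SUPPORTED = re.compile(r"nl|en|fr|de|it|es")

def split_by_language(rows, language_key="language"):
    """Partition rows (dicts) into supported and unsupported languages.

    Whole-string matching is an assumption, so "eng" does not count as "en".
    """
    supported, unsupported = [], []
    for row in rows:
        if SUPPORTED.fullmatch(row[language_key]):
            supported.append(row)
        else:
            unsupported.append(row)
    return supported, unsupported

rows = [
    {"text": "Hello world", "language": "en"},
    {"text": "Hallo wereld", "language": "nl"},
    {"text": "Ola mundo", "language": "pt"},
]
supported, unsupported = split_by_language(rows)
print([row["language"] for row in supported])    # ['en', 'nl']
print([row["language"] for row in unsupported])  # ['pt']
```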
| Name | Type | Description |
|---|---|---|
| Dataset | Data Table | Input data |
| Stop words | Data Table | Optional custom list of stop words to remove |

| Name | Type | Range | Optional | Default | Description |
|---|---|---|---|---|---|
| Remove URLs | Boolean | True, False | Required | true | Remove URLs |
| Language | Language | English, Spanish, French, Dutch, German, Italian | Required | English | Select the language to preprocess |
| Text column to clean | Column Selection | | Required | StringFeature | Select the text column to clean |
| Custom regular expression | String | | Optional | | Specify the custom regular expression |
| Custom replacement string | String | | Optional | | Specify the custom replacement string for the custom regular expression |
| Remove stop words | Boolean | | Required | true | Remove stop words |
| Lemmatization | Boolean | | Required | true | Use lemmatization |
| Remove by part of speech | True False Type | true, false | Required | False | Indicate whether part-of-speech analysis should be used to identify and remove certain word classes |
| Remove nouns | Boolean | | Applies when the Remove by part of speech option is selected | true | Remove nouns |
| Remove adjectives | Boolean | | Applies when the Remove by part of speech option is selected | true | Remove adjectives |
| Remove verbs | Boolean | | Applies when the Remove by part of speech option is selected | true | Remove verbs |
| Detect sentences | Boolean | | Required | true | Detect sentences by adding a sentence terminator "|||" that can be used by the n-gram features extractor module |
| Normalize case to lowercase | Boolean | | Required | true | Normalize case to lowercase |
| Remove numbers | Boolean | | Required | true | Remove numbers |
| Remove special characters | Boolean | | Required | true | Remove non-alphanumeric special characters and replace them with the "|" character |
| Remove duplicate characters | Boolean | | Required | true | Remove duplicate characters |
| Remove email addresses | Boolean | | Required | true | Remove email addresses |

| Name | Type | Description |
|---|---|---|
| Results dataset | Data Table | Results dataset |

| Exception | Description |
|---|---|
| Error 0003 | An exception occurs if one or more of the inputs are null or empty. |
| Error 0030 | An exception occurs when it is not possible to download a file. |
| Error 0048 | An exception occurs when it is not possible to open a file. |
| Error 0049 | An exception occurs when it is not possible to parse a file. |

Text Analytics

A-Z Module List
