July 2018

Volume 33 Number 7

[Cognitive Services]

Improving LUIS Intent Classifications

By Zvi Topol | July 2018

The Language Understanding Intelligence Service (LUIS), which is part of Microsoft Cognitive Services, offers a machine learning solution for natural language understanding. There are many use cases for LUIS, including chat bots, voice interfaces and cognitive search engines.

In a nutshell, when given a textual user input, also known as an utterance, LUIS returns the intent detected behind the utterance, that is, what the user intends to ask about. It also detects the different entities—references to real-world objects—that appear in the utterance. Additionally, it outputs a confidence score for each intent and entity detected. Those are numbers in the range [0, 1], with 1 indicating the most confidence in the detection and 0 the least.

Previous MSDN Magazine articles have covered the basics of LUIS in detail. In particular, I encourage you to refer to the article, “Enable Natural Language Interaction with LUIS,” by Ashish Sahu (msdn.com/magazine/mt745095) for additional information about how to get started with LUIS.

This article will focus on two open source tools, Scattertext and LIME, which can help you understand the detection and classification of intents by LUIS. (In what follows, I’ll use detection and classification interchangeably.)

In particular, I’ll show how such tools can be used to shed some light on the classification process and explain why LUIS is uncertain about its intent detection in some cases—typically situations in which the top intents detected for a given utterance have similar confidence scores, for example, a near 50-50 split between two intents. LUIS is more likely to output the wrong intent in such situations.

While LUIS currently supports some troubleshooting capabilities, including active learning to help identify and retrain utterances it’s uncertain about, there are no word-level visualization and analysis tools that can further help resolve such uncertainty. Scattertext and LIME can help in overcoming that limitation.

Now let’s take a look at a simple FinTech case that will serve as a running example. Imagine you work for a bank and you’ve been tasked with understanding user questions that fall into two categories:

  • Questions about their personal bank accounts, such as:
    “What is my savings account balance?”
    “What is the latest transaction in my checking account?”
    “I would like my savings statement to be sent again”
    “Have I received my April salary yet?”
    “When was the last cell phone auto pay processed?”
    “What are annual rates for my savings accounts?”
    “What is the balance in my checking account?”
  • Questions or requests about other banking services, including mortgages, auto loans, and so forth, such as:
    “I would like to get assistance about mortgage rates”
    “Whom can I speak with regarding mortgages?”
    “What is the annual rate for the one-year savings account?”
    “What terms do you offer for mortgages?”
    “Who is responsible for mortgages?”
    “What are annual rates for savings accounts?”
    “How are your mortgage rates compared to other banks?”

The plan is to use LUIS for natural language understanding of the user requests. One way to go about this is to create two intents and train LUIS to detect them.

Let’s call the first category’s intent PersonalAccountsIntent, and the second category’s intent OtherServicesIntent. You can then use the utterance examples previously listed to train LUIS. LUIS will also automatically create a third, “catch-all” intent called None for general utterances, which should be very different from the first two intents. You can also provide additional examples for the None intent.
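Creating the intents in the LUIS portal is the usual route, but if you’d rather script it, the LUIS Authoring API can create them, too. Here’s a minimal sketch, assuming the v2.0 authoring endpoint; the region, app ID, version ID and key below are placeholders you’d replace with your own values:

import requests

# Hedged sketch: create the two intents via the LUIS v2.0 Authoring API.
# The region, app ID, version and subscription key are placeholders.
authoring_base = 'https://westus.api.cognitive.microsoft.com/luis/api/v2.0'
app_id = 'your-app-id'
version_id = '0.1'
headers = {'Ocp-Apim-Subscription-Key': 'your-authoring-key'}

for intent_name in ['PersonalAccountsIntent', 'OtherServicesIntent']:
  r = requests.post(
    '{0}/apps/{1}/versions/{2}/intents'.format(
      authoring_base, app_id, version_id),
    headers=headers, json={'name': intent_name})
  print(intent_name, r.status_code)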

After training, you can publish your model to production. You can also see the utterances along with the confidence scores for the different intents in the LUIS UI, as shown in Figure 1.

Figure 1 PersonalAccountsIntent Utterances with Their Confidence Scores

The dashboard offers some basic summary statistics about the application. If you look at the dashboard in Figure 1, you’ll notice that the lowest confidence score for PersonalAccountsIntent is 0.59, obtained for the utterance, “what are annual rates for my savings accounts?” The confidence score for this utterance to be classified as OtherServicesIntent is pretty close at 0.44. This means that LUIS is not very certain as to how to classify this utterance.

Ideally, you want your intents to be distinguishable from one another with a high degree of certainty, that is, to have one intent with a very high confidence score, while other intents have very low scores. If you revisit the utterance lists for both intents, you’ll see there’s another very similar utterance example (“what is the annual rate for the one-year savings account?”) that’s labeled differently as OtherServicesIntent.

Using this insight, you can fine-tune your utterance samples to use different and distinct words.

Here, I’ve presented seven utterance examples for each intent. But what if there were many more intents (at the time of this writing, LUIS can classify up to 500 different intents) and many more utterance examples for each intent?

Clearly, a more systematic approach is needed to address such a challenge. In what follows, I’ll show how Scattertext and LIME can help.

Understanding Intent Classification Using Scattertext

Scattertext is an open source tool written in Python by Jason Kessler. You’ll find the source code and a tutorial at bit.ly/2G0DLmp, and a paper entitled “Scattertext: a Browser-Based Tool for Visualizing How Corpora Differ,” which explains the tool in detail, at bit.ly/2G05ow6.

Scattertext was conceived as a tool to visualize the differences and similarities between two collections of text articles, also known as corpora, and has various features you may find useful; for example, it also supports emojis.

In this article, I’m going to leverage the tool to produce a visualization of the differences and similarities between the utterance examples for the two intents, PersonalAccountsIntent and OtherServicesIntent.

To install Scattertext, which requires Python version 3, follow the installation instructions in the tutorial. I also recommend you install spaCy, an open source natural language processing library (spacy.io), and Pandas (pandas.pydata.org), another open source library that lets you work with tabular data in-memory.

Now I need to feed the utterance examples into Scattertext. To do that, I’ll create a CSV table with two columns, one for the utterances and the other for the intents. The utterance column will include the utterance examples as one string, separated by the new-line character. (If you’re using Excel, you can use Alt+Enter to enter multiple lines into a single cell.) The intent column will include the labels of the intents, in this case, PersonalAccountsIntent and OtherServicesIntent. So, for this example the result is a 2x2 CSV table.
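If you’d rather build that file in code than in Excel, a minimal sketch with Pandas might look like this (the utterance lists are abbreviated here; the file name matches the one the code in Figure 2 expects):

import pandas as pd

# Join each intent's utterance examples into one newline-separated string.
personal = "\n".join([
  "What is my savings account balance?",
  "What is the latest transaction in my checking account?",
  # ...the remaining PersonalAccountsIntent examples...
])
other = "\n".join([
  "I would like to get assistance about mortgage rates",
  "Whom can I speak with regarding mortgages?",
  # ...the remaining OtherServicesIntent examples...
])

# Two columns (intent, utterance), one row per intent -- the 2x2 table.
df = pd.DataFrame({
  "intent": ["PersonalAccountsIntent", "OtherServicesIntent"],
  "utterance": [personal, other]})
df.to_csv("example.csv", index=False, encoding="utf8")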

You can now use Python to run the code in Figure 2. The code will load the CSV table into a Panda data frame and then hand it over to Scattertext, specifying a few parameters related to categories (the intents) and the output format.

Figure 2 Code for Scattertext Visualization

import scattertext as st
import spacy
import pandas as pd

# Load the 2x2 CSV table of intents and utterance examples.
examples_data_location = 'example.csv'
two_df = pd.read_csv(examples_data_location, encoding='utf8')

# spaCy tokenizes the utterances for Scattertext
# (on older spaCy 1.x releases this was spacy.en.English()).
nlp = spacy.load('en')

corpus = st.CorpusFromPandas(two_df,
                             category_col='intent',
                             text_col='utterance',
                             nlp=nlp).build()

# Render an interactive comparison of the two intents as an HTML page.
html = st.produce_scattertext_explorer(corpus,
  category='PersonalAccountsIntent', category_name='PersonalAccountsIntent',
  not_category_name='OtherServicesIntent', width_in_pixels=1000)
open("MSDN-Visualization.html", 'wb').write(html.encode('utf-8'))

Scattertext will produce an HTML page that includes a visualization showing the top words unique to each intent, as well as those shared by both intents. There’s also a search box that lets you look for particular words, which, if found, are highlighted in the visualization. In a crowded visualization, this can be very useful. Figure 3 shows the Scattertext output for this example.

Figure 3 Scattertext Visualization

Scattertext works by counting word frequencies for each intent’s utterance examples and displaying the words in a way that makes it easier to determine differences and similarities between the intents. At this point, the counts only include one-word expressions (unigrams). However, if you have expressions that include multiple words, such as “auto pay,” you can do some pre-processing to specify what you want. For example, you could represent “auto pay” as “auto_pay.”
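For example, a small pre-processing pass along those lines could look like the following sketch, applied to the two_df data frame from Figure 2 before building the corpus (the expression list is just an illustration; extend it with whatever multi-word terms matter in your domain):

# Hedged sketch: join multi-word expressions with underscores so
# Scattertext counts each of them as a single token.
MULTIWORD_EXPRESSIONS = ["auto pay", "checking account", "savings account"]

def join_multiword(text):
  for expression in MULTIWORD_EXPRESSIONS:
    text = text.replace(expression, expression.replace(" ", "_"))
  return text

two_df["utterance"] = two_df["utterance"].apply(join_multiword)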

The visualization in Figure 3 shows the two intents—OtherServicesIntent on the X axis and PersonalAccountsIntent on the Y axis. Words that appear closer to the bottom right are more likely to appear in utterance examples for OtherServicesIntent, such as “mortgages” and “rates,” while words that appear on the top left are those that are more likely to appear in utterance examples for PersonalAccountsIntent, such as “my” and “account.” Words on the diagonal are likely to appear in utterance examples for both intents, for example, “savings” or “what.”

Learning that certain words appear frequently in both intents’ utterance examples can help you fine-tune the utterance examples to improve classification confidence and accuracy.

One way to do so is to add more distinct words, or even to rephrase those utterance examples of each intent that include words frequent in both, so as to render the intents more distinguishable.

The advantage of using Scattertext is that it’s possible to get value from the tool even for small data sets, such as my toy example with only seven utterance examples for each intent. Clearly, the more utterance examples per intent you have, the more complicated it becomes to find the differences and similarities among them. Scattertext can help you appreciate the differences and similarities in a rapid visual way.

It’s also worth noting that you can use Scattertext in a similar fashion when you have more than two intents by comparing pairs of intents at a time.
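A sketch of that pairwise approach might look like this, assuming an all_df data frame with one row per intent, structured like the table used earlier, and the nlp object from Figure 2:

from itertools import combinations
import scattertext as st

# Hedged sketch: produce one Scattertext HTML page per pair of intents.
# Assumes all_df has 'intent' and 'utterance' columns, one row per intent.
for intent_a, intent_b in combinations(all_df['intent'].unique(), 2):
  pair_df = all_df[all_df['intent'].isin([intent_a, intent_b])]
  corpus = st.CorpusFromPandas(pair_df,
                               category_col='intent',
                               text_col='utterance',
                               nlp=nlp).build()
  html = st.produce_scattertext_explorer(corpus,
    category=intent_a, category_name=intent_a,
    not_category_name=intent_b, width_in_pixels=1000)
  open('scattertext-{0}-vs-{1}.html'.format(intent_a, intent_b),
       'wb').write(html.encode('utf-8'))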

Explaining Intent Classifications Using LIME

Now let’s look at an open source tool called LIME, or Local Interpretable Model-Agnostic Explanation, which allows you to explain intent classification. You’ll find the source code and a tutorial at bit.ly/2I4Mp9z, and an academic research paper entitled, “Why Should I Trust You?: Explaining the Predictions of Any Classifier” (bit.ly/2ocHXKv).

LIME is written in Python and you can follow the installation instructions in the tutorial before running the code in Figure 4.

Figure 4 Using LIME to Analyze Utterances

import requests
import json
from lime.lime_text import LimeTextExplainer
import numpy as np

def call_with_utterance_list(utterance_list):
  # LIME hands over a batch of perturbed utterances; score each of them.
  scores = np.array([call_with_utterance(utterance)
                     for utterance in utterance_list])
  return scores

def call_with_utterance(utterance):
  if utterance is None:
    return np.array([0, 1])
  # Published LUIS endpoint URL; substitute your own, keeping the &q= suffix.
  app_url = 'your_url_here&q='
  r = requests.get(app_url + utterance)
  json_payload = json.loads(r.text)
  intents = json_payload['intents']
  personal_accounts_intent_score = [intent['score'] for intent in intents
    if intent['intent'] == 'PersonalAccountsIntent']
  other_services_intent_score = [intent['score'] for intent in intents
    if intent['intent'] == 'OtherServicesIntent']
  none_intent_score = [intent['score'] for intent in intents
    if intent['intent'] == 'None']
  if len(personal_accounts_intent_score) == 0:
    return np.array([0, 1])
  # Normalize the three LUIS scores so they sum to 1, because LIME
  # expects a probability distribution over the classes.
  normalized_score_denom = (personal_accounts_intent_score[0] +
    other_services_intent_score[0] + none_intent_score[0])
  score = personal_accounts_intent_score[0] / normalized_score_denom
  complement = 1 - score
  return np.array([score, complement])

if __name__ == "__main__":
  explainer = LimeTextExplainer(class_names=['PersonalAcctIntent', 'Others'])
  utterance_to_explain = 'What are annual rates for my savings accounts'
  exp = explainer.explain_instance(utterance_to_explain,
    call_with_utterance_list, num_samples=500)
  exp.save_to_file('lime_output.html')

LIME allows you to explain classifiers for different modalities, including images and text. I’m going to use the text version of LIME, which provides word-level insights about the utterance being classified. While I’m using LUIS as my classifier of choice, a wide range of classifiers can be fed into LIME; they’re essentially treated as black boxes.

The text version of LIME works roughly as follows: It randomly creates multiple modifications or samples of the input utterance by removing any number of words, then calls LUIS on each one of them. The number of samples is controlled by the parameter num_samples, which in Figure 4 is set to 500. For the example utterance, modified utterances can include variations such as “are annual for accounts” and “what annual rates for my savings.”

LIME uses the confidence scores returned from LUIS to fit a linear model that then estimates the effects of single words on classification confidence scores. This estimation helps you identify how the confidence score is likely to change if you were to remove words from the utterance and run the classifier again (as I show later).

The only major requirement for the classifier is to output confidence scores for the classified labels. Confidence scores over the different categories are treated as a probability distribution, and therefore should be in the range of [0,1] and sum to 1. LUIS outputs confidence scores in that range for the defined intents and the additional None intent, but those aren’t guaranteed to sum to 1. Therefore, when using LIME, I’ll normalize the LUIS scores to sum to 1. (This is done in the function call_with_utterance.)

The code listed in Figure 4 uses LIME to produce an explanation about the prediction for the utterance, “what are annual rates for my savings accounts?” It then generates an HTML visualization, which is presented in Figure 5.
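Besides the HTML file, the explanation object can also be inspected programmatically. Here’s a small sketch; note that LIME explains label 1 (the ‘Others’ class here) by default, so the signs of the weights are relative to that class:

# Print the (word, weight) pairs behind the visualization. Positive
# weights push toward the explained label ('Others' by default).
for word, weight in exp.as_list():
  print('{0}: {1:+.2f}'.format(word, weight))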

Figure 5 LIME Output for the “What Are Annual Rates for My Savings Accounts?” Utterance

In Figure 5 you can see the predicted probabilities for the utterance, focused here on PersonalAccountsIntent rather than the two other intents, OtherServicesIntent and None. (Note that the probabilities are very close to, but not exactly the same as, the confidence scores output by LUIS, due to normalization.) You can also see the most significant words for classifying the intent as PersonalAccountsIntent (those are the words on top of the blue bars, which are also highlighted in blue in the utterance text). The weight of each bar indicates the effect on the classification confidence score should the word be removed from the utterance. So, for example, “my” is the word with the most significant effect for detecting the utterance’s intent in this case. If I were to remove it from the utterance, the confidence score would be expected to drop by 0.30, from 0.56 to 0.26. This is an estimation generated by LIME. In fact, when removing the word and feeding the “what are annual rates for savings accounts?” utterance into LUIS, the result is that the confidence score for PersonalAccountsIntent is 0.26 and the intent is now classified as OtherServicesIntent, with a confidence score of about 0.577 (see Figure 6).

Figure 6 Results for the “What Are Annual Rates for My Savings Accounts?” Query

{
  "query": "what are annual rates for savings accounts",
  "topScoringIntent": {
    "intent": "OtherServicesIntent",
    "score": 0.577525139
  },
  "intents": [
    {
      "intent": "OtherServicesIntent",
      "score": 0.577525139
    },
    {
      "intent": "PersonalAccountsIntent",
      "score": 0.267547846
    },
    {
      "intent": "None",
      "score": 0.00754897855
    }
  ],
  "entities": []
}
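To reproduce this check from Python rather than the LUIS UI, you can reuse the call_with_utterance function from Figure 4; recall that it returns the normalized PersonalAccountsIntent score first:

# Re-score the utterance without the word "my" (raw scores in Figure 6).
score, complement = call_with_utterance(
  'what are annual rates for savings accounts')
print(score)  # the normalized PersonalAccountsIntent score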

Other significant words are “accounts” and “savings,” which together with “my” provide similar insights to the ones provided by Scattertext.

Two important words with significant negative weights are “annual” and “rates.” This means that removing them from the utterance would increase the confidence scores for the utterance to be classified as PersonalAccountsIntent. Scattertext showed that “rates” is more common in utterance examples for OtherServicesIntent, so this isn’t a big surprise.

However, there is something new to be learned from LIME—the word “annual” is significant for LUIS in determining that the intent in this case isn’t PersonalAccountsIntent, and removing it is expected to increase the confidence score for PersonalAccountsIntent by 0.27. Indeed, when I remove “annual” before feeding the utterance to LUIS, I get a higher confidence score for the PersonalAccountsIntent intent, namely 0.71 (see Figure 7).

Figure 7 Results for the “What Are Rates for My Savings Accounts?” Query

{
  "query": "what are rates for my savings accounts",
  "topScoringIntent": {
    "intent": "PersonalAccountsIntent",
    "score": 0.71332705
  },
  "intents": [
    {
      "intent": "PersonalAccountsIntent",
      "score": 0.71332705
    },
    {
      "intent": "OtherServicesIntent",
      "score": 0.18973498
    },
    {
      "intent": "None",
      "score": 0.007595492
    }
  ],
  "entities": []
}

In this way, LIME helps you identify significant words that drive classification confidence scores. It can thus provide insights that help you fine-tune your utterance examples to improve intent classification accuracy.

Wrapping Up

I’ve shown that when developing an application based on natural language understanding, intent prediction for some utterances can be rather challenging, and that a better understanding of how to fine-tune utterance examples can improve classification accuracy.

The task of understanding word-level differences and similarities among utterances can yield concrete guidance in the fine-tuning process.

I’ve presented two open source tools, Scattertext and LIME, that provide word-level guidance by identifying significant words that affect intent prediction. Scattertext visualizes differences and similarities of word frequencies in utterance examples, while LIME identifies significant words affecting intent classification confidence scores.

I hope these tools will help you build better NLU-based products using LUIS.


Zvi Topol has been working as a data scientist in various industry verticals, including marketing analytics, media and entertainment and Industrial Internet of Things. He has delivered and led multiple machine learning and analytics projects including natural language and voice interfaces, cognitive search, video analysis, recommender systems and marketing decision support systems. He can be contacted at zvitop@gmail.com.

Thanks to the following Microsoft technical expert who reviewed this article: Ashish Sahu
