Named Entity Recognition
Updated: February 23, 2017
Recognizes named entities in a text column
Category: Text Analytics
You can use the Named Entity Recognition module to identify the names of things, such as people, companies, or locations, in a column of text. The module then labels the sequences of words in which these entities were found, so that you can use the terms in further analysis.
Named entity recognition is an important area of research in machine learning and natural language processing (NLP), because it can be used to answer many real-world questions, such as:
- Does a tweet contain the name of a person? Does the tweet also provide that person's current location?
- Which companies were mentioned in a news article?
- Were specified products mentioned in complaints?
To get a list of named entities, you provide a dataset as input that contains a text column. The Named Entity Recognition module will then identify three types of entities: people (PER), locations (LOC), and organizations (ORG).
For example, the following table shows a simple input sentence, and the terms and values generated by the module:
| Input text | Module output |
|---|---|
| “Boston is a great place to live.” | 0,Boston,0,6,LOC |
The output can be interpreted as follows:
- The first ‘0’ indicates that this string is the first article input to the module. Because a single article can contain multiple entities, including the article row number in the output is important for mapping entities back to articles.
- ‘Boston’ is the recognized entity.
- The ‘0’ that follows ‘Boston’ indicates that the entity starts at the first character of the input string. (Indices are zero-based.)
- The ‘6’ indicates that the entity ‘Boston’ is 6 characters long.
- ‘LOC’ indicates that ‘Boston’ was recognized as a location. The other supported named entity types are person (PER) and organization (ORG).
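If you work with this output outside of the module (for example, after downloading it as text), each entity record can be split into these five fields. The following R sketch is illustrative only; the field names used here (article_id, entity, offset, length, type) are chosen for readability and are not names produced by the module itself.

```r
# Illustrative only: split one entity record into its five fields.
# The field names below are placeholders chosen for readability;
# the module's own output columns may be named differently.
record <- "0,Boston,0,6,LOC"
fields <- strsplit(record, ",", fixed = TRUE)[[1]]

entity <- list(
  article_id = as.integer(fields[1]),  # zero-based row number of the input article
  entity     = fields[2],              # the recognized entity text
  offset     = as.integer(fields[3]),  # zero-based character offset within the article
  length     = as.integer(fields[4]),  # length of the entity, in characters
  type       = fields[5]               # PER, LOC, or ORG
)
```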
How to configure Named Entity Recognition

1. Add the Named Entity Recognition module to your experiment.
2. Connect the following inputs:
   - Story. A source of text from which to extract named entities. This input is required. The column used as Story should contain multiple rows, where each row consists of a string. The string can be short, like a sentence, or long, like a news article.

     You can connect any dataset that contains a text column. However, if the input dataset contains multiple columns, use Select Columns in Dataset to choose only the column that contains the text you want to analyze.
   - Custom Resources (Zip). An optional set of linguistic resources in ZIP file format.

     Currently this input is disabled. In future versions of this module, the right-hand input port will accept custom resource files for identifying different entity types.
3. Run the experiment.
Results
The module outputs a dataset containing a row for each entity that was recognized, together with the offsets.
Because each row of input text might contain multiple named entities, an article ID number will be automatically generated and included in the output, to identify the input row that contained the named entity. The article ID is based on the natural order of the rows in the input dataset.
You can convert this to CSV for download or save it as a dataset for re-use.
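Within the experiment, one way to relate the entity output back to the original text is to join on that article ID. The sketch below assumes an Execute R Script module with the original dataset on the first input port and the Named Entity Recognition output on the second; the column name Article_ID is a placeholder and may not match the actual column name in the entity output.

```r
# Illustrative sketch: join recognized entities back to their source articles.
# Assumes port 1 = original text dataset, port 2 = Named Entity Recognition output.
# The column name "Article_ID" is a placeholder; check the actual column name
# in the entity output and rename accordingly before merging.
articles <- maml.mapInputPort(1)  # data.frame containing the text column
entities <- maml.mapInputPort(2)  # data.frame with one row per recognized entity

# Recreate the zero-based article ID from the natural row order of the input
articles$Article_ID <- seq_len(nrow(articles)) - 1

# Join each entity to the article in which it was found
joined <- merge(articles, entities, by = "Article_ID")

maml.mapOutputPort("joined")
```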
If you publish a web service from Azure Machine Learning Studio and want to consume the web service by using C#, Python, or another language such as R, you must first implement the service code provided on the help page of the web service.
If your web service provides multiple rows of output, the URL of the web service that you add to your C#, Python, or R code should have the suffix scoremultirow instead of score.
For example, assume you use the following URL for your web service:
https://ussouthcentral.services.azureml.net/workspaces/<workspace id>/services/<service id>/score
To enable multi-row output, change the URL to:
https://ussouthcentral.services.azureml.net/workspaces/<workspace id>/services/<service id>/scoremultirow
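As a rough illustration, the following R sketch posts one row of text to such an endpoint by using the httr package. It only sketches the general request pattern; the input name (input1), column name (Text), and API key shown here are placeholders, so take the exact request format and credentials from the sample code on the web service help page.

```r
# Illustrative sketch only. The input name ("input1"), column name ("Text"),
# and API key are placeholders; copy the exact request format from the
# web service help page.
library(httr)
library(jsonlite)

url     <- "https://ussouthcentral.services.azureml.net/workspaces/<workspace id>/services/<service id>/scoremultirow"
api_key <- "<your API key>"

body <- list(
  Inputs = list(
    input1 = list(
      ColumnNames = list("Text"),
      Values      = list(list("Microsoft has two office locations in Boston."))
    )
  ),
  GlobalParameters = setNames(list(), character(0))  # serialized as {}
)

response <- POST(
  url,
  add_headers(Authorization = paste("Bearer", api_key)),
  content_type_json(),
  body = toJSON(body, auto_unbox = TRUE)
)

content(response, as = "text")  # raw JSON returned by the web service
```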
To publish this web service, you should add an additional Execute R Script module after the Named Entity Recognition module to transform the multi-row output into a single row delimited with semicolons (;).

Consolidating the multiple rows of output into a single row makes it possible to return multiple entities per input row. For example, assume an input sentence contains two named entities. Rather than returning two output rows for that one input row, the web service can return a single row that contains both entities, separated by semicolons, as shown here:
| Input Text | Output of Web Service |
|---|---|
| Microsoft has two office locations in Boston. | 0,Microsoft,0,9,ORG,;,0,Boston,38,6,LOC,; |
The following code sample demonstrates how to do this:
The R script is as follows:

# Map 1-based optional input ports to variables
d <- maml.mapInputPort(1) # class: data.frame

y <- length(d)   # number of columns in the entity output
x <- dim(d)[1]   # number of rows (one per recognized entity)

# Build a single-row matrix wide enough to hold every value,
# plus one extra cell per entity row for the ";" delimiter
longd <- matrix("NA", nrow = 1, ncol = x * (y + 1))

for (i in 1:x) {
  # Copy the y values of entity row i into consecutive cells
  for (j in 1:y) {
    longd[1, j + (i - 1) * (y + 1)] <- toString(d[i, j])
  }
  # Append the ";" delimiter after each entity row
  longd[1, y + (i - 1) * (y + 1) + 1] <- ";"
}

final_output <- as.data.frame(longd)

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("final_output");
This blog provides an extended explanation of how named entity recognition works, its background, and possible applications:
See the following sample experiments in the Cortana Analytics Gallery for demonstrations of how to use text classification methods commonly used in machine learning:
- The News Categorization sample uses feature hashing to classify articles into a predefined list of categories.
- The Similar Companies sample uses the text of Wikipedia articles to categorize companies.
- Text Classification, Step 1 of 5: Data preparation. In this five-part walkthrough of text classification, text from Twitter messages is used to perform sentiment analysis. A variety of text pre-processing techniques are also demonstrated.
Language Support
Currently, the Named Entity Recognition module supports only English text. It can detect organization names, personal names, and locations in English sentences. If you use the module on other languages, you might not get an error, but the results will not be as good as for English text.
In future versions, support for additional languages might be enabled by integrating the multilingual components provided in the Office Natural Language Toolkit.
Custom Resources
In future versions of this module, the right-hand input port will accept custom resource files for identifying different entity types.
This section will describe the formats and requirements of those resources.
Expected inputs

| Name | Type | Description |
|---|---|---|
| Story | Data Table | An input dataset (DataTable) that contains the text column you want to analyze. |
| CustomResources | Zip | (Optional) A file in ZIP format that contains additional custom resources. This option is not currently available; it is provided for forward compatibility only. |

Outputs

| Name | Type | Description |
|---|---|---|
| Entities | Data Table | A list of character offsets and entities |
See also

- Text Analytics
- Feature Hashing
- Score Vowpal Wabbit 7-4 Model
- Train Vowpal Wabbit 7-4 Model
- A-Z Module List