Beyond Keywords: Structure and Intelligence from Text
Henry B. Kon, PhD
Summary: This article reviews key concepts in text analytics. Text processing for business intelligence draws on a variety of tools, with many applications that can be of great value. Keyword search is just a start; beyond it, the relevance of a document to a query is based on mathematical measures. (6 printed pages)
Max was on edge. "I've never had an opportunity to affect our business so directly." Contoso Pharmaceuticals was in a bind over "off-label" use of its products—use of medications for conditions other than those approved by regulators, such as for recreational purposes. If not managed, such usage can result in huge liabilities for manufacturers. Contoso needed to adjust to a cutback in its sales force, which left fewer "ears on the ground." It also needed to address criticism from physicians, health plans, and consumer groups frustrated with the aggressive marketing tactics of Contoso and the entire pharmaceutical industry. Contoso's CIO wanted a newsreader that would detect such occurrences by scouring the Web and related content.
Max planned to route RSS content—primarily, industry news and blogs—to expert risk managers, based on a medicine's name or chemical compound appearing in an article. He also planned to implement document storage, text query, and document alert and routing, based on keyword. As soon as he could prove that keyword filtering of news feeds was useful, he would consider more advanced tools and varied sources, such as online illness support groups, teen Web sites and communities, internal e-mails, risk-management reports, and regulatory filings.
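Max's first deliverable can be sketched as a simple keyword router. This is a minimal illustration, not his actual implementation; the medicine names and team labels below are hypothetical placeholders:

```python
import re

# Hypothetical watch list: medicine name -> risk-management team.
WATCH_TERMS = {
    "contosol": "metabolic-risk team",
    "contosinex": "cns-risk team",
}

def route_article(title, body):
    """Return the teams that should see this article, based on a
    case-insensitive, whole-word match of a watched medicine name."""
    text = f"{title} {body}".lower()
    routes = set()
    for term, team in WATCH_TERMS.items():
        if re.search(rf"\b{re.escape(term)}\b", text):
            routes.add(team)
    return sorted(routes)
```

An article mentioning "Contosol" would thus be routed to the metabolic-risk team; one mentioning neither term is dropped from the feed.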
Full-text indexing with keyword search was a good starting point, as Max knew, but he also knew of additional approaches that could bridge into analytic and more automated applications. Max's road to success, however, would have to start with incremental, near-term, and achievable deliverables—leaving the more advanced features for later.
The field of text analytics, closely related to text mining, can largely be broken down into two activities:
- Entity extraction—Identify (with varying degrees of certainty) entities, strings, and sentences within the text—such as dates, events, places, chemical compounds, regular expressions, and natural-language statements—with various resulting reports or alerts.
- Inference—Derive probabilistic statements about the real world from such information.
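The first activity, entity extraction, can be illustrated with regular expressions. The date and dosage patterns below are illustrative only; production systems combine dictionaries, trained taggers, and richer patterns:

```python
import re

# Two toy extraction patterns; real systems would use many more.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "dosage": re.compile(r"\b\d+\s?mg\b", re.IGNORECASE),
}

def extract_entities(text):
    """Return {entity_type: [matched strings]} for each pattern that fires."""
    found = {}
    for label, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[label] = hits
    return found
```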
Natural languages lack structure in the sense that semantics may be hidden or implied, as compared with structured database data in which syntax is restricted and meaning is largely agreed upon in advance. Approaches to text analytics are often statistical, such as counting occurrences and proximity of terms within text. A term can be a phrase, string, or regular expression. Google Trends is a simple example of text analytics. It graphs counts of search-phrase use over time, showing popularity trends and tracking levels of interest, with subtotals by region of the world. This database of world interest serves as an index of human activity.
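A counting-and-proximity measure of the sort described above can be sketched in a few lines. The tokenization here is deliberately crude (whitespace split plus punctuation stripping):

```python
def proximity_count(text, term_a, term_b, window=10):
    """Count pairs of positions at which term_a and term_b occur within
    `window` tokens of each other (a crude co-occurrence signal)."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return sum(1 for a in positions_a for b in positions_b
               if abs(a - b) <= window)
```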
The scientific foundations for text analytics can be found in the field of information retrieval, which ultimately involves various mathematics and philosophies. The basic measure in information retrieval is the relevance of a document to a query, answering the question: What is the end-user value of a document, relative to the user's interest or relative to a topic? Relevance is detected by using a text query or classification algorithm. Term frequency–inverse document frequency (TF-IDF) is the most time-tested measure of relevance in text analytics:
"TF-IDF is a weight...a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus." 
Term frequency and rarity—jointly, a measure of discriminatory power—are essential to determining relevance.
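The definition quoted above translates directly into code. This is a minimal sketch using raw term frequency and a logarithmic inverse document frequency, which is one of several common weighting variants:

```python
import math

def tf_idf(term, doc, corpus):
    """Score `term` for `doc` (a list of tokens) against `corpus`
    (a list of token lists): term frequency times inverse document
    frequency. A term appearing in every document scores zero."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # document frequency
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf
```

A term that appears often in one document but rarely in the corpus gets a high score—exactly the "discriminatory power" described above.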
IBM's open-source project, known as the Unstructured Information Management Architecture (UIMA), is an example of an integration point for analytic technologies. This, according to IBM, is "a scalable and extensible platform for building analytic applications or search solutions that process text or other unstructured information to find the latent meaning, relationships, and relevant facts buried within."
Sometimes, the end result in a text-analytics system is purely factual, statistical information about document-set characteristics, such as, "How often does a medicine's name appear in the same paragraph as the name of an illness?" This is akin to how a database query provides factual information about database content. Of course, making decisions based on such information was the real goal of Max's work.
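The paragraph co-occurrence question above can be answered with a straightforward count. This sketch assumes paragraphs are separated by blank lines and uses naive substring matching:

```python
def cooccurrence_by_paragraph(document, medicines, illnesses):
    """Count paragraphs (blank-line separated) that mention both a
    medicine name and an illness name, per (medicine, illness) pair."""
    counts = {}
    for paragraph in document.lower().split("\n\n"):
        for med in medicines:
            if med in paragraph:
                for ill in illnesses:
                    if ill in paragraph:
                        key = (med, ill)
                        counts[key] = counts.get(key, 0) + 1
    return counts
```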
Max knew that the term space representing conditions such as obesity or psoriasis would have to be articulated. The body's organ systems constitute another conceptual space that would be of interest, relative to these conditions. The set of strings that correspond to the company's products in text would be described via regular expressions. Knowledge acquisition—the process of transforming human knowledge into machine representations—is a large expense in such systems.
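Describing a product's surface forms as a regular expression might look like the following; the brand, Latinized, and compound spellings here are invented for illustration:

```python
import re

# Hypothetical spellings for one product: brand name, Latinized
# variant, and chemical-compound name.
PRODUCT_PATTERN = re.compile(
    r"\b(contosol|contosolum|4-methylcontosine)\b",
    re.IGNORECASE,
)

def mentions_product(text):
    """Return the distinct product spellings found in `text`."""
    return sorted({m.lower() for m in PRODUCT_PATTERN.findall(text)})
```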
As for implementation, some of these types of queries can be executed by using standard database text-processing functions—without requiring external indexing and search systems. Some databases include multilingual keyword-indexing schemes and other algorithms for approximate matching. A synonym table allows multiple term representations to be treated as identical—enabling Boolean full-text search with thesaurus support. Sometimes, more advanced inferences are of interest.
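A synonym table of the kind described might be applied at query time along these lines; the medical synonyms below are illustrative:

```python
# Hypothetical synonym table: surface form -> canonical term.
SYNONYMS = {
    "overweight": "obesity",
    "adiposity": "obesity",
    "heart attack": "myocardial infarction",
}

def expand_query(terms):
    """Expand a query so that any synonym also matches its canonical
    term and every other surface form sharing that canonical term."""
    canon = {SYNONYMS.get(t, t) for t in terms}
    expanded = set(canon)
    for surface, canonical in SYNONYMS.items():
        if canonical in canon:
            expanded.add(surface)
    return sorted(expanded)
```

A search for "overweight" would then also retrieve documents mentioning "obesity" or "adiposity."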
Max's end vision included the ability—for any encountered article or document—to have the software "decide" whether the text is discussing, promoting, or discouraging the use of a given medicine, relative to a given illness; to assign a relevance score; and even to assign a relative likelihood of such an assertion. To do this, he would have to implement a knowledge schema that mapped his concept space to an appropriate machine representation. Knowledge schemas can be encoded by using knowledge-representation schemes that model predicates, concepts, type hierarchies, taxonomies, and logical inference. A semantic representation scheme is a framework in which to store base, intermediate, and final forms of knowledge. Several rich knowledge-representation schemes for semantics and semantic connectivity on the Web are emerging, such as W3C's OWL and DARPA's DAML. Max kept this longer-term vision in his back pocket for a little while longer.
A taxonomy is an often-hierarchical decomposition of a term or concept space. Taxonomies offer indexing schemes by which to organize document sets. Practical uses for taxonomies include navigation and categorization. These can be complemented by controlled dictionaries and thesauri. A lot of effort—years, for a large term and concept space—might be required to articulate a taxonomy manually. The cost of developing one might be as high as several hundred dollars per node. This seems reasonable, considering the involvement of IT staff, corporate librarians, departmental publishers, commercial-information providers, and standards bodies. As a result, taxonomies are often implemented as shallow hierarchies using a string (for instance, "obesity") as the terminal concept representation (no concept articulation beyond the term itself). In fact, a concept may be a richly multifaceted structure including a vector of facets, weights, and links within a parent-child hierarchy or other relational network or inference chain. Leveraging existing works and using off-the-shelf taxonomies are ways to reduce these costs.
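A shallow taxonomy of the kind described can be represented as a parent-to-children map, with retrieval by walking the hierarchy. The node names are invented for illustration:

```python
# A shallow, hypothetical condition taxonomy: node -> children.
TAXONOMY = {
    "condition": ["metabolic", "dermatologic"],
    "metabolic": ["obesity", "diabetes"],
    "dermatologic": ["psoriasis", "eczema"],
}

def descendants(node):
    """All terms at or below `node` in the hierarchy, including itself."""
    result = [node]
    for child in TAXONOMY.get(node, []):
        result.extend(descendants(child))
    return result
```

A document tagged "psoriasis" can then be retrieved by a query against the broader "dermatologic" or "condition" nodes.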
Science aside, Max's CIO warned him that the politics of this project could be an issue. Contoso's risk managers could become concerned that their jobs would be taken away.
Max considered this knowledge-gathering and tracking system an enabler in an unnecessarily labor-intensive process. He was excited by the potential of advanced methods to automate tasks that required so much manual effort. But Max knew that no matter how good his solution, if people didn't use it, it wouldn't have any value. So, he started humbly by automating processes that risk managers had already been performing, such as keyword searching of the Web through public search engines—the output of which he scraped, and then indexed into a document archive within the company's document-management system. He shared this infrastructure across the risk-management groups for each of the company's products and geographic markets.
Today, Max continues to test more advanced text-analytics features, while incorporating continuous user feedback and reality checks into the system that he currently evolves. He and the CIO are convinced that automation along these lines is critical to their long-term success.
This article has reviewed key concepts in text analytics. Text processing for business intelligence involves a variety of tools, with many applications that can be of great value. Beyond keyword search, the relevance of a document to a query is based on mathematical measures, such as TF-IDF. Such measures form the foundation of a variety of filtering, routing, alerting, and retrieval tasks. This broad area of activity is largely covered by the field of information retrieval.
When building a system along these lines, plan on realizable objectives as near-term deliverables, while testing advanced features for future implementation. A team of two developers might spend a year building Web-scraping tools and search-engine and news-feed interface tools, automating and enabling some of the manual search and document alerting that knowledge workers had performed historically. Do not underestimate the effort involved in human-knowledge acquisition; it can be significant if "homegrown" document indexing and organization schemes are sought. Several questions are worth considering:
- What are the sources and sizes of document sets and streams to be analyzed? How large is the topic space or knowledge schema? How dynamic is it?
- Will there be any formal knowledge-representation or document-organization schemes? Will these be homegrown and stored in the database? Will more formal knowledge-representation schemes be considered? What repository will store derived information?
- Are there experts who will read, validate, refine, and calibrate various intelligence-gathering and inference processes over time? Are these experts threatened by the use of an automated system?
- Should the system process be primarily automatic and self-sustaining? If not, will responsibility be centralized or distributed?
- Will the end results of your queries be knowledge about the states of documents (term frequency and trends, regular-expression conformance, term proximity and co-occurrence counts, vector representations of author keywords, and so on)? Alternatively, will the end results be concerned with probabilistic states of the world, as inferred from the documents (for example, the likelihood that a given off-label usage will result in injury or penalty to Contoso)?
- Which is desired: a point solution or a more general infrastructure? What is a realistic end goal by which to define project success?
Sources
- DAML.org. Project home page. (Accessed January 2, 2007.)
- Google Trends. Online data source. (Accessed December 13, 2006.)
- Information retrieval. Wikipedia article. (Accessed December 15, 2006.)
- OWL Web Ontology Language Overview. OWL introduction page. (Accessed January 2, 2007.)
- Taxonomy Warehouse. Commercial site. (Accessed January 5, 2007.)
- Text analytics. Wikipedia article. (Accessed December 12, 2006.)
- TF-IDF. Wikipedia article. (Accessed December 20, 2006.)
- UIMA Java Framework. Project home page. (Accessed December 20, 2006.)
About the author
Henry B. Kon, PhD, is cofounder of Intellisophic, Inc., a publisher of taxonomies derived from reference content. Henry has experience in advanced software systems and has done research under ARPA, NSF, and corporate sponsors in heterogeneous database integration, ontologies, and data quality. He has published, and holds patents, in these areas. His doctoral thesis, nicknamed "Good Answers from Bad Data," provided mathematics for calculating error terms in query outputs. Henry can be reached at email@example.com.
This article was published in Skyscrapr, an online resource provided by Microsoft. To learn more about architecture and the architectural perspective, please visit skyscrapr.net.