Applying Topic Maps to Applications
by Kal Ahmed and Graham Moore
Summary: Topic maps provide a fairly simple but very powerful metamodel for the representation of knowledge models. In "An Introduction to Topic Maps" (The Architecture Journal, No. 5, 2005), we presented how a combination of topic maps and architecture design patterns allows developers to build reusable components for applications. Now we'll take a look at some of the current and potential application areas for topic maps. As topic maps are primarily of use when integrated with other systems, we'll also discuss access to topic map information using a variety of Web services architectures, from a traditional RPC-like client/server interaction to syndication models in which either a push model or a pull model can be used to interchange updates to a topic map.
Benefits for Publishers
A Simple Topic Map Web Service
Identification and URIs
Topic Map Serialization
About the Authors
The principal purpose of topic maps is to allow the expression of a domain knowledge model and to enable that knowledge model to be connected to related resources. Within this broad remit we can identify several common applications for topic maps in an enterprise.
Most organizations are now publishers of resources to some extent. For some, publishing information is the core business, and for most other organizations the information they publish is a key part of their communication to their customers and partners. Publishers of information face a number of challenges that can be addressed with topic maps.
For a large corpus, the search engine is often the only way for newcomers to find what they are looking for. Traditionally, search has been driven by content keywords or by full-text indexing. Topic maps offer the alternative of indexing and searching against topic names, and then using topic occurrences to present links to all content related to the topics found by the search. Each topic in a topic map represents a single concept but can be assigned multiple names, allowing the topic map to store scientific and common-usage names, common misspellings, or translations for concept names. In addition, a topic map may also store semantic relationships between terms that can inform a search in several ways:
- This semantic information could be used to provide alternate search term suggestions to a user, or even to transparently expand the entered search term. For example, with the appropriate associations a topic map could expand a search term such as "Georgian" into "18th Century AND England."
- Topic typing or topic associations could be used to disambiguate distinct concepts with the same term (homonyms) based on their context in the topic map.
- After executing a search that returns multiple resources, those resources could then be grouped according to their classification in the topic map, allowing users to more clearly see the different ways in which their search phrase has been satisfied.
One distinct advantage to search optimization is that a search facility that only queries a topic map is significantly easier to tune. For example, if an electrical retail site discovers that a new search term "PVR" has become popular among customers seeking hard disc video recorders, the topic map that drives site search could be modified to add the term "PVR" to the topic for "Hard Disc Video Recorders." The content connected to that topic would not have to be modified at all; the search term "PVR" will now hit the "Hard Disc Video Recorders" topic and result in the related resources being returned.
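The tuning step described above can be illustrated with a minimal sketch. It assumes a toy in-memory model in which a topic is just a set of names plus a list of resource URLs; the `Topic` class, the `search` function, and the sample URLs are hypothetical and do not come from any particular topic map engine.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A single concept with any number of names and linked resources."""
    topic_id: str
    names: set = field(default_factory=set)
    occurrences: list = field(default_factory=list)  # resource URLs


def search(topics, term):
    """Return the resources of every topic that has a name matching the term."""
    wanted = term.casefold()
    hits = []
    for topic in topics:
        if any(name.casefold() == wanted for name in topic.names):
            hits.extend(topic.occurrences)
    return hits


# Hypothetical sample data for the electrical retail example.
recorders = Topic(
    topic_id="t-hdvr",
    names={"Hard Disc Video Recorders"},
    occurrences=["/products/hdvr-100.html", "/guides/choosing-a-recorder.html"],
)
catalogue = [recorders]

before = search(catalogue, "PVR")   # no topic carries this name yet
recorders.names.add("PVR")          # tune the search by editing the topic only
after = search(catalogue, "pvr")    # the unchanged content is now reachable
```

Only the topic changes between the two searches; the resources and their indexing are untouched, which is the point the paragraph above makes.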
Link management. One of the principal ways of keeping a user on a site is to present related links that the user can follow to find more information on a subject. Maintaining such links manually is error-prone and requires a trade-off between constantly updating the links and accepting that older content will have equally aged links to related content. The nature of topic maps as an index of resources enables these sorts of links to be managed almost automatically.
There are many different approaches to presenting related links using a topic map, but in essence they all consist of three steps: traversing from the resource to the point(s) in the topic map where that resource is referenced, traversing the topic map in some way, and extracting the list of resources that are referenced from the end of the traversal. Managing related links in this dynamic manner has three distinct advantages: related links are always up to date; the list of related links can be generated based on the indexing of previous content; and the logic for extracting related links can be modified without the need to modify either the resources or their indexing.
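The traversal described above might be sketched as follows. This is an illustration under stated assumptions, not a prescribed algorithm: occurrences are held in a plain dictionary, associations are undirected pairs, and the traversal rule is simply "follow up to `max_hops` associations". All names and sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: the resources each topic indexes (its occurrences).
occurrences = {
    "acme-corp": ["/reports/acme-q1.html", "/reports/acme-q2.html"],
    "widgets-sector": ["/analysis/widgets-outlook.html"],
    "globex-corp": ["/reports/globex-q1.html"],
}
# Hypothetical associations between topics, stored as undirected pairs.
associations = [("acme-corp", "widgets-sector"), ("globex-corp", "widgets-sector")]


def related_links(resource, max_hops=1):
    """List the resources reachable from the topics that index `resource`."""
    neighbours = defaultdict(set)
    for a, b in associations:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Step 1: find the topics where the resource is referenced.
    frontier = {t for t, urls in occurrences.items() if resource in urls}
    reached = set(frontier)
    # Step 2: traverse associations outward from those topics.
    for _ in range(max_hops):
        frontier = {n for t in frontier for n in neighbours[t]} - reached
        reached |= frontier
    # Step 3: collect the resources referenced from the topics reached.
    links = {url for t in reached for url in occurrences.get(t, [])}
    links.discard(resource)
    return sorted(links)


one_hop = related_links("/reports/acme-q1.html")
two_hops = related_links("/reports/acme-q1.html", max_hops=2)
```

Changing `max_hops`, or swapping in a different traversal rule, alters the related links without touching either the resources or their indexing.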
Support multiple routes to content. As discussed in our previous article, topic maps can be used to model many of the traditional content indexing and finding aids such as hierarchical and faceted classification. In fact, a single topic map can contain many such indexes. Creative use of multiple indexes can allow a user to find content from many different entry points. For example, on a site publishing financial analysis, a user might find a particular report by a drill down through market sectors to a company and then to the report, or by geographical region to a country related to the story, or even by traversing personalized topics that represent his or her own portfolio or interests.
Serving multiple audiences. Topic maps provide great flexibility to those organizations that need to provide access to different levels of user. In the first instance, the concepts can be given names familiar to each audience. For example, on a site giving information about drugs, the topic map could provide the clinical name of a drug for doctors and the trade name for patients as different names on the same topic. Going further, the topic map can contain both full and simplified domain models that are combined and can optionally overlap. The principal feature of topic maps to support this requirement is the use of scope to specify the context in which a given association between topics is to be presented to a user.
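As a rough illustration of scope, the sketch below filters names and associations by audience. It assumes a simplified rule (an item is shown if it is unscoped or if its scope contains the audience); the `Scoped` class, the `visible` helper, and the sample statements are hypothetical, and real topic map engines may apply richer scope-matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scoped:
    """A statement (a name or an association label) valid in a given scope."""
    value: str
    scope: frozenset = frozenset()  # empty scope: valid for every audience


def visible(items, audience):
    """Keep items that are unscoped or whose scope includes the audience."""
    return [i.value for i in items if not i.scope or audience in i.scope]


# Hypothetical sample data for one drug topic.
names = [
    Scoped("acetylsalicylic acid", frozenset({"doctor"})),
    Scoped("aspirin", frozenset({"patient"})),
]
associations = [
    Scoped("treats: mild pain"),
    Scoped("inhibits: cyclooxygenase", frozenset({"doctor"})),
]

doctor_view = visible(names, "doctor") + visible(associations, "doctor")
patient_view = visible(names, "patient") + visible(associations, "patient")
```

Both views are drawn from the same topic, so the full and simplified models stay in one place and overlap wherever a statement is left unscoped.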
The key architectural decision in implementing topic maps as part of a publishing solution is how the topic map system is integrated with the content management system. Integration needs to be considered from two aspects: integrating content indexing and content creation, and publishing topic map information with content.
While the creation of a domain knowledge model might be in the purview of a librarian or small set of domain experts, the publishing solution should allow for the classification of content against the domain model to be performed by those responsible for the content, either authors or content editors. Ideally, classifying and indexing the content against the topic map should be made a required part of the content creation/approval life cycle, with classification reviewed as part of the editorial process. A well-designed application will also make use of patterns such as the patterns we discussed in "An Introduction to Topic Maps" (see Resources).
Patterns enable a librarian to make changes to the domain model that reflect changes in the structure or focus of the content without requiring any downstream changes to the management or presentation code. For example, the patterns for hierarchical and faceted classification schemes allow new hierarchies and new classification facets to be introduced in the topic map and to be recognized and displayed automatically by the presentation layer.
Figure 1. Sample architecture for a topic maps-driven Web site
Figure 1 shows a simplified block architecture for a Web site that makes use of a topic map. The topic map is managed by a topic map engine component that is populated by the information architect and the content creator roles. When publishing the content there are two possible approaches to the integration of information from the topic map. In the first approach, the content management system (CMS) provides the site structure and one or more regions on a page into which information from the topic map can be added. In the second approach a part of the topic map domain model is used to drive the site.
Figure 2. Sample architecture for topic maps on the intranet
The former approach is most appropriate with CMSs that have strong site-management features or that make use of site structure to implement access controls or other features. The latter approach, however, can be used to create a flexible site structure that can be modified more easily to take account of changes in the way that content indexes are organized. In addition to structuring the content in the CMS, the topic map engine can serve as a useful index of all content available through the Web site. This index can be made available through a Web services interface to rich clients such as an RSS reader or the Research Service in Microsoft Office (see Figure 1).
Topic maps can support a corporate knowledge management application principally by providing a repository for capturing the domain knowledge model. The flexible metamodel provided by topic maps allows new concept types and relationship types to be introduced into the knowledge model with minimal effort, enabling the knowledge model to keep pace with changes in the business.
Figure 3. Three enterprise information integration architectures
In addition to maintaining the knowledge model for the domain, the topic map can also be used to index resources on the intranet. In this respect, the topic map provides similar benefits to those described for general resource publishing, but with the development of collaboration systems such as SharePoint, the extent of what is available through the intranet is rapidly increasing. These systems make it very easy for users to create new intranet content and to share content related to a project, but this sharing can quickly lead to an intranet so overloaded with content that relevant information is difficult to find and a newcomer finds it almost impossible to get oriented. An index of the intranet content based on topic maps can help experienced users find relevant content regardless of its location, and the high-level domain model embodied by the topic map can also give newcomers an overview of the activities of the organization.
A significant part of the problem for corporate knowledge management is that users have to work with many tasks that do not have a one-to-one correlation to a piece of content. For example, a user may work on a case or be part of a project or on a cross-functional team or take part in a meeting. All of these tasks may have related content—case notes, project documentation, meeting minutes, and so on—but very often they have no real identity themselves. While keywords can be used to connect content items to such concepts, a simple keyword tells users nothing about what makes the content relevant to the keyword or about relationships among keywords.
A topic map provides a model for defining noncontent items such as people, projects, and places as topics. Once these topics are created, they can then be connected to content items. Topic maps provide two ways of connecting to content. The first is by creating a second topic that represents the content and then using an association. The second is by using an occurrence that points directly to the content items. The model can also be used to lift important relationships among noncontent items out of content into the topic map to provide an overview of the functions of the organization.
For example, a project may have many participants and may be carried out for a particular customer. By making the participants and the customer into topics in the domain model, it would then be possible to quickly locate other projects for the same customer, or to list the products that have been licensed to the customer, or to find the other project commitments of a team member.
A topic map can be used as a knowledgebase by itself. A simple application would be a Web portal with an interface that allows users to create their own topics and associations and to browse the topics and associations created by others. Users could also add links to internal or external resources as occurrences, or the interface could allow users to create or upload content to the portal itself. Indeed, these are all common features of many generic topic map editing applications. However, the most benefit comes when the topic map is integrated into a collaboration system that can provide content management, event management, discussion forums, and other features. As with integration for resource publication, a successful implementation requires that the topic map functionality should be made part of the normal interaction with the collaboration system.
Categorizing content is always going to be seen by some users as an additional burden. This burden can be reduced by using schema-driven classification to minimize the decisions that users need to make about the classification of items, and by making use of the topology of the collaboration system (for example, all documents placed in the folder "Project X" get tagged automatically as being related to the project "Project X"). At the other side of the equation, just a minimal amount of classification can quickly generate benefits for users by bringing otherwise disparate content together, and just a single person working to categorize the content produced by their project or department can result in a better knowledgebase for all users. The inherently flexible nature of ontology based on topic maps makes it very easy to start small, focus on the core ontology needed to address the requirements of just that project or department, and then to subsequently scale up as the project shows a return on investment.
Having created topics and categorized content, the final aspect to the use of topic maps in a corporate knowledge management scenario is access to the topic map information. This aspect is the area where a topic map can really shine. The topic map can provide a structured, high-level domain model that is easily communicated over a Web services interface, giving huge potential for integration into desktop applications. For example, we have recently developed a Web service that implements the Microsoft Office Research Service interface to provide a user with the ability to search and query a topic map from within Internet Explorer and their Office applications.
Figure 2 shows a possible architecture for such a knowledgebase application. The information architect sets up the basic taxonomy and classification schemes for the knowledgebase. The users then work with the collaboration server and populate and extend this taxonomy through browser or rich-client interfaces.
Similar integration possibilities are available using Smart Tags and more detailed service interfaces such as that described later. These integrations allow the information in a topic map to be delivered right to the application in which they are most useful. In addition, the domain model for internal knowledge management can also be used as the basis for integrating data from other enterprise systems as we will describe shortly.
Topic maps are most frequently used as an adjunct to CMSs, which is not surprising given the features that they bring in terms of content organization, indexing, and searching. However, topic maps can also be used as the foundation for integrating all sorts of data sources.
The key to enterprise information integration using topic maps is to define a core ontology and then to map data from each data system into that ontology. For example, an ontology might contain the concepts of a Customer and an Order, but it is the Customer Relationship Management (CRM) system that contains information about the customer's calls to the help desk and the order-tracking system that contains information about the status of the customer's orders. An approach to integrating these systems based on topic maps could be either centralized or distributed. These different approaches are shown in Figure 3.
In a centralized system, each data system provides information about the entities that it manages to a central topic map. The information may be published by a push from the data system (see part a in Figure 3) or updated using a pull from the central topic map management application (see part b in Figure 3). Other applications can then consume that centralized information as topics, associations, and occurrences, thus shielding consumer applications from a need to know the interfaces required to communicate with each external system. In these systems data replication needs to be handled with care to ensure that it is clear where the master for any data item resides.
While it is possible for applications to generate topic map data directly, the range of applications that can create or consume such data is limited. The idea of the Web service is to open up topic maps to more applications, regardless of whether they are publishing or querying information. In addition, a well-defined Web service brings a valuable commonality to the way in which knowledge services interact.
The Web service presented here comprises a small set of generic methods so that it can be used in a wide variety of topic map solutions. We briefly describe the Web service methods and provide commentary on their applicability and intended use. This interface is implemented by Networked Planet's Topic Map Web Service WSDL (see Resources), which also implements two further methods that allow more direct access to any hierarchies contained in the topic map.
GetTopicMapNames()—A topic map system can store and manage many topic maps. This operation returns a list of names of the topic maps currently available. In addition, it is possible to have this service aggregate or act as a broker over multiple distributed topic maps. This method provides a way to find out which topic maps are available.
GetTopic(TopicMapName, TopicId)—This operation returns a serialization of the specified topic from the named topic map. The serialization provides the caller with enough information that it can traverse to related topics.
GetTopicTypes(TopicMapName)—This operation returns a serialization of a subgraph of the topic map in which each topic that serves as a type of one or more topics in the specified topic map is fully serialized.
GetTopicBySubjectIdentifier(TopicMapName, Identifier)—This operation returns a subgraph of the specified topic map, in which the topic with the specified identifier as its subject identifier is fully serialized.
GetTopicsByType(TopicMapName, TopicTypeId)—This operation allows users to fetch a list of all the topics that are instances of the given type. The return value is a subgraph of the topic map with each instance topic fully serialized.
DeleteTopics(TopicMapName, TopicId)—This operation specifies the unique topic identifier of one or more topics to be deleted from the specified topic map.
SaveTopic(TopicMapName, TopicMapFragment)—This operation can be used to both update an existing topic and add new topics. The topic map data contained in the TopicMapFragment parameter acts as the definitive version of topics and associations. To replace an existing topic or association, that construct in the TopicMapFragment parameter must carry the unique identifier of the construct to be replaced. Any topic or association in the TopicMapFragment parameter that does not carry a unique identifier is added as new data to the topic map.
Query(QueryName, QueryParams)—The Web service does not allow the execution of arbitrary queries against the topic maps. Instead, the administrator or developer may configure any number of named queries that are accessible through the Web service API. This operation invokes the named query, passing in parameters that are used as direct value replacements in the query string. The structure of the XML returned from this method depends on the results table structure returned by the query.
The key feature of this interface is that it is almost entirely topic centric. All operations are on and return topics or lists of topics. Using the serialization techniques described here, it is possible to implement this Web service using a document-centric approach with XML data that is easily processed using XML data binding or XPath toolsets.
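To make the topic-centric shape of the interface concrete, here is a minimal in-memory sketch of several of the operations. It is a toy stand-in and not the Networked Planet WSDL: the class name, the snake_case method names, the dictionary representation of a topic, and the generated identifiers are all assumptions made for illustration.

```python
import itertools


class InMemoryTopicMapService:
    """Toy stand-in for a topic-centric service (not the real interface)."""

    def __init__(self):
        self._maps = {}                 # topic map name -> {topic id -> topic}
        self._ids = itertools.count(1)  # source of new topic identifiers

    def get_topic_map_names(self):
        return sorted(self._maps)

    def save_topic(self, topic_map_name, topic):
        """Replace the topic if it carries an id, otherwise add it as new."""
        topics = self._maps.setdefault(topic_map_name, {})
        topic = dict(topic)
        if "id" not in topic:
            topic["id"] = f"t{next(self._ids)}"
        topics[topic["id"]] = topic
        return topic["id"]

    def get_topic(self, topic_map_name, topic_id):
        return self._maps[topic_map_name][topic_id]

    def get_topic_by_subject_identifier(self, topic_map_name, identifier):
        for topic in self._maps[topic_map_name].values():
            if identifier in topic.get("subject_identifiers", ()):
                return topic
        return None

    def get_topics_by_type(self, topic_map_name, type_id):
        topics = self._maps[topic_map_name].values()
        return [t for t in topics if type_id in t.get("types", ())]

    def delete_topics(self, topic_map_name, topic_ids):
        for topic_id in topic_ids:
            self._maps[topic_map_name].pop(topic_id, None)


FRED = "http://example.org/people/fred-jones"  # hypothetical subject identifier
service = InMemoryTopicMapService()
person = service.save_topic("demo", {"names": ["Person"]})
fred = service.save_topic(
    "demo", {"names": ["Fred Jones"], "types": [person], "subject_identifiers": [FRED]}
)
# Saving again with the identifier replaces the stored topic.
service.save_topic(
    "demo",
    {"id": fred, "names": ["Fred Jones", "F. Jones"], "types": [person],
     "subject_identifiers": [FRED]},
)
```

Every call takes or returns topics, which is what lets the same small surface be mapped onto SOAP, REST, or any other transport.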
In a distributed system, each data system has an integration component that exposes a topic map interface, and consumers contact the systems through that interface (see part c in Figure 3). Again, consumers are only required to understand one interface, but in this case calls to the topic map interface could be translated directly into queries and updates against the underlying information system, which means that there is no data replication issue but can lead to a more demanding integration task.
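A distributed integration component of this kind could look roughly like the sketch below, which translates a topic lookup directly into a query against the underlying system. The `orders` table, the URI prefix, the `OrderSystemAdapter` class, and the dictionary shape of the returned topic are hypothetical; an in-memory SQLite database stands in for the order-tracking system.

```python
import sqlite3

ORDER_PREFIX = "http://example.org/id/order/"  # hypothetical URI scheme


class OrderSystemAdapter:
    """Exposes rows of a stand-in order-tracking database as topics."""

    def __init__(self, connection):
        self._db = connection

    def get_topic_by_subject_identifier(self, identifier):
        if not identifier.startswith(ORDER_PREFIX):
            return None  # this system does not manage the concept
        order_id = identifier[len(ORDER_PREFIX):]
        # The topic map call becomes a live query; nothing is replicated.
        row = self._db.execute(
            "SELECT id, status, customer FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        if row is None:
            return None
        return {
            "subject_identifiers": [identifier],
            "types": ["Order"],
            "occurrences": {"status": row[1]},
            "associations": [("placed-by", row[2])],
        }


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, customer TEXT)")
db.execute("INSERT INTO orders VALUES ('117', 'shipped', 'ACC-00042')")

adapter = OrderSystemAdapter(db)
topic = adapter.get_topic_by_subject_identifier(ORDER_PREFIX + "117")
db.execute("UPDATE orders SET status = 'delivered' WHERE id = '117'")
fresh = adapter.get_topic_by_subject_identifier(ORDER_PREFIX + "117")
```

Because each call reads the source system directly, the second lookup sees the updated status without any synchronization step; the cost is that every topic map operation needs a corresponding translation like this one.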
Figure 4. Topic map broker architecture
Figure 5. Topic map syndication architecture
In either a centralized or distributed system, there needs to be a strong emphasis on the identification of the entities that each system manages. It is a simple matter to map entity-unique identifiers such as a customer account number or an order-tracking number into a Uniform Resource Identifier (URI) for the topic that represents that account or order, and with a little care identifiers can be constructed such that there is a common algorithm for converting between the URIs and the entity-specific identifiers.
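One possible common algorithm for converting between entity-specific identifiers and topic URIs is sketched below. The base URI and the `type/identifier` path layout are assumptions chosen for the example; the only requirement from the text is that the mapping be reversible.

```python
from urllib.parse import quote, unquote

BASE = "http://example.org/id/"  # hypothetical base URI for the organization


def to_subject_uri(entity_type, local_id):
    """Build a URI from an entity type and its system-specific identifier."""
    return f"{BASE}{quote(entity_type, safe='')}/{quote(local_id, safe='')}"


def from_subject_uri(uri):
    """Recover the entity type and system-specific identifier from a URI."""
    if not uri.startswith(BASE):
        raise ValueError(f"not an identifier under {BASE}: {uri}")
    entity_type, local_id = uri[len(BASE):].split("/", 1)
    return unquote(entity_type), unquote(local_id)


customer_uri = to_subject_uri("customer", "ACC-00042")
order_uri = to_subject_uri("order", "2005/117 A")  # needs percent-encoding
round_trip = from_subject_uri(order_uri)
```

Percent-encoding the local identifier keeps characters such as `/` from being confused with the path separator, so the conversion remains reversible for any identifier.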
The principal benefit of using a topic map for this sort of integration exercise is the flexibility that it allows for the modeling of the integrated data. As the model is simply data in a topic map and not a schema for any underlying system, new types of entities can be introduced without any need to alter any existing integration interfaces. Additionally, by representing core business entities as topics in a corporate ontology, the way is made clear to integrate data from enterprise information systems into the knowledge management system or even, where appropriate, out onto the Web.
All applications that make use of topic maps, including all of those presented previously, require some method to access the topic map information. Most applications also require a persistent store for topic map data and an in-process API to access and manipulate that data, but these are beyond the scope of this discussion. What is of more interest to the solutions architect is the ability to access topic map information through Web services calls. Topic maps have several properties that make them highly conducive to access through Web services. Let's take a look at them.
Topic addressability. Topics can be assigned server-unique (or if necessary, globally unique) identifiers. The topic addressability property allows us to build client applications that can track the provenance of the topic information they consume. Topic addressability also allows us to traverse association or typing information in the representation of a topic. For example, a query from a client might retrieve topic A with an occurrence that is typed by topic T. A second query from the client can then retrieve all information about topic T. In the sample Web service described in the sidebar, "A Simple Topic Map Web Service," topics can be retrieved by their unique identifier using the GetTopic() method.
Concept addressability. The concepts that topics represent can be assigned separate unique identifiers, allowing a query against multiple servers for information relating to a specific concept. Concept addressability is the key to supporting the distributed creation and maintenance of topic maps and the subsequent aggregation of the information in those maps. Because a concept (such as Person, Place, Fred Jones, or Birmingham) can be assigned its own URI separate from the system identifier of the topic that represents that concept, multiple systems can provide information about the same concept and use the concept identifier as the key used in aggregation.
For example, a query from a client might return a topic T with a subject identifier I. The client could then query a second topic map server for any topic with the subject identifier I to expand the amount of information known about the concept represented by that identifier. The advantage of concept addressability is that the client application does not have to know the system-specific identifier of the topic with the subject identifier I. In the sample Web service, this form of addressability is provided by the GetTopicBySubjectIdentifier() method (see the sidebar, "A Simple Topic Map Web Service").
Document-literal topics. Topics can be rendered easily as data structures that make use of globally unique topic identifiers or hyperlinks to represent the relationships among them. In SOAP terms, that means that we can derive a simple document-literal representation of a topic. If Representational State Transfer (REST) is your preferred Web service methodology, it is possible to construct a topic representation as a hyperlinked document sent in response to a REST-ful query. We describe the algorithm for producing such a representation shortly.
Standard merging rules. The merging rules of topic maps can be used to combine topic information received from multiple separate sources into a single functioning topic map. For topic map consumers, the merging rules make it possible to use concept addressability to find all information relating to a concept and then combine that information into a single topic that is presented to higher levels of the application. This merging allows a client to aggregate information from multiple topic map sources, and to then present an interface that makes it appear that all of this aggregated information has come from a single source. The merging rules also take concept addressability one step further, as they allow a topic map to declare two concepts to be equivalent by simply including the address of each concept on a single topic.
Figure 6. Topic map subgraph serialization
For example, a client could query for any information relating to a concept with the identifier I. If the topic map returns a topic T that has the identifiers I and I' as its subject identifiers, this tells the client that the source is claiming that the concept identified by I is the same as the concept identified by I'. If the client trusts the source to make this sort of claim, it may then proceed by querying for any information relating to the concept with the identifier I' and merging this with the information already received for the concept with the identifier I.
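The merging behavior in this example can be approximated with a short sketch. It implements only one rule, assumed here for simplicity: topics that share at least one subject identifier are combined, and their names and occurrences are unioned. The dictionary representation and the sample identifiers are hypothetical, and the topic maps standard defines further merging rules that are not modeled.

```python
def merge_topics(topics):
    """Merge topics that share at least one subject identifier."""
    merged = []  # groups kept disjoint in their subject identifiers
    for topic in topics:
        ids = set(topic["subject_identifiers"])
        names = set(topic["names"])
        occurrences = set(topic["occurrences"])
        remaining = []
        for other in merged:
            if ids & other["subject_identifiers"]:
                # Absorb every existing group that overlaps the new topic.
                ids |= other["subject_identifiers"]
                names |= other["names"]
                occurrences |= other["occurrences"]
            else:
                remaining.append(other)
        remaining.append(
            {"subject_identifiers": ids, "names": names, "occurrences": occurrences}
        )
        merged = remaining
    return merged


IDENT_I = "http://example.org/concept/birmingham"          # hypothetical
IDENT_I_PRIME = "http://example.net/places/birmingham-uk"  # hypothetical

from_source_a = {"subject_identifiers": [IDENT_I, IDENT_I_PRIME],
                 "names": ["Birmingham"], "occurrences": ["/cities/birmingham"]}
unrelated = {"subject_identifiers": ["http://example.org/concept/person"],
             "names": ["Person"], "occurrences": []}
from_source_b = {"subject_identifiers": [IDENT_I_PRIME],
                 "names": ["Brum"], "occurrences": ["/guides/brum"]}

result = merge_topics([from_source_a, unrelated, from_source_b])
city = next(t for t in result if IDENT_I in t["subject_identifiers"])
```

The topic from source A carries both identifiers, so the topic from source B, which knows only I', is folded into the same merged topic.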
Topic maps can be quite easily used with almost any client access architecture. We will examine three common architectures: client/server, broker, and syndication architectures.
The simplest of the topic map access architectures, the client/server architecture, makes use of a Web service interface that exposes operations to browse, query, and update the topic map. (See the sidebar, "A Simple Topic Map Web Service" for a description of one possible interface that consists of just eight methods.) As topic maps can be easily serialized as XML, there is no issue in using SOAP, REST, or even XML-RPC to implement such an interface.
The broker architecture interposes one or more additional servers between the source(s) of topic map data and the consumers (see Figure 4). A broker aggregates results from several servers performing any necessary merging and responding to a client as if the data had come from a single topic map. The aggregation performed by a broker may vary from simply distributing an operation and aggregating the results to more complex aggregation based on the subject identifiers returned from the various topic map sources.
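A broker of the more complex kind, one that aggregates based on returned subject identifiers, might be sketched as follows. The `TopicMapSource` class stands in for a remote topic map server, and `broker_lookup` is a hypothetical helper: it fans a lookup out to every source and then repeats the lookup for any further identifiers the sources report.

```python
class TopicMapSource:
    """Stand-in for a remote topic map server (hypothetical)."""

    def __init__(self, topics):
        self._topics = topics

    def get_topic_by_subject_identifier(self, identifier):
        for topic in self._topics:
            if identifier in topic["subject_identifiers"]:
                return topic
        return None


def broker_lookup(sources, identifier):
    """Query every source and merge what they know about one concept."""
    merged = {"subject_identifiers": set(), "names": set()}
    pending, asked = [identifier], set()
    while pending:
        current = pending.pop()
        if current in asked:
            continue
        asked.add(current)
        for source in sources:
            topic = source.get_topic_by_subject_identifier(current)
            if topic is None:
                continue
            merged["subject_identifiers"].update(topic["subject_identifiers"])
            merged["names"].update(topic["names"])
            # Follow any equivalent identifiers the source has declared.
            pending.extend(topic["subject_identifiers"])
    return merged


IDENT_I = "http://example.org/concept/birmingham"          # hypothetical
IDENT_I_PRIME = "http://example.net/places/birmingham-uk"  # hypothetical
server_a = TopicMapSource(
    [{"subject_identifiers": [IDENT_I, IDENT_I_PRIME], "names": ["Birmingham"]}]
)
server_b = TopicMapSource(
    [{"subject_identifiers": [IDENT_I_PRIME], "names": ["Brum"]}]
)

answer = broker_lookup([server_a, server_b], IDENT_I)
```

The client asks only about I, yet the answer includes what server B holds under I', so the response reads as if it had come from a single topic map. This sketch trusts every source's equivalence claims; a real broker would decide which sources to trust.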
The syndication architecture makes use of REST to distribute changes to a topic map model. The server that maintains the model simply writes changes as transaction documents containing serialized topic map subgraphs. These transaction documents are then picked up by consumers and applied to their local cache of the model. Syndication architectures work particularly well for distributing ontology topic maps that are relatively stable or for distributing topic maps that index content that is published on a regular schedule (for example, a news Web site's updates). The XML serialization of topic map subgraphs means that this architecture can make use of syndication standards such as Atom or RSS 2.0 to distribute transaction data.
Figure 5 shows an example of how a syndication system might work to keep synchronized two topic maps managed by different engines:
1. A change made to topic map engine A is written as an XML document to the syndication server. The syndication server makes a feed of recent transaction documents available to syndication clients.
2. The syndication client checks the feed on the syndication server and requests the transaction(s) it needs to apply.
3. The syndication client processes the transaction documents received from the syndication server and applies the changes to topic map engine B.
4. Clients interact with the updated topic map on topic map engine B, which is now synchronized to the last known state of the topic map on topic map engine A.
Step 2 could also be performed by a push of syndicated transaction information from the syndication server to the client; the mechanism used would depend on the application requirements and the facilities provided by the syndication mechanism used.
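The pull variant of this flow can be sketched in a few lines. The sketch makes several simplifying assumptions: transaction documents are plain dictionaries rather than XML, each carries a sequence number, and a transaction is a set of topic upserts plus a list of deletions. The class and attribute names are hypothetical.

```python
class SyndicationServer:
    """Holds an ordered feed of transaction documents (plain dicts here)."""

    def __init__(self):
        self._feed = []

    def publish(self, upserts=None, deletes=()):
        """Record a change made on engine A as the next transaction."""
        transaction = {
            "seq": len(self._feed) + 1,
            "upserts": dict(upserts or {}),
            "deletes": list(deletes),
        }
        self._feed.append(transaction)
        return transaction["seq"]

    def feed(self, since=0):
        """Return the transactions published after sequence number `since`."""
        return [t for t in self._feed if t["seq"] > since]


class SyndicationClient:
    """Applies transactions to a local cache standing in for engine B."""

    def __init__(self, server):
        self.server = server
        self.topics = {}
        self.last_applied = 0

    def poll(self):
        for transaction in self.server.feed(since=self.last_applied):
            self.topics.update(transaction["upserts"])
            for topic_id in transaction["deletes"]:
                self.topics.pop(topic_id, None)
            self.last_applied = transaction["seq"]
        return self.last_applied


server = SyndicationServer()
client = SyndicationClient(server)
server.publish(upserts={"t1": {"names": ["Person"]}, "t2": {"names": ["Place"]}})
server.publish(upserts={"t1": {"names": ["Person", "Individual"]}}, deletes=["t2"])
client.poll()
server.publish(upserts={"t3": {"names": ["Birmingham"]}})
client.poll()
```

Tracking the last applied sequence number is what lets the client request only the transactions it still needs; a push model would deliver the same documents without the polling step.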
Applications that access topic maps frequently require a result set that consists of a single topic (or a list of topics) that match a query. However, the topic map model is essentially a graph model in which topics are connected either through associations or through typing relationships, so when returning a topic it is necessary for the server to provide some context for the client. Essentially a server is required to extract a subgraph from the topic map graph.
In our experience the best way to go about serialization is to start with the concept of two types of topic serialization. A full serialization presents all of the information directly connected to a topic: its types, identifiers, names, occurrences, and all of the associations in which it participates. A partial serialization presents a minimal set of information that can be used by a client. Exactly what information is present in a partial serialization may vary from one implementation to another; for some implementations just a unique topic identifier might be sufficient, but other implementations may require all identifiers present on a topic plus a name or occurrence of the topic chosen according to some algorithm. The principal guide in determining what is present in a partially serialized topic is that a partial serialization should not contain any references to other topics; thus, the partially serialized topics form the leaves of the subgraph returned by a serialization.
With these two definitions, subgraph extraction is a relatively straightforward task. To extract a small subgraph centered on a topic, perform a full serialization of that topic. For each topic that the fully serialized topic references, create a partial (stub) serialization of that topic, and replace the reference to it with a reference to the stub.
Larger subgraphs can be extracted by specifying a breadth parameter that defines a maximum number of associations to traverse. Every topic that can be reached by traversing a number of associations up to the breadth parameter from the starting topic should be fully serialized, and all other referenced topics serialized as stubs. Figure 6 shows an example of the serialization of a topic map subgraph using two different breadths of subgraph.
Part a in Figure 6 shows that the serialization of topic A is performed with a breadth of 0, so only topic A is fully serialized. However, to serialize the associations that topic A participates in, each of the topics B, C, and D must be partially serialized. Note that topic C has an association to topic D, but because C is only partially serialized, that association is not part of the serialized subgraph even though its ends are. Part b in Figure 6 shows the serialization of topic A with a breadth of 1. In this serialization, topic A and all topics that can be reached by traversing one association starting from topic A (that is, topics B, C, and D) are all fully serialized. To perform the full serialization of topic B, topic E must be partially serialized, and to perform the full serialization of topic C, topics F and G must be partially serialized. Topic D is only connected to topics that are fully serialized, so no additional information is required to serialize the associations it participates in.
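The breadth-limited extraction can be expressed as a breadth-first traversal. The sketch below is one possible reading of the procedure, and the association list is inferred from the description of Figure 6 in the text rather than taken from the figure itself; the function name and return shape are hypothetical.

```python
from collections import deque

# Associations inferred from the textual description of Figure 6.
ASSOCIATIONS = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("C", "D"),
    ("B", "E"), ("C", "F"), ("C", "G"),
]


def extract_subgraph(start, breadth, associations):
    """Return (fully serialized topics, partial topics, serialized associations)."""
    neighbours = {}
    for a, b in associations:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first walk, stopping once `breadth` associations are traversed.
    depth = {start: 0}
    queue = deque([start])
    while queue:
        topic = queue.popleft()
        if depth[topic] == breadth:
            continue
        for neighbour in sorted(neighbours.get(topic, ())):
            if neighbour not in depth:
                depth[neighbour] = depth[topic] + 1
                queue.append(neighbour)

    full = set(depth)
    # Topics referenced by a fully serialized topic become the stub leaves.
    partial = {n for t in full for n in neighbours.get(t, ())} - full
    # An association is emitted only if at least one end is fully serialized.
    edges = [(a, b) for a, b in associations if a in full or b in full]
    return full, partial, edges


full_0, partial_0, edges_0 = extract_subgraph("A", 0, ASSOCIATIONS)
full_1, partial_1, edges_1 = extract_subgraph("A", 1, ASSOCIATIONS)
```

With a breadth of 0 the association between C and D is omitted because neither end is fully serialized; with a breadth of 1 it is included, and E, F, and G become the stub leaves.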
Having identified the subgraph of the topic map to be extracted, all that remains is to serialize the data in that subgraph as XML. Although the topic maps standard does define an XML interchange syntax, it is a syntax better tuned for authoring and interchange of whole topics and does not provide any syntax for distinguishing between fully and partially serialized topics. It is therefore necessary to define a separate schema for serialization of topic map subgraphs. In designing our own schema, we also took the opportunity to simplify the XTM syntax to remove authoring convenience features, resulting in a schema that can be more easily processed by XML data-binding tools and by XSLT/XPath.
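To show what distinguishing full from partial topics might look like on the wire, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`fragment`, `topic`, `topicStub`, `association`, `ref`) are invented for this example; they are neither XTM nor the authors' schema, which is not reproduced in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: topic names and the associations among them.
NAMES = {"A": "Topic A", "B": "Topic B", "C": "Topic C", "D": "Topic D"}
ASSOCIATIONS = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")]


def serialize_fragment(full, partial):
    """Serialize full topics with their associations, and stubs without."""
    root = ET.Element("fragment")
    for topic_id in sorted(full):
        topic = ET.SubElement(root, "topic", {"id": topic_id})
        ET.SubElement(topic, "name").text = NAMES[topic_id]
        for a, b in ASSOCIATIONS:
            if topic_id in (a, b):
                other = b if topic_id == a else a
                ET.SubElement(topic, "association", {"ref": other})
    for topic_id in sorted(partial):
        # A stub carries an identifier and a name but no references.
        stub = ET.SubElement(root, "topicStub", {"id": topic_id})
        ET.SubElement(stub, "name").text = NAMES[topic_id]
    return ET.tostring(root, encoding="unicode")


xml_text = serialize_fragment(full={"A"}, partial={"B", "C", "D"})
```

Using a separate element for stubs lets a data-binding tool or an XPath expression tell at a glance which topics are complete and which must be fetched if more detail is needed.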
Kal Ahmed is cofounder of Networked Planet Limited (www.networkedplanet.com), a developer of topic map tools and topic maps-based applications for the .NET platform. He has worked in SGML and XML information management for 10 years, in both software development and consultancy, and on the open source Java topic map toolkit, TM4J, and he has contributed to the development of the ISO standard.
Graham Moore is cofounder of Networked Planet Limited, and he has worked for eight years in the areas of information, content, and knowledge management as a developer, researcher, and consultant. He has been CTO of STEP, vice president of research and development at empolis GmbH, and chief scientist at Ontopia AS.
Resources
Topic Map Web Services
"White Paper: Topic Maps in Web Site Architecture," (Networked Planet Ltd., 2005)
"TMShare: Topic Map Fragment Exchange in a Peer-to-Peer Application," Kal Ahmed
XML Topic Maps (XTM) 1.0
This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal website.