Introduction to Microsoft Index Server
Windows NT Query Team
October 15, 1997
This is the second of a series of articles to help you understand and effectively deploy Microsoft's search solutions on your Web sites and intranets. The first article, Anatomy of a Search Solution, helped you understand what to expect of a search solution to meet your site's needs. This article is designed to help you understand the features and capabilities of Microsoft® Index Server, a Microsoft Internet Information Server (IIS)–based solution for a single-server Web site.
The first version of the product was released in August 1996, followed by an incremental revision in December 1996. Version 2.0 of Index Server is scheduled for release with the Windows NT® 4.0 Option Pack.
Index Server Overview
The engine at the heart of Index Server started out as Content Indexer (CI), an integral part of Object File System (OFS), which was developed as part of a Windows NT operating system. Windows NT technologies have been integrated into a number of products. CI is one such technology that has found its way into Index Server. As part of the file system, CI was designed from its conception to a different and higher set of standards of reliability, efficiency, and robustness than that of a typical search solution. After all, we hold OS components to a much higher standard than what we expect of a typical application.
Index Server is a solution built for the Web around CI. As of October 1997, Microsoft has shipped Index Server versions 1.0 and 1.1 as a Web download for Microsoft Windows NT Server 4.0 and Internet Information Server (IIS) 3.0. As a Web service, it is designed to be highly configurable, provide secure access to your corpus, reliably operate round-the-clock, require minimal intervention, and work seamlessly with IIS. As a search solution, Index Server is designed to have little resource overhead, automatically detect and index a dynamic corpus, provide round-the-clock querying capability, index content and properties of documents in several languages, and be extensible.
Query formulation and result browsing can be accomplished using any standard Web browser. Index Server, however, is a complete, end-to-end, IIS-based solution for a Web site or an intranet. Version 2.0 includes a server-side query object for use with Active Server Pages (ASP) for improved programmability. In addition, Index Server is integrated with the IMAP and NNTP components of IIS, which it uses to index e-mail and news files and process queries by IMAP4rev1-compatible mail clients, such as Microsoft Outlook® Express and Netscape Messenger 4.0.
Index Server is designed to gather documents residing on a single Web server and can satisfy the search solution needs of most Web sites. However, if your document corpus is highly distributed or otherwise needs to be gathered by a Web crawler using a protocol such as HTTP, Microsoft Site Server may work better for you.
There are no built-in limits on the number of documents that can be handled by Index Server. It is known to have handled several million documents filling several gigabytes of disk space. The processing power and memory capacity of servers running Windows NT Server are continuously increasing, so the capacity handled by Index Server is always reaching new limits. But if your site or intranet is too large to be handled by a single Web server, Index Server may not be for you. That is because Index Server is not designed to spawn queries to multiple servers and seamlessly collate resulting hits from target servers. You can write scripts to simulate seamless integration of multiple search servers, but be warned that this may not be as efficient as you would like.
Description of Index Server Components and Features
Most component and feature descriptions in this section are poured into the search solution mold introduced in my article Anatomy of a Search Solution. Reading that article first will help you make the most of this one.
The Document Corpus
The corpus collectively refers to all the documents you want to include in your index. This subsection provides a description of all the features provided by Index Server that relate to the composition and characteristics of your corpus.
Index Server provides the same security that Windows NT provides. Every file and directory of an NTFS volume can be secured. These security attributes are indexed by Index Server and are used to enforce access rights of the IIS client issuing the query. For example, salaries.xls is a file that can be accessed by user A, but not by user B. A query from A's account that matches salaries.xls contains that file in the hit list, but the hit list for the same query from B's account does not have any evidence of the existence of that file.
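The salaries.xls example can be sketched in a few lines of Python. This is only an illustration of the idea of trimming a hit list by access rights; the IndexedDoc class, the readers set (a stand-in for an NTFS access control list), and the secure_search function are hypothetical names invented here and are not part of Index Server.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    path: str
    text: str
    # Hypothetical stand-in for the NTFS access control list of the file.
    readers: set = field(default_factory=set)


def secure_search(docs, term, user):
    """Return paths of matching documents that `user` is allowed to read.

    Documents the user cannot read are silently omitted, so the hit list
    carries no evidence that they exist.
    """
    term = term.lower()
    return [
        d.path
        for d in docs
        if term in d.text.lower().split() and user in d.readers
    ]


corpus = [
    IndexedDoc("salaries.xls", "salary figures for all staff", {"A"}),
    IndexedDoc("handbook.doc", "staff handbook and salary policy", {"A", "B"}),
]

hits_for_a = secure_search(corpus, "salary", "A")
hits_for_b = secure_search(corpus, "salary", "B")
```

Here user A sees both files in the hit list, while user B sees only handbook.doc and gets no hint that salaries.xls matched.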
Index Server uses the IFilter interface to extract the content and the properties of a file. This interface is a general-purpose document filter and is documented in the Microsoft Platform Software Development Kit (SDK). Modules that know a specific file format and can expose the content and properties of those files using the IFilter interface are called filters for that format. Index Server ships with filters for common file formats such as HTML and Microsoft Office documents (files with .xls and .doc extensions, for example). Owners of proprietary file formats or third-party vendors may provide filters for their formats so the contents can be made available to users of the IFilter interface. Search the Internet with a query such as "IFilter Index Server filter" and you will find information about filters from third-party vendors and file format owners. If you have a proprietary file format, you may develop your own filter for that format. Thus, Index Server gives you the ability to index documents stored in almost any file format.
It is very likely that your site has documents written in several languages, some of which have multiple languages interspersed within them. Index Server supports multiple languages using a set of modules (Windows DLLs) and language-specific files (such as a noise word list) for each language. Languages currently supported are:
All characters stored in the index are in Unicode and all queries are converted to Unicode to enable a language-independent comparison.
Index Server has no built-in limits on the number of documents it can handle or the total size of the corpus. For all practical purposes, it is only limited by the hardware and the associated systems, such as Windows NT Server and IIS. As of August 1997, it is known to have satisfactorily handled several million documents of a corpus spanning several gigabytes of storage space. Different sites have different needs, so the best way to find out if your site's scalability requirements are satisfactorily handled is to try Index Server out.
The corpus indexed by Index Server is organized as a set of scopes. Scopes are directories (local and remote) in the Windows NT file system or IIS virtual directories. You specify a set of scopes that comprises your corpus along with an optional set of scopes that should be excluded from the corpus. Index Server traverses the directories to determine what files belong to the corpus. Each file is treated as a single document. If your document is spread across multiple files, it should be placed in a single file to make Index Server index it as one document. On the other hand, if you have multiple documents in a file, you should split that into multiple files to make Index Server see multiple documents instead of one.
Index Server can detect additions, deletions, and modifications to your corpus and automatically schedule these changed documents for indexing. When available, it relies on the built-in notification mechanism provided by the file system.
As you may have surmised by now, Index Server gathers documents that are individual files in the file systems supported by Windows NT. There is no provision to crawl the Internet or an intranet to gather documents of interest. Of course, nothing prevents the perceptive reader from employing a Web crawler to gather documents of interest into one or more Index Server scopes. However, a discussion of such possibilities (and be aware that many such exist) is beyond the scope of this article.
As Index Server enumerates files that should be indexed, it uses the file extension (.xls, .doc, and so on) to determine the filter DLL to unleash on the file. Association between file extensions and the corresponding filter DLLs may be established through the registry. Files with certain extensions may not have anything of interest to index, and such extensions (such as .exe) may be excluded through the registry. Index Server does not attempt to extract the contents of files with an excluded extension, but it does retrieve their properties. As mentioned earlier, filters use the IFilter interface to expose the contents of a file. Currently, Index Server ships with filters for HTML, plain text, and Microsoft Office formats.
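The dispatch described above might be pictured as a pair of lookup tables. The sketch below is an assumption-laden illustration: the FILTERS and EXCLUDED tables stand in for registry settings, the filter names are made up, and falling back to a "default_filter" for unknown extensions is a choice made here for the example, not documented behavior.

```python
import os

# Hypothetical stand-ins for the registry associations; names are invented.
FILTERS = {
    ".htm": "html_filter",
    ".html": "html_filter",
    ".txt": "text_filter",
    ".doc": "office_filter",
    ".xls": "office_filter",
}
EXCLUDED = {".exe"}


def plan_filtering(path):
    """Decide how a file would be treated, based only on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXCLUDED:
        # Contents are skipped, but properties are still retrieved.
        return {"filter": None, "index_contents": False, "index_properties": True}
    return {
        # Assumption: unknown extensions fall back to a default filter.
        "filter": FILTERS.get(ext, "default_filter"),
        "index_contents": True,
        "index_properties": True,
    }


report_plan = plan_filtering("report.DOC")
setup_plan = plan_filtering("setup.exe")
```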
Filters extract the content and properties (author, creation date, and so on) of the document. Index Server indexes both content and properties. Note that from Index Server's point of view, a document is what the filter says it is. So pay attention to your filter if you have special needs.
Recognizing Features of Your Content
Documents are much more than a stream of characters. They contain concepts threaded into a stream of thought using syntactic and semantic constructs of varying complexity provided by the language. Text processors implemented in software are generally not capable of "understanding" natural language documents as thoroughly as we humans can. The problem is more acute in processors that ship with general-purpose search solutions. There are several layers of abstraction in a document and different text processors differ in the level of abstraction they can handle.
Index Server supports multiple languages, so there are several sets of modules (one set per language) that recognize concepts in text. The Index Server word breaker recognizes individual words. The Index Server stemmer can extract the root form of a given word, which enables Index Server to group words of the same stem together. For example, "swimming", "swam", and "swimmer" are word variants of the word "swim". This ability to group words based on word stems enables Index Server to go one level of abstraction beyond simple words. That enables your user to concentrate on the concept of "swimming" instead of attempting to pick the right word that may have been used in the target document.
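To make the grouping by stem concrete, here is a toy Python sketch. Real stemmers are language-specific modules; the STEMS lookup table below is a hypothetical substitute used only so the example is self-contained, and build_stem_index is an invented helper name.

```python
from collections import defaultdict

# Toy lookup table standing in for a real, language-specific stemmer.
STEMS = {"swimming": "swim", "swam": "swim", "swimmer": "swim", "swims": "swim"}


def stem(word):
    """Map a word to its root form; unknown words are their own stem."""
    w = word.lower()
    return STEMS.get(w, w)


def build_stem_index(docs):
    """Index each document under the stems of the words it contains."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stem(word.strip('.,;:!?"'))].add(doc_id)
    return index


docs = {
    1: "She swam across the lake.",
    2: "Swimming is fun.",
    3: "The lake is cold.",
}
index = build_stem_index(docs)
swim_docs = index[stem("swimmer")]
```

A query for any variant of "swim" lands on the same index entry, so documents 1 and 2 are both found even though neither contains the word "swimmer".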
A higher level of abstraction is to be able to extract noun phrases (for example, "White House"). The concept explicitly expressed by "White House" is very distinct from the concepts expressed by "White" and "House". This phrase-recognition capability enables your users to lock on to documents containing "White House" while ignoring documents that merely contain the component words. Beyond that, natural language processing has not matured enough to be of much use in a general-purpose search solution. An ability to plug in special-purpose linguistic components would have been convenient for those who need it, but Index Server versions up to 2.0 do not have that provision. Of course, you can develop a query-formulation front-end that can perform any processing you want on the query.
Not all words in a document are worth indexing. To enhance the quality of matches returned in response to a query, you can remove such noise words by creating a stop list. In addition, removing noise words significantly reduces the size of the index. Index Server provides a conservative list of noise words for each language. You may edit this list to suit your needs.
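As a rough picture of what a noise word list does, consider the following Python sketch. The NOISE_WORDS set is a tiny made-up sample, not the list that ships with Index Server, and index_terms is a hypothetical helper.

```python
# A tiny sample list for illustration; the shipped lists are per language.
NOISE_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in"}


def index_terms(text, noise_words=NOISE_WORDS):
    """Return the words of `text` that are worth indexing."""
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    return [w for w in words if w and w not in noise_words]


terms = index_terms("The anatomy of a search solution")
```

Three of the six words survive, which hints at why dropping noise words shrinks the index.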
Indexing happens in the background only when there are sufficient system resources available to index without adversely affecting system performance. Parameters controlling indexing performance can be set by the administrator to tune performance to the site's needs. An introduction to Index Server's indexing-related features follows.
Recall that Index Server was designed to be part of a file system. Files are constantly changing in a file system, so Index Server has been designed to seamlessly assimilate constantly changing documents.
Index Server's indexing happens in multiple stages. Processed documents move from simple in-memory structures to sophisticated and efficient on-disk indexes in multiple stages. This gradual transition allows the documents to be available for search as soon as they have been shredded to pieces by the word breaker. Index Server strives to make the index available at all times.
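The staged approach can be sketched as a small in-memory structure that is periodically folded into a persistent one, with queries consulting both. The Python below is a simplified illustration under that assumption; the StagedIndex class, its two dictionaries, and the max_in_memory_docs threshold are invented for this example and do not describe the actual on-disk format or merge policy.

```python
from collections import defaultdict


class StagedIndex:
    """Toy two-stage index: a small in-memory stage plus a 'master' stage."""

    def __init__(self, max_in_memory_docs=2):
        self.memory = defaultdict(set)   # recently indexed documents
        self.master = defaultdict(set)   # stand-in for the on-disk index
        self.pending = 0
        self.max_in_memory_docs = max_in_memory_docs

    def add(self, doc_id, text):
        # The document is searchable as soon as its words are recorded.
        for word in text.lower().split():
            self.memory[word].add(doc_id)
        self.pending += 1
        if self.pending >= self.max_in_memory_docs:
            self.merge()

    def merge(self):
        # Fold the in-memory stage into the master stage.
        for word, ids in self.memory.items():
            self.master[word] |= ids
        self.memory.clear()
        self.pending = 0

    def search(self, word):
        # Queries see both stages, so the index is never unavailable.
        w = word.lower()
        return self.memory.get(w, set()) | self.master.get(w, set())


idx = StagedIndex(max_in_memory_docs=2)
idx.add(1, "index server")
before_merge = idx.search("server")
idx.add(2, "query server")
after_merge = idx.search("server")
```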
Index Server requires minimal human intervention during its day-to-day operation. Abnormal events, such as network failures, unexpected resource usage (such as a full disk), and power interruptions happen often enough in an operating environment to warrant "defensive indexing."
There are different levels of defensive indexing. The bottom-most level guarantees data integrity, which ensures that the documents being indexed are not rendered unusable due to exceptional conditions. The next level guarantees data veracity, which ensures that documents that are claimed to be in the index are retrieved in response to appropriate queries and those not yet in the index will be assimilated in due course. The next level guarantees automatic operational continuity, which ensures that Index Server does whatever it takes to automatically continue from where it left off when it was last interrupted. Notice how we are climbing this pyramid of guarantees and gaining more conveniences as we scale the infrastructure provided by Index Server.
The next level guarantees that when Index Server resumes indexing, it does not have to redo what it already did before going down. For example, if the service is abruptly terminated during the resource-intensive master merge of indexes, it will resume the merge when it is restarted. This saves time, because having to redo even a portion of a production site is expensive and inconvenient.
The indexer provides several parameters to help you fine-tune its performance. Of course, they are initially configured to reasonable defaults. Later, as you gain a better understanding of your needs, you can flex your administrative muscle and dictate your own terms to the indexer. You can control how your corpus will be scanned and filtered, prioritize filtering and indexing processes and threads with relation to the rest of the system, control how much main memory is consumed by the indexes, and control when the in-memory indexes are merged with the master on disk.
The Query Language
Index Server sports a rich query language that can address a variety of information needs. It supports Boolean operators, proximity operators, and "fuzzy" operators on words. In addition, there are provisions to query using natural language text (also called free text, which is free of any operators and other commands) and to query on document properties using relational operators. You can also dictate the relative importance of components of your query, and there is a front end for extended Structured Query Language (SQL) queries.
The Boolean operators AND, OR, NOT, and AND NOT help you combine individual clauses to form a precise query. Boolean operators are so ubiquitous that they need no further explanation. On content queries, Index Server supports only the AND NOT operator, not the standalone NOT operator. The NOT operator is supported on property queries.
Boolean queries retrieve accurate results. But that would be a lot like Spock—all accuracy and no intuition. If you know exactly what you need, and have the patience to express that as a Boolean query, you can get much mileage out of Boolean operators. Unfortunately, your average users aren't like that. They may not know exactly what they are looking for. Even if they did, they would have difficulty expressing their need as a Boolean query. Since Boolean queries are all-or-nothing propositions, they may end up retrieving too few documents (a rather narrow search) or too many (an overwhelmingly broad search).
To improve search constraints, Index Server supports a variant of the Boolean operators, called the proximity operator (near). Use this operator to query for words that are within a short distance of each other. A special case of nearness is the phrase, which expects the words to be available adjacent to each other, and in that order. The proximity operator is as accurate as the Boolean operators, yet significantly narrows the set of matching documents by virtue of the proximity constraint. It also fits well with the way we approach search engines. We skim over a lot of information during our busy daily lives and retain only a few concepts from this brief exposure. Later, when we need the information, we jog our memories and recall a few concepts. Many of these concepts are phrases such as names and titles. In recognition of this need, feature recognition employed by Index Server includes recognition of common noun phrases.
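The difference between nearness and a phrase can be shown with word positions. The following Python is a minimal sketch, not Index Server's implementation; the function names are invented, and the default window of 50 words in near is an assumption chosen for the example.

```python
def positions(text):
    """Map each word to the list of positions where it occurs."""
    pos = {}
    for i, word in enumerate(text.lower().split()):
        pos.setdefault(word.strip('.,;:!?"'), []).append(i)
    return pos


def near(text, a, b, max_distance=50):
    """True if `a` and `b` occur within `max_distance` words of each other."""
    pos = positions(text)
    return any(
        abs(i - j) <= max_distance
        for i in pos.get(a.lower(), [])
        for j in pos.get(b.lower(), [])
    )


def phrase(text, words):
    """True if the words occur adjacent to each other, in the given order."""
    pos = positions(text)
    words = [w.lower() for w in words]
    return any(
        all(start + k in pos.get(w, []) for k, w in enumerate(words))
        for start in pos.get(words[0], [])
    )


text = "The White House issued a statement about the house painted white."
is_phrase = phrase(text, ["White", "House"])
reversed_phrase = phrase(text, ["House", "White"])
close_enough = near(text, "statement", "painted", max_distance=4)
```

The phrase test accepts "White House" but rejects "House White", even though both words appear in the sentence, which is exactly the narrowing effect described above.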
While searching, you are often looking for documents with certain concepts. For example, a document about computers can contain many expressions of the concept of computing (computers, computing, computer, computational, and so on) in its prose. Having to guess all the possible variants is no fun, so you are empowered with the simpleminded wildcard operator. You can use "comput*" in your query, where * is the wildcard operator, to get all the aforementioned expressions of the concept of computing. But what about the concept of swimming, which encompasses words such as swim and swam? The stemming operator is used to capture all words that have the same stem (the same as a word root). You can use "swim**" in your query, where ** is the stemming operator, to match all words that stem to that root form.
As you may have surmised by now, Boolean operators are very effective when your search space is small and has a few well-defined features. One way to reduce the search space is to tag documents with properties. Common examples of properties are author and subject keywords. The set of property values associated with a document sharply characterizes it and Index Server allows you to target them. The available set of operators includes relational operators and set membership operators.
Relational operators, such as greater than and less than, allow you to retrieve documents falling within a range of interest. Relational operators are so ubiquitous that they need no further explanation.
The set membership operators (all of, some of, any of), on the other hand, are not commonly available. They allow you to query for documents that contain attributes in relation to a given set of attributes. Consider the case of file attributes maintained by the Windows NT file system. Each file has attributes such as read-only, system, and hidden. If you are looking for documents that contain some file attributes, you can use set membership operators. For example, to query for all files with the archive bit on, your query will be "@attrib ^s 0x20" where "attrib" is the file attributes property, "^s" is the "some of" operator, and 0x20 is the bit pattern signifying the file has the archive attribute.
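In terms of bit arithmetic, the "some of" and "all of" tests on a property such as file attributes reduce to simple masking. The Python sketch below illustrates this; the constants use the standard Win32 file attribute values, while the some_of and all_of function names are ours.

```python
# Standard Win32 file attribute bit values.
FILE_ATTRIBUTE_READONLY = 0x01
FILE_ATTRIBUTE_HIDDEN = 0x02
FILE_ATTRIBUTE_SYSTEM = 0x04
FILE_ATTRIBUTE_ARCHIVE = 0x20


def some_of(value, mask):
    """True if at least one bit of `mask` is set in `value`."""
    return (value & mask) != 0


def all_of(value, mask):
    """True if every bit of `mask` is set in `value`."""
    return (value & mask) == mask


# A read-only file with the archive bit on.
attrib = FILE_ATTRIBUTE_READONLY | FILE_ATTRIBUTE_ARCHIVE

has_archive = some_of(attrib, FILE_ATTRIBUTE_ARCHIVE)
hidden_or_archive = some_of(attrib, FILE_ATTRIBUTE_HIDDEN | FILE_ATTRIBUTE_ARCHIVE)
hidden_and_archive = all_of(attrib, FILE_ATTRIBUTE_HIDDEN | FILE_ATTRIBUTE_ARCHIVE)
```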
Natural language (free-text) queries
Your users may prefer to be "natural" with their computer and ask questions such as, "Tell me how to become a millionaire." Or they may be reading a paragraph and want more documents that have content similar to that of the paragraph. They may not know, or may not care, about the individual features of that paragraph that made the document relevant to their need.
With Index Server, they can submit a natural language query (also known as a free-text query) and hope that it will return more relevant documents to quench their thirst for knowledge. They don't have to know a thing about operators, operator precedence, and other such gory details of the query language. Index Server does its best to pick out features from the natural language query and processes the resulting query.
As time passes, Boolean and property queries will continue to truthfully return the same set of documents. Nothing exciting there! Your users will still have to learn how to make precise queries. But with natural language queries, advances in natural language understanding, user modeling, and so on will empower successive releases of Index Server to do a better job of figuring out the need expressed by the query.
Index Server always computes a score (or a rank, in Index Server terminology) for each of the matching documents. A higher score implies a better match. Sorting by this score helps us concentrate on the better matches. Usually all the words and phrases in your query have an equal share in determining the score of a document. But what if you consider some of them to be more important than others? For example, if you are searching for documents discussing characteristics of rivers likely to contain gold sediment deposits in them, you may want to give the words "sediment" and "river" more importance than other words. This would suppress documents that deal with deposits of gold elsewhere in favor of documents that are likely to be relevant to you. Well, you guessed it! Index Server allows you to influence the impact of individual query features on document scores. You can express them as a comma-separated vector of features. For example, "sediment*, river*, gold, deposit, water" is a valid vector query that treats "gold" with twice as much importance as "deposit". This is a rather simple example. You can compose more sophisticated queries and Index Server will gladly oblige.
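The effect of giving query terms unequal importance can be illustrated with a toy scoring function. Everything in this Python sketch is an assumption made for illustration: the numeric weights are hypothetical, the weights are passed as a dictionary rather than in Index Server's query syntax, and the scoring formula (matched weight divided by total weight) is not the ranking algorithm Index Server uses.

```python
def vector_score(text, weighted_terms):
    """Score a document as the share of total query weight that it matches."""
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    total = sum(weighted_terms.values())
    matched = sum(
        weight for term, weight in weighted_terms.items() if term in words
    )
    return matched / total if total else 0.0


# Hypothetical weights: "gold" counts twice as much as "deposit".
query = {"sediment": 4, "river": 4, "gold": 2, "deposit": 1, "water": 1}

docs = {
    "rivers.txt": "gold found in river sediment",
    "mines.txt": "gold deposit in a deep mine",
}

ranked = sorted(docs, key=lambda name: vector_score(docs[name], query), reverse=True)
```

With these weights the river document outranks the mining document, even though both mention gold, because the heavily weighted terms pull it to the top.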
You can combine several types of subqueries to form a compound query. This gives you the ability to use different operators for different document properties. For example, you may want to issue a natural language query against the body of text but you know that it is worth looking for that text only in text-oriented documents. You can express that as "(#filename *.doc OR #filename *.txt) AND $contents "How can I print on both sides of the page?"" The possibilities are endless.
Version 2.0 of Index Server provides an alternative method of issuing queries on an indexed file system. You can write SQL queries in applications that use ActiveX™ Data Objects (ADO). The SQL used with Index Server consists of extensions to the subset of SQL-92 and SQL3 that specifies queries on relational database systems. This SQL includes extensions to the SELECT statement and its FROM and WHERE clauses.
The Query Processor
Index Server's query processor is designed to be a robust document retrieval solution for an intranet. It is available to process queries as long as the service is up, and it supports users with different sets of information needs and access privileges.
The foremost concern of any document retrieval system ought to be security. Users should not be able to retrieve any documents they would not be able to access otherwise, say through the file system. Index Server authenticates the user account issuing the query and retrieves only those matching documents that the account has access to. Index Server ensures that it does not even acknowledge the presence of off-limit documents. For example, if a total of 20 documents matched the query, but the user account has access to only 10 of them, then Index Server reports only 10 matches to that user.
Perpetual index availability
The search component of the service is the first to be available, the last to go down, and is available at all times in between. The index maintained by the system is in a constant state of flux as it responds to changes in the corpus. The index is available at all times, even as it is changing in the background. And all changed documents are available for search as soon as they are filtered. The index is also available for search during routine maintenance tasks such as master merges (the process of flushing recently indexed documents to the persistent index) and during not-so-routine tasks such as recovery from corruption.
The corpus is logically organized as a set of scopes, where each scope is mapped to an IIS virtual root or a physical directory. All searches are scoped, which means you can limit the result set to contain documents from only one or more of the scopes known to the index. This provides a simple way to logically organize your corpus and target queries at portions of the corpus. For example, each department's documents can go into one scope and users interested in documents from a given department can have their queries scoped accordingly.
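Scoping a query amounts to keeping only those hits that live under one of the requested directories. The Python sketch below shows the idea with made-up paths; in_scope and scoped_hits are hypothetical helpers, and the comparison relies on the case-insensitive, component-wise semantics of pathlib.PureWindowsPath.

```python
from pathlib import PureWindowsPath


def in_scope(path, scopes):
    """True if `path` is one of the scope directories or lies beneath one."""
    p = PureWindowsPath(path)
    return any(
        PureWindowsPath(s) == p or PureWindowsPath(s) in p.parents
        for s in scopes
    )


def scoped_hits(hits, scopes):
    """Keep only the hits that fall inside the requested scopes."""
    return [h for h in hits if in_scope(h, scopes)]


hits = [
    r"D:\docs\sales\q3.doc",
    r"D:\docs\hr\policy.doc",
    r"D:\docs\salesforce\notes.doc",
]

sales_only = scoped_hits(hits, [r"d:\docs\sales"])
```

Because the comparison works on whole path components, the salesforce directory is not mistaken for part of the sales scope.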
Optimized for type of access
Users have different requirements for fetching the matching results. Some may just need a set of results and can fetch them, one by one, as quickly as the search engine can retrieve them. Others may need a set of sorted results, forcing the engine to retrieve all hits before returning the sorted set. Given that the result fetch phase is usually the most time-consuming phase of a query session, there is much incentive to optimize it.
Index Server provides sequential queries for efficient forward-only scrolling and nonsequential queries for forward and backward scrolling. It also provides enumerated queries that completely scan the corpus to find matches in response to certain property queries. And then there is nondeferred trimming, which enables you to choose between speed and completeness when certain resource-intensive queries are executed.
The query processor provides several configurable parameters that allow you to fine-tune its performance. A sampling of the available parameters follows. You can cache properties of your choice in a property cache to improve result fetch speed. You can further improve the cache's performance by allowing it to use more main memory. You can also maintain a cache of recently executed queries and their results to speed up resolution of common queries against a static index. You can also control per-query resources such as maximum execution time, query complexity (as measured by internal resources used to resolve it), maximum size of the result set, and maximum simultaneously executed queries.
The Hit List
This list contains the documents matching the query and should have all the information needed to help the user determine whether a document holds any promise of being relevant. Index Server allows you to retrieve properties of your choice for you to display. Retrieved result sets can be sorted on the basis of one or more properties.
Browsing the Documents
As you would expect, Index Server can provide you the full path (which can be used to compose a URL) for each matching document. It is strongly recommended that you allow the user to peruse the document in its original form using a full-fidelity viewer. The nontextual content and the layout will help the user make the most of the document.
Systems that can explain their actions inspire confidence in their users. One way of explaining a search engine's actions is to highlight segments of the document that caused it to be chosen as a match. Index Server provides a list of hits within each retrieved document that you can use to highlight the words that were part of the original query. Index Server also provides an Internet Server API (ISAPI) module, WebHits, which generates HTML pages incorporating highlighted hits.
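Hit highlighting of the kind described here can be sketched as follows. This Python is only an illustration of the general technique and says nothing about how the WebHits module works internally; the highlight function and the choice of the b tag for emphasis are assumptions made for the example.

```python
import html
import re


def highlight(text, query_words):
    """Return HTML in which whole-word matches of the query words are bolded."""
    if not query_words:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in query_words) + r")\b",
        re.IGNORECASE,
    )
    parts = pattern.split(text)
    # With one capturing group, odd-indexed parts are the matched words.
    return "".join(
        f"<b>{html.escape(part)}</b>" if i % 2 else html.escape(part)
        for i, part in enumerate(parts)
    )


marked = highlight("Index Server & IIS work together", ["server", "iis"])
```

Escaping each piece separately keeps characters such as the ampersand safe for display while leaving the inserted emphasis tags intact.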
Configuring and Using Index Server
The main tasks of an administrator deploying Index Server include indexing documents in the corpus, monitoring its progress and performance, and providing a user interface to pose queries and retrieve matching documents. This section briefly discusses the components and features that let you do that.
Easy configuration and administration is one of the major design goals of Index Server. From the time you successfully install Index Server, you need to wait only a few minutes before you can issue your first query. After verifying that all is well, you can create new catalogs (Index Server's term for the index and related files) with a few keystrokes and mouse clicks using the Index Server administrative snap-in to the Microsoft Management Console (MMC). You can perform a variety of administrative tasks through the console. In addition, you can administer and monitor status using an Internet browser, view events using the Windows NT event viewer, and graphically monitor status and performance using the Windows NT performance monitor. And finally, if you want to log the traffic to your Index Server, use the logging mechanism provided by IIS. Standard IIS logging records query information such as the querying IP address and the queries posted to the server.
Starting with Index Server version 2.0, you can use Active Server Pages (ASP), or .asp files, to capture the power and flexibility of ActiveX scripting. Queries created with ASP allow you to leverage scripting languages such as Microsoft Visual Basic® Scripting Edition (VBScript) and Microsoft JScript™ to add flexibility in displaying query results. Index Server also provides an automation object that can be programmed with client-side scripting or through Microsoft Visual Basic. Client-side and server-side scripting provide great programmability and flexibility in building user interfaces. The disadvantage, particularly with server-side scripting, is that each query incurs additional overhead. It is the recommended path for installations that need a cool user interface and are willing to pay that price at the server end.
An ISAPI component allows you to use Index Server–specific forms to compose a query and compose HTML pages with the resulting hits. This is an extremely efficient mechanism because the server incurs very little overhead. The drawback is that these forms provide limited flexibility in composing queries and hit lists. It is the recommended path for installations where server throughput is a higher priority than sophisticated and flexible user interfaces.
In addition to these programming features, version 3.0 of Index Server, slated to ship with Windows 2000, includes a programming interface to interact with the engine.
Now that you have an understanding of the features provided by Microsoft Index Server, you can analyze your search needs and determine how well it can stack up against your needs. If it does suit your needs, go ahead and deploy it! If you are not sure, go ahead and deploy it and check it out!
Subsequent articles in this series will focus on technical details concerning Index Server that will help you make the most of the product and the technology.
I would like to thank David Lee, Mohamed Namatalla, Nikhil Joshi, Sankrant Sanu, Sitaram Raju, and Kyle Peltonen for their valuable feedback.