Enterprise Search Architecture
Enterprise Search is a shared service in Microsoft Office SharePoint Server 2007 that provides extensive and extensible content gathering, indexing, and querying. The service supports full-text searching with a Structured Query Language (SQL)-based query syntax, and provides a new keyword syntax for keyword searches.
Enterprise Search uses the same underlying Search service as Search in Windows SharePoint Services.
This topic describes the internal architecture of Enterprise Search and how Enterprise Search operates as a shared service.
The following figure provides a detailed view of the Search service internal architecture.
Following are the components of the Search service's architecture.
Index Engine: Processes the chunks of text and properties filtered from content sources, storing them in the content index and property store.
Query Engine: Executes keyword and SQL syntax queries against the content index and search configuration data.
Protocol Handlers: Open content sources in their native protocols and expose documents and other items to be filtered.
IFilters: Open documents and other content source items in their native formats and filter them into chunks of text and properties.
Content Index: Stores information about words and their location in a content item.
Property Store: Stores a table of properties and associated values.
Search Configuration Data: Stores information used by the Search service, including crawl configuration, property schema, scopes, and so on.
Wordbreakers: Used by the query and index engines to break compound words and phrases into individual words or tokens.
The index engine uses a pipe of shared memory to request that the Filter Daemon begin filtering the content source. For the crawl process to succeed, the content source must have an associated protocol handler that can read its protocol. The Filter Daemon invokes the appropriate protocol handler for the content source based on the start address provided by the index engine. The Filter Daemon uses protocol handlers and IFilters to extract and filter individual items from the content source. Appropriate IFilters for each document are applied, and the Filter Daemon passes the extracted text and metadata to the index engine through the pipe.
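The handler selection described above can be pictured as a lookup keyed on the scheme of the start address. The following Python sketch is illustrative only: the registry, the handler names, and the function are hypothetical and are not part of the Search service's API.

```python
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> name of the protocol handler that reads it.
PROTOCOL_HANDLERS = {
    "http": "web handler",
    "https": "web handler",
    "file": "file share handler",
}


def select_protocol_handler(start_address: str) -> str:
    """Return the handler registered for the scheme of the start address."""
    scheme = urlparse(start_address).scheme.lower()
    if scheme not in PROTOCOL_HANDLERS:
        # Mirrors the rule in the text: without a handler, the crawl cannot succeed.
        raise LookupError(f"no protocol handler registered for '{scheme}'")
    return PROTOCOL_HANDLERS[scheme]
```

A start address whose scheme has no registered handler raises an error here, which corresponds to the case where a custom protocol handler would have to be supplied.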
At this point in the content crawling process, the index engine saves document properties to a property store separate from the content index. The property store consists of a table of properties and their values. Properties in this store can be retrieved and sorted. In addition, simple queries against properties are supported by the store. Each row in the table corresponds to a separate document in the full-text index. The actual text of a content item is stored in the content index, so it can be used for content queries. The property store also maintains and enforces document-level security that is gathered when a document is crawled.
Next, the index engine uses wordbreakers and stemmers to further process the text and properties extracted during the crawl. The wordbreaker component breaks the text into words and phrases. The stemming component generates inflected forms of a given word. The index engine also removes noise words and creates an inverted index for full-text searching.
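To make the idea of an inverted index concrete, here is a minimal Python sketch. It assumes whitespace-style wordbreaking and a small illustrative noise word list, and it omits stemming entirely. All names are hypothetical, and the real index engine is considerably more elaborate.

```python
from collections import defaultdict

# Illustrative noise words only; real noise word lists are language-specific.
NOISE_WORDS = {"a", "an", "and", "of", "the", "to"}


def break_words(text):
    """Whitespace-style wordbreaking: split, trim punctuation, lowercase."""
    return [word.strip(".,;:!?\"'()").lower() for word in text.split()]


def build_inverted_index(documents):
    """Map each non-noise word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(break_words(text)):
            if not word or word in NOISE_WORDS:
                continue  # noise words are not indexed
            index[word].setdefault(doc_id, []).append(position)
    return index
```

Storing positions alongside document identifiers reflects the description of the content index as holding words and their location in a content item.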
Search Query Execution
When a search query is executed, the query engine passes the query through a language-specific wordbreaker. If there is no wordbreaker for the query language, the neutral wordbreaker is used. The neutral wordbreaker performs whitespace-style wordbreaking, which means that it breaks words and phrases wherever white space occurs. After wordbreaking, the resulting words are passed through a stemmer to generate language-specific inflected forms of each word. Using a wordbreaker and stemmer in both the crawling and query processes makes search more effective, because more relevant alternatives to the user's query phrasing are generated.
When the query engine executes a property value query, the index is checked first to get a list of possible matches. The properties for the matching documents are then loaded from the property store, and the properties in the query are checked again to confirm the match.
The result of the query is a list of all matching results, ordered by their relevance to the query words. If the user does not have permission to a matching document, the query engine filters that document out of the list that is returned.
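The recheck against the property store and the permission filtering described above can be sketched as follows. This is a simplified Python illustration under stated assumptions: the shape of the index hits, the layout of the property store rows, and the `allowed_users` field are all hypothetical.

```python
def execute_property_query(index_hits, property_store, prop, value, user):
    """Confirm candidate hits against stored properties, then security-trim.

    index_hits:     {doc_id: relevance score} of possible matches from the index
    property_store: {doc_id: {"properties": {...}, "allowed_users": set(...)}}
    """
    results = []
    for doc_id, score in index_hits.items():
        row = property_store[doc_id]
        if row["properties"].get(prop) != value:
            continue  # the index hit was only a possible match
        if user not in row["allowed_users"]:
            continue  # the user lacks permission, so the item is filtered out
        results.append((doc_id, score))
    # Return document identifiers, most relevant first.
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked]
```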
Search as a Shared Service
A shared service is a high-value application that is consumed by other applications. In the Office SharePoint Server 2007 logical architecture, a Shared Services Provider (SSP) is a grouping of shared services and related shared resources. A server farm administrator creates and configures an SSP to host shared services so that they are available to multiple portal sites within a farm, and then assigns the SSP to a portal site. A farm can contain multiple SSPs, but a portal site can be associated with only one SSP. An SSP can have only one instance of a particular shared service.
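The cardinality rules in the preceding paragraph can be summarized in a small model. The Python class below is a toy illustration with hypothetical names. It is not a SharePoint API and exists only to show which associations are one-to-one and which are one-to-many.

```python
class Farm:
    """Toy model of the SSP rules described above; not a SharePoint API."""

    def __init__(self):
        self.ssps = {}         # SSP name -> set of shared service names it hosts
        self.site_to_ssp = {}  # portal site URL -> the single SSP it uses

    def create_ssp(self, name):
        # A farm can contain multiple SSPs.
        self.ssps.setdefault(name, set())

    def add_service(self, ssp_name, service):
        # An SSP can have only one instance of a particular shared service.
        if service in self.ssps[ssp_name]:
            raise ValueError(f"{ssp_name} already hosts {service}")
        self.ssps[ssp_name].add(service)

    def assign_site(self, site_url, ssp_name):
        # A site uses exactly one SSP; assigning again replaces the old one.
        if ssp_name not in self.ssps:
            raise KeyError(ssp_name)
        self.site_to_ssp[site_url] = ssp_name
```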
Enterprise Search Manageability
In SharePoint Portal Server 2003, you managed crawl configuration and content indexes separately for each portal site. In Enterprise Search, you manage all of this at the SSP level, with one content index and one property store per SSP. This prevents redundant indexing and centralizes the administration of resource-intensive operations such as index management, which makes Enterprise Search easier to manage.
Some search settings are still configurable at the site collection level; for more information, see the Site Level Search Manageability section of this topic.
The next sections provide a brief overview of the different parts of Enterprise Search shared service in Office SharePoint Server 2007.
A content source is a collection of start addresses representing content that should be crawled by the search index component. A content source also specifies settings that define the crawl behavior and the schedule on which the content will be crawled.
Enterprise Search provides several types of content sources by default, so it is easy to configure crawls to different types of data, both internal and external. Following are the content source types included in Enterprise Search:
File share content
Exchange folder content
Business data content
If you need to include other types of content, you can create a custom content source and protocol handler for Enterprise Search.
A Lotus Notes content source is also available, but it is not configured by default.
For more information about content sources, see Content Sources Overview.
A search scope provides a way to group content items together, based on a common element among the items within that search scope. This helps users make their searches more relevant by allowing them to focus their search on a subset of content in the index, instead of searching the full index. A scope plays an important role in the ability of Enterprise Search to support diverse search experiences from one content index. After you create a search scope, you define the content to include in that search scope by adding scope rules, specifying whether to include or exclude content that matches that particular rule. You can define scope rules based on the following:
You can create and define search scopes at the SSP level or at the individual site collection level. SSP level search scopes are called shared scopes, and are available to all the sites configured to use a particular SSP.
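As a rough illustration of how include and exclude rules might combine, consider the following Python sketch. The rule representation and the combination semantics (an item is in scope if at least one include rule matches and no exclude rule matches) are assumptions made for this example, not a description of the product's actual evaluation logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScopeRule:
    matches: Callable[[dict], bool]  # hypothetical predicate over an item's metadata
    include: bool                    # True: include matching items; False: exclude them


def in_scope(item, rules):
    """Assumed semantics: some include rule matches and no exclude rule matches."""
    included = any(rule.include and rule.matches(item) for rule in rules)
    excluded = any(not rule.include and rule.matches(item) for rule in rules)
    return included and not excluded
```

For example, a scope could include every item whose address starts with a given site URL while excluding items with a particular file extension.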
For more information about search scopes, see Working with Search Scopes.
Document Property Mappings
The Enterprise Search schema consists of two types of properties, crawled properties and managed properties, as well as the mappings between the two sets of properties.
The index engine extracts crawled properties from content items when crawling content. These properties are grouped into different property categories based on the protocol handler and IFilter used. For example, crawled properties from content in the Business Data Catalog are grouped in the Business Data category; crawled properties from 2007 Microsoft Office system content are grouped in the Office category.
Managed properties are the set of properties that are part of the search user experience. To include a crawled property value in search functionality, you must map the crawled property to a managed property in the Document property mappings. Managed properties are created and managed at the SSP level. For more information, see Managing Metadata.
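The relationship between crawled and managed properties can be illustrated with a small Python sketch. The property names and the "first mapped crawled property with a value wins" behavior are assumptions chosen for the example. They are not taken from the actual schema.

```python
# Hypothetical mappings: managed property -> crawled properties, in priority order.
PROPERTY_MAPPINGS = {
    "Author": ["Office:Author", "Mail:From"],
    "Title": ["Office:Title"],
}


def to_managed_properties(crawled):
    """Keep only crawled values that are mapped to a managed property."""
    managed = {}
    for managed_name, crawled_names in PROPERTY_MAPPINGS.items():
        for name in crawled_names:
            if name in crawled:
                managed[managed_name] = crawled[name]
                break  # assumed: the first mapped crawled property with a value wins
    return managed
```

In this sketch an unmapped crawled property is simply dropped, which corresponds to the statement that a crawled property must be mapped before its value can take part in search functionality.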
Server name mappings are crawl settings you can configure to override how search results are displayed or accessed after content has been included in the index. For example, you can configure a content source to crawl a Web site via a file share path, and then create a server name mapping entry to map the file share to the Web site's URL.
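The file share example above amounts to a prefix rewrite applied to addresses in search results. The following Python sketch shows the idea. The server names, the mapping table, and the function are hypothetical.

```python
# Hypothetical mapping: crawled address prefix -> address prefix shown in results.
SERVER_NAME_MAPPINGS = {
    "file://fileserver/website/": "http://www.example.com/",
}


def map_result_url(crawled_url):
    """Rewrite a crawled address to its display address, if a mapping applies."""
    for crawled_prefix, display_prefix in SERVER_NAME_MAPPINGS.items():
        if crawled_url.startswith(crawled_prefix):
            return display_prefix + crawled_url[len(crawled_prefix):]
    return crawled_url  # no mapping: display the address as crawled
```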
Relevance settings affect how relevance rankings for items are calculated, which in turn affects the order in which search results appear in a search results list. Improving relevance for search results is a major focus for this release. Enterprise Search includes an updated ranking engine, specifically tuned for searching enterprise content and line-of-business (LOB) application data.
The following are included in the updated relevance calculation:
Hyperlink anchor text
URL surf depth
URL text matching
Automated metadata extraction
Automatic language detection
File type relevancy biasing
Enhanced text analysis
For more information about Enterprise Search relevance, see Improving Relevance.
File Type Inclusions
The file type inclusions list specifies the file types that the crawler should include in or exclude from the index. For more information, see the File Type section in Defining Crawl Rules and File Types.
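Conceptually, the file type inclusions list acts as a filter on the extension of each item's address. A minimal Python sketch follows. The extensions listed are illustrative and do not represent the product's default list.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative inclusion list; not the product's default list.
INCLUDED_FILE_TYPES = {"doc", "htm", "html", "txt", "xls"}


def is_included_file_type(url):
    """Return True if the extension of the item's address is on the inclusion list."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in INCLUDED_FILE_TYPES
```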
The information tracked in the query log includes:
The query terms being used.
Whether search results were returned for search queries.
Pages that were viewed from search results.
This search usage data is beneficial in understanding how people are using search and what information they are seeking. You can use this data to help determine how to improve the search experience for users.
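One way to use this kind of data is to look for queries that repeatedly return no results. The Python sketch below assumes a hypothetical export of query log entries as dictionaries with `terms` and `result_count` fields. The actual query log format is not described in this topic.

```python
from collections import Counter


def zero_result_queries(query_log):
    """Count, per query text, how often a search returned no results."""
    return Counter(
        entry["terms"].lower()
        for entry in query_log
        if entry["result_count"] == 0
    )
```

Calling `most_common()` on the returned counter lists the most frequent zero-result queries first, which can point to missing content or to terms that might deserve a keyword with a best bet.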
The crawl log tracks information about the status of crawled content, and contains the current status of every item in the content index. You can browse and filter the entries in the crawl log to see errors, warnings, and so on to help you track whether content was added successfully to the index. For more information, see Working with the Crawl Log.
Site Level Search Manageability
While most of the search experience is managed at the SSP level, there are some items that are available at the site level, as follows:
Search scopes
Keywords and best bets
Settings at the site level provide a site administrator with the ability to configure the search user experience without adversely affecting the search experience of other sites configured to use the same SSP.
As described earlier, a search scope is a collection of items grouped together based on a common element among the items within that scope. Search scopes help users broaden or narrow their searches. Search scopes available at the SSP level are called shared scopes. Search scopes are also available at the site level. Search scopes created at the site level are visible only to the site they were created in and to subsites within the top-level site.
When managing search scopes at the site level, you can create and configure scope display groups. Display groups organize groups of search scopes by how they appear on the site. For example, if an SSP administrator had created a shared scope at the SSP level, and you wanted to display this shared scope in the scopes drop-down list for the Search Box Web Part, you would add the new shared scope to the Search Dropdown display group for the site. For more information on how to do this, see How to: Display a Search Scope in the Search Box and Advanced Search Web Parts.
Keywords and Best Bets
Keywords are words or phrases that site administrators have identified as important. They provide a way to display additional information and recommended links on the initial results page that may not otherwise appear in the search results for a particular word or phrase. For more information, see Managing Keywords.
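At its simplest, a keyword with best bets behaves like a lookup from a query phrase to a list of recommended links. The Python sketch below assumes exact, case-insensitive matching of the whole query. The keyword table and URLs are hypothetical.

```python
# Hypothetical keyword table: keyword phrase -> recommended (best bet) links.
BEST_BETS = {
    "benefits": ["http://hr/benefits/overview.aspx"],
    "expense report": ["http://finance/forms/expense.aspx"],
}


def best_bets_for(query):
    """Return the recommended links for a query that matches a keyword exactly."""
    return BEST_BETS.get(query.strip().lower(), [])
```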