Using the Crawl Scope Manager

The Crawl Scope Manager (CSM) is a set of interfaces that provide methods to inform the Windows Search engine about containers to crawl and/or watch, and items under those containers to include in or exclude from the catalog. Administrators can view all users' indexes, search roots, and scope rules using the CSM.

While you can use the CSM's APIs to define a crawl scope programmatically for new protocol handlers or containers, the CSM was designed with user interfaces in mind. For example, let's say you have developed a protocol handler for a new data store, and you want to let users or administrators include/exclude particular paths within the data store for indexing. You can use the Crawl Scope Manager to set one or more search roots (URLs to particular containers, e.g., file:///C:\MyContainer\) for your data store, and the user interface will display the search root(s) with a check box. Users can then include or exclude that path or the children of that path. These inclusions and exclusions are the search scope rules that the Indexer uses to determine what to index.

This topic includes the following subjects:

  • Overview of the Crawl Scope Manager
  • Group Policies Supported by the Crawl Scope Manager
  • Managing Search Roots
  • Managing Scope Rules
  • Related Topics

Overview of the Crawl Scope Manager

It would be helpful to begin with a few definitions. A crawl scope is a set of URLs pointing to containers (email data stores, databases, network file shares and so on) that the Indexer crawls in order to index items for the catalog. A root, or search root, is the top-level URL for a container. For example, suppose you want to identify a folder, WorkteamA\ProjectFiles. The search root for that might look like this: file:///C:\WorkteamA\ProjectFiles\.

A scope rule is a rule that includes or excludes URLs within a search root from being crawled and indexed. For example, suppose you want everything within the ProjectFiles folder indexed except for the subfolder Prototypes. You would need an include rule for file:///C:\WorkteamA\ProjectFiles\ and an exclude rule for file:///C:\WorkteamA\ProjectFiles\Prototypes\.
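The include/exclude example above can be sketched as a small, portable C++ model. This is only an illustration, not the Indexer's implementation: the ScopeRule struct and the IsIncluded and DemoProjectFilesRules functions are hypothetical names, and the sketch assumes that the most specific (longest) matching URL prefix decides the outcome and that URLs matching no rule are not indexed. Matching here is case-sensitive for simplicity.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a scope rule: a URL prefix plus an include/exclude flag.
struct ScopeRule {
    std::string urlPrefix;  // e.g. "file:///C:\\WorkteamA\\ProjectFiles\\"
    bool include;           // true = inclusion rule, false = exclusion rule
};

// Returns true if `url` is covered by an inclusion rule. Assumption of this
// sketch: the longest matching prefix wins, and unmatched URLs are excluded.
bool IsIncluded(const std::vector<ScopeRule>& rules, const std::string& url) {
    const ScopeRule* best = nullptr;
    for (const ScopeRule& rule : rules) {
        const bool matches =
            url.compare(0, rule.urlPrefix.size(), rule.urlPrefix) == 0;
        if (matches &&
            (best == nullptr || rule.urlPrefix.size() > best->urlPrefix.size())) {
            best = &rule;
        }
    }
    return best != nullptr && best->include;
}

// Usage: the ProjectFiles / Prototypes rules described in the text.
void DemoProjectFilesRules() {
    const std::vector<ScopeRule> rules = {
        {"file:///C:\\WorkteamA\\ProjectFiles\\", true},
        {"file:///C:\\WorkteamA\\ProjectFiles\\Prototypes\\", false},
    };
    assert(IsIncluded(rules, "file:///C:\\WorkteamA\\ProjectFiles\\spec.docx"));
    assert(!IsIncluded(rules, "file:///C:\\WorkteamA\\ProjectFiles\\Prototypes\\v1.cad"));
}
```

In a real application these decisions are made by the Indexer itself; the sketch only shows why both an include rule and a narrower exclude rule are needed.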

The Crawl Scope Manager (CSM) is a set of APIs that lets you add, remove, and enumerate search roots and scope rules for the Windows Search Indexer. When you want the Indexer to begin crawling a new container, you can use the CSM to set the search root(s) and scope rules for paths within the search root(s). For example, if you install a new protocol handler, you could create a search root and add at least one inclusion rule, after which the Indexer can start a crawl for the initial indexing. The CSM exposes this functionality programmatically through the ISearchCrawlScopeManager, ISearchRoot, and ISearchScopeRule interfaces, along with the IEnumSearchRoots and IEnumSearchScopeRules enumerators.

Group Policies Supported by the Crawl Scope Manager

System administrators can define crawl scopes across their organizations using Group Policies. These group policy rules can also act as default rules which users can override. For example, you can have one set of directories indexed for one group of users and a different set for another group of users, allowing the users to deselect these defaults. Group policy rules can also act as forced exclusion rules which users cannot override, preventing certain users from indexing certain network shares, for example.

Group policy rules are available only with Windows Search 3.01 or later. For more information on Group Policy, refer to the Windows Search 3.01 Administrator's Guide.

Managing Search Roots

A search root identifies the top-level URL for a container, or content store, that can be indexed by the Indexer. It doesn't specify which parts of this store should or shouldn't be indexed; it merely signals that a content store exists and is associated with a registered protocol handler. The syntax of a search root includes a protocol, a site or user security identifier, and a path to the location(s) to be crawled, and is described in greater detail in Managing Search Roots.

Search roots can identify locations that are specific to a user, that are on a remote machine, or that match a wildcard pattern. Roots specific to a user must include the user's security identifier (SID), while roots for a remote machine include the machine's name. Roots can also include the wildcard character '*', as in "file:///C:\ProjectA\ProjectFiles\*.myext" or "file:///C:\ProjectA\*\Data\".
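To make the wildcard patterns concrete, here is a minimal sketch of how a '*' pattern could be compared against a URL. The MatchesPattern and DemoWildcardRoots functions are hypothetical, and the sketch assumes that '*' matches any run of characters, including path separators; the text does not specify the Indexer's actual matching semantics, which may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `url` matches `pattern`, where '*' in the pattern stands for
// any (possibly empty) run of characters. Assumption: '*' may span separators.
bool MatchesPattern(const std::string& pattern, const std::string& url) {
    std::size_t p = 0, u = 0;
    std::size_t starPos = std::string::npos;  // position of the last '*' seen
    std::size_t resume = 0;                   // where in `url` to retry from
    while (u < url.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            starPos = p++;          // remember the star, try matching zero chars
            resume = u;
        } else if (p < pattern.size() && pattern[p] == url[u]) {
            ++p;
            ++u;
        } else if (starPos != std::string::npos) {
            p = starPos + 1;        // backtrack: let the star absorb one more char
            u = ++resume;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;  // trailing stars
    return p == pattern.size();
}

// Usage: the two wildcard roots quoted in the text.
void DemoWildcardRoots() {
    assert(MatchesPattern("file:///C:\\ProjectA\\ProjectFiles\\*.myext",
                          "file:///C:\\ProjectA\\ProjectFiles\\notes.myext"));
    assert(MatchesPattern("file:///C:\\ProjectA\\*\\Data\\",
                          "file:///C:\\ProjectA\\Team1\\Data\\"));
    assert(!MatchesPattern("file:///C:\\ProjectA\\ProjectFiles\\*.myext",
                           "file:///C:\\ProjectA\\ProjectFiles\\notes.txt"));
}
```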

You should create new search roots for containers not already in the Indexer's crawl scope. For example, if path C:\ParentScope is already included in the Indexer's crawl scope, then you do not need to add a new search root for C:\ParentScope\ChildScope unless you know that the child scope had been previously excluded.

The Windows Search Options user interface, or any other client user interface, displays search roots to users so they can refine the scope rules for their searches. As part of the installation process for a custom protocol handler, container, and/or application, you might define a default crawl scope with inclusion and exclusion rules. These roots and rules may be presented to end users as locations and check boxes in the Windows Search Options dialog. Users can navigate within the subdirectories of your pre-defined search root and check the paths they want to include in searches or uncheck the ones they want to exclude.

Refer to Managing Search Roots for instructions on adding, removing, and enumerating search roots.

Managing Scope Rules

A scope rule is a rule that includes or excludes URLs within a search root from being crawled and indexed. You could, for example, create a set of rules that includes C:\ParentScope\ and all subfolders except C:\ParentScope\ChildScope. An inclusion rule dictates that the Indexer include a URL in the crawl scope, and an exclusion rule dictates that the Indexer exclude a URL from the crawl scope.

There are three types of rules, taking the following order of precedence:

  • Group Policy Rules are set by administrators and can override all other rules.
  • User Rules are set by a user modifying the scope in the Search Options dialog or by another application managing the scope. Users or other applications can also remove all user-set rules and revert to default rules.
  • Default Rules are typically set by an application to define a default crawl scope. For example, default rules might be set when a new protocol handler or container is added to the system.

Together, these types of rules make up the working rule set from which the CSM generates the full list of URLs to crawl. The Indexer then crawls these URLs and adds items, properties, and content to the catalog. When you add a new scope rule and the Crawl Scope Manager determines that the rule already exists (based on the URL or the pattern provided), the old rule is replaced by the new rule and any user-set rules that contradict it are removed.
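The precedence and replacement behavior described above can be modeled in a few lines of portable C++. This is a sketch under stated assumptions rather than the CSM's actual logic: RuleSource, Rule, WorkingRuleSet, and DemoPrecedence are hypothetical names; the sketch assumes that rule type is compared first and prefix length second, treats every Group Policy rule as a forced rule, and reads "contradicting user rules" as user rules for the same URL with the opposite include flag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Rule types in descending precedence; a smaller value wins.
enum class RuleSource { GroupPolicy = 0, User = 1, Default = 2 };

struct Rule {
    std::string urlPrefix;
    bool include;
    RuleSource source;
};

class WorkingRuleSet {
public:
    // Adds a rule. An existing rule of the same type for the same URL is
    // replaced, and a user rule for that URL with the opposite flag is removed.
    void Add(const Rule& added) {
        rules_.erase(std::remove_if(rules_.begin(), rules_.end(),
            [&added](const Rule& r) {
                if (r.urlPrefix != added.urlPrefix) return false;
                const bool sameType = r.source == added.source;
                const bool contradictingUserRule =
                    r.source == RuleSource::User && r.include != added.include;
                return sameType || contradictingUserRule;
            }), rules_.end());
        rules_.push_back(added);
    }

    // Drops all user-set rules, leaving Group Policy and default rules.
    void RevertToDefaults() {
        rules_.erase(std::remove_if(rules_.begin(), rules_.end(),
            [](const Rule& r) { return r.source == RuleSource::User; }),
            rules_.end());
    }

    // Picks the matching rule with the highest precedence, breaking ties by
    // the longest prefix. URLs that match no rule are treated as excluded.
    bool IsIncluded(const std::string& url) const {
        const Rule* best = nullptr;
        for (const Rule& r : rules_) {
            if (url.compare(0, r.urlPrefix.size(), r.urlPrefix) != 0) continue;
            if (best == nullptr || Outranks(r, *best)) best = &r;
        }
        return best != nullptr && best->include;
    }

    std::size_t Size() const { return rules_.size(); }

private:
    static bool Outranks(const Rule& a, const Rule& b) {
        if (a.source != b.source) return a.source < b.source;
        return a.urlPrefix.size() > b.urlPrefix.size();
    }
    std::vector<Rule> rules_;
};

// Usage: a default include for ParentScope with a user exclusion beneath it.
void DemoPrecedence() {
    WorkingRuleSet rules;
    rules.Add({"file:///C:\\ParentScope\\", true, RuleSource::Default});
    rules.Add({"file:///C:\\ParentScope\\ChildScope\\", false, RuleSource::User});
    assert(rules.IsIncluded("file:///C:\\ParentScope\\a.txt"));
    assert(!rules.IsIncluded("file:///C:\\ParentScope\\ChildScope\\b.txt"));
    rules.RevertToDefaults();  // the user exclusion disappears
    assert(rules.IsIncluded("file:///C:\\ParentScope\\ChildScope\\b.txt"));
}
```

Group Policy rules that act only as overridable defaults are not modeled here; in this sketch they would be entered as Default rules.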

Note  Users with access to the Control Panel can modify the working rule set through that interface. Therefore, applications offering scope management should always get the rules directly from the CSM using the enumeration methods instead of relying on their own saved copies of the user's rules.

Refer to Managing Scope Rules for instructions on adding, removing, reverting, and enumerating scope rules.