1.3.2 Site Indexing Process

10/05/2021

This protocol is designed to support a client application (an indexing service) that performs a full or incremental scan of site content by following the recommended site indexing process shown in the following diagram. This process is described in detail in section 3.1.4.

Figure 2: Site indexing process

The site indexing process assumes that the indexing service is configured with the URL of the site to index, and that the site supports this protocol.

First, the indexing service establishes the indexing context. This requires identifying the content database used by the site to track changes, and obtaining a token (referred to as the StartToken) to be referred to in the incremental indexing stage. Details of context establishment are shown in the first figure in section 3.1.4.

Note that the timestamp-like tokens (formally called ChangeIds) used in this protocol create serialized entries in the server's change log and are opaque to client applications. The protocol client cannot interpret these tokens as moments in time.

After establishing context, the indexing service proceeds with a full traversal of the site. Traversal requires identifying all subsites, for each subsite identifying its lists and document libraries, and then scanning all list items and documents to peruse their content. The operations for this process are described in detail in section 3.1.4.
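The traversal order described above can be sketched as a depth-first walk. This is a minimal illustration over an in-memory model, not the protocol's wire operations: the `Site` and `SiteList` classes and the `traverse` function are hypothetical names introduced here, and a real indexing service would obtain each level through the operations described in section 3.1.4.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SiteList:
    """Hypothetical stand-in for a list or document library."""
    title: str
    items: List[str] = field(default_factory=list)


@dataclass
class Site:
    """Hypothetical stand-in for a site and its subsites."""
    url: str
    lists: List[SiteList] = field(default_factory=list)
    subsites: List["Site"] = field(default_factory=list)


def traverse(site: Site) -> Iterator[Tuple[str, str, str]]:
    """Yield (site URL, list title, item) for every item, depth first."""
    # Scan the lists and document libraries of this site first.
    for site_list in site.lists:
        for item in site_list.items:
            yield (site.url, site_list.title, item)
    # Then descend into each subsite and repeat.
    for subsite in site.subsites:
        yield from traverse(subsite)


child = Site("http://example/sub", [SiteList("Docs", ["spec.docx"])])
root = Site("http://example", [SiteList("Tasks", ["t1", "t2"])], [child])
visited = list(traverse(root))
```

A generator keeps the walk lazy, so the indexing service can process each item as it is reached instead of materializing the whole site first.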

Note that a site can store the documents in a variety of proprietary formats, as well as in many international languages. This protocol does not help the indexing service understand or interpret those formats and languages; it just allows the indexing service to retrieve them.

When the indexing service finishes traversing the site content, it reaches the "Full index more or less up to date; Sleep for a while" state indicated in Figure 2. The "more or less" qualifier reflects that the site content might change before the indexing service completes its full traversal. In that case the full index can already be out of date with respect to the content, but the basis for incremental change tracking has been established.

The indexing service then switches to incremental change requests. It uses a token to query the protocol server for all changes since full indexing started. The protocol server returns a change report, starting with the oldest changes since the point indicated by the request token. To accommodate large numbers of changes, the report contains a "next change" token, which indicates that all changes from the point marked by the request token to the point marked by the returned token are included in the report, though there might be additional changes after that. During the next incremental indexing pass, the indexing service can use the "next change" token as the "since" token in its request.
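The token hand-off between successive change reports can be sketched as a simple loop. Everything named here is an assumption made for illustration: `FakeChangeLog`, `ChangeReport`, `get_changes`, and the `more_changes` flag are hypothetical, and the tokens are integers only because the fake server needs something to index with. A real client treats tokens as opaque values and passes them back unchanged.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChangeReport:
    changes: List[str]
    next_token: int      # opaque to a real client; an int only in this fake
    more_changes: bool   # assumed flag: True if further changes remain


class FakeChangeLog:
    """In-memory stand-in for the protocol server's change log."""

    def __init__(self, entries: List[str], page_size: int = 2):
        self.entries = list(entries)
        self.page_size = page_size

    def get_changes(self, since_token: int) -> ChangeReport:
        # Return the oldest changes first, at most page_size per report.
        page = self.entries[since_token:since_token + self.page_size]
        next_token = since_token + len(page)
        return ChangeReport(page, next_token, next_token < len(self.entries))


def drain_changes(server: FakeChangeLog, since_token: int) -> Tuple[List[str], int]:
    """Request reports until none remain; return the changes and the token to reuse."""
    collected: List[str] = []
    token = since_token
    while True:
        report = server.get_changes(token)
        collected.extend(report.changes)
        token = report.next_token  # becomes the "since" token of the next request
        if not report.more_changes:
            return collected, token


log = FakeChangeLog(["add A", "edit A", "add B", "delete A", "edit B"])
first_changes, saved_token = drain_changes(log, 0)
log.entries.append("add C")
later_changes, saved_token_2 = drain_changes(log, saved_token)
```

The second call picks up only the change recorded after the first pass, because the saved "next change" token marks where the previous report ended.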

While this protocol does not prescribe the frequency of change requests, they need to be issued often enough to keep pace with changes to the site content. Because there is a limit to the amount of time the protocol server can maintain change tracking records, it is possible for the indexing service to fall so far behind that it receives a "token is too old, and change records are lost" response from the protocol server. This response causes the indexing service to abandon the incremental indexing process and restart full traversal anew. Details of this incremental indexing process are shown in the third figure in section 3.1.4.
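The fallback from incremental indexing to a fresh full traversal can be sketched as follows. This is an illustration under stated assumptions, not the protocol's actual error handling: `TokenTooOldError`, `FakeServer`, `current_token`, and `incremental_pass` are hypothetical names, and the integer tokens exist only so the fake server can model a retention limit.

```python
from typing import Callable, List, Tuple


class TokenTooOldError(Exception):
    """Hypothetical stand-in for the 'token is too old' response."""


class FakeServer:
    """Keeps change records only for tokens at or after oldest_retained."""

    def __init__(self, oldest_retained: int, latest: int):
        self.oldest_retained = oldest_retained
        self.latest = latest

    def current_token(self) -> int:
        return self.latest

    def get_changes(self, since_token: int) -> Tuple[List[int], int]:
        if since_token < self.oldest_retained:
            raise TokenTooOldError(since_token)
        return list(range(since_token, self.latest)), self.latest


def incremental_pass(
    server: FakeServer, token: int, full_traversal: Callable[[], None]
) -> Tuple[str, List[int], int]:
    """Try an incremental request; restart full traversal if the token expired."""
    try:
        changes, next_token = server.get_changes(token)
        return "incremental", changes, next_token
    except TokenTooOldError:
        # Re-establish context before traversing, so that changes made
        # during the traversal are picked up by later incremental requests.
        start_token = server.current_token()
        full_traversal()
        return "full", [], start_token


server = FakeServer(oldest_retained=10, latest=15)
full_runs: List[str] = []
fresh = incremental_pass(server, 12, lambda: full_runs.append("full"))
stale = incremental_pass(server, 3, lambda: full_runs.append("full"))
```

Obtaining the new start token before the traversal mirrors the order described earlier in this section, where context is established first and the traversal follows.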
