3.1.1.3 Content Source

The crawler application contains a collection of zero or more content source objects. Each object represents a content source that is used to start a crawl on the index server.

crawlCompleted: The timestamp of when the most recent crawl was completed for this content source.

crawlPriority: Priority of crawl processing for this content source. It MUST be one of the following values.

Value 1: Normal.

Value 2: High. When picking the next URL to crawl from the crawl queue, the protocol server MUST give preferential consideration to every URL discovered from crawling high priority content sources over any URL discovered from crawling normal priority content sources.
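The prioritization rule above lends itself to a two-queue implementation. The following is a minimal, non-normative sketch (the class and method names are assumptions, not defined by this protocol) of a crawl queue that always serves URLs from high priority content sources first.

    from collections import deque

    NORMAL_PRIORITY = 1
    HIGH_PRIORITY = 2

    class CrawlQueue:
        """Hypothetical crawl queue that prefers URLs from high priority content sources."""

        def __init__(self):
            self._queues = {HIGH_PRIORITY: deque(), NORMAL_PRIORITY: deque()}

        def enqueue(self, url, crawl_priority):
            # crawlPriority MUST be 1 (normal) or 2 (high).
            self._queues[crawl_priority].append(url)

        def next_url(self):
            # Every high priority URL is considered before any normal priority URL.
            for priority in (HIGH_PRIORITY, NORMAL_PRIORITY):
                if self._queues[priority]:
                    return self._queues[priority].popleft()
            return None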

crawlStarted: The timestamp of when the most recent crawl was started for this content source.

crawlStatus: Identifies whether a crawl for this content source is idle, paused, stopped, or running. Also identifies the type of crawl, either full crawl or incremental crawl.

deleteCount: The number of items deleted during the most recent crawl.

errorCount: The number of items that the crawler attempted, but failed, to crawl during the most recent crawl.

followDirectories: If set to "true", only links provided by the repository being crawled are followed during the crawl, and links discovered within items are discarded. If set to "false", only links discovered within items are followed.
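A brief sketch of the branching this flag implies, using hypothetical names; the two link sets would come from the repository's own enumeration and from parsing the items themselves.

    def links_to_follow(repository_links, item_links, follow_directories):
        """Hypothetical link selection controlled by followDirectories."""
        if follow_directories:
            # "true": follow only links provided by the repository; discard links found in items.
            return repository_links
        # "false": follow only links discovered within the items themselves.
        return item_links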

fullCrawlTrigger: Defines the full crawl schedule. The crawl can be started either by explicit request from the protocol client or automatically, at the points in time specified by the schedule.

id: The unique identifier of the content source in the collection. Assigned by the protocol server when a new content source is added.

incCrawlTrigger: Defines the incremental crawl schedule.
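Purely as an illustration of how such triggers might be evaluated (the trigger format itself is not described here, and the ScheduleTrigger shape below is an assumption), a crawler could check whether a scheduled crawl is due as follows.

    from datetime import datetime, timedelta

    class ScheduleTrigger:
        """Hypothetical recurring trigger: a start time plus a repeat interval."""

        def __init__(self, start, interval_minutes):
            self.start = start
            self.interval = timedelta(minutes=interval_minutes)

        def is_due(self, last_crawl_started, now=None):
            now = now or datetime.utcnow()
            if now < self.start:
                return False
            # Due when no crawl has started yet, or the interval has elapsed since the last start.
            return last_crawl_started is None or now - last_crawl_started >= self.interval

    # Example: weekly full crawls, hourly incremental crawls.
    full_crawl_trigger = ScheduleTrigger(datetime(2024, 1, 1), interval_minutes=7 * 24 * 60)
    inc_crawl_trigger = ScheduleTrigger(datetime(2024, 1, 1), interval_minutes=60)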

levelHighErrorCount: The number of important items crawled with error during the most recent crawl.

metadata: The arbitrary metadata associated by the protocol client with the content source. The value of this property is ignored by the protocol server, but can be interpreted by the protocol client to associate arbitrary metadata with the collection of content sources.

name: The content source name. This is the label intended to be read in user interfaces, for example by search administrators.

pageDepth: The maximum permitted depth of the URL space traversal, including traversal within a single site or across different sites. Whenever a link is followed by the index server during the crawl, the depth counter is incremented. The depth counter cannot increase beyond the pageDepth of the content source. For example, if pageDepth is set to "1" and Page A links to Page B, which links to Pages C and D, then neither Page C nor Page D will be crawled, because following those links would make the depth counter exceed pageDepth.

siteDepth: The depth of the URL space traversal in terms of authority hops. This is analogous to the pageDepth property, but at the server domain level. A server domain hop is made when a link points to a URL on a different server domain. Whenever a link is followed by the index server during the crawl to a different host (or item repository server), the site depth counter is incremented. The site depth counter cannot exceed the siteDepth of the content source.
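A minimal sketch of how these two limits could be enforced together; the function and parameter names are illustrative, not part of the protocol. It also reproduces the pageDepth example above: with pageDepth set to 1, Pages C and D are never crawled.

    def should_follow(link_page_depth, link_site_hops, page_depth, site_depth):
        """Return True if a discovered link stays within both depth limits.

        link_page_depth - page depth counter after following the link
        link_site_hops  - site depth counter after following the link (incremented
                          only when the link targets a different host)
        """
        return link_page_depth <= page_depth and link_site_hops <= site_depth

    # pageDepth example: A (depth 0) links to B (depth 1), B links to C and D (depth 2).
    assert should_follow(1, 0, page_depth=1, site_depth=0)        # A -> B is followed
    assert not should_follow(2, 0, page_depth=1, site_depth=0)    # B -> C and B -> D are not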

startAddress: The start address URL. The first step of starting the crawl is to add the start address URLs to the crawl queue. The crawl then begins by following links from these start addresses.
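Using the same hypothetical CrawlQueue from the crawlPriority sketch, seeding the crawl could look like this; the content source attributes shown are assumptions for illustration.

    def start_crawl(content_source, crawl_queue):
        # Step 1: add every start address URL of the content source to the crawl queue.
        for url in content_source.start_addresses:
            crawl_queue.enqueue(url, content_source.crawl_priority)
        # The crawl then proceeds by dequeuing URLs and following the links they contain,
        # subject to the pageDepth and siteDepth limits described above.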

successCount: The number of items successfully crawled during the most recent crawl.

systemCreated: If set to "true", the content source was created during the initial system configuration and cannot be deleted by the protocol client. Any content source added by the protocol client has systemCreated set to "false".

type: The content source type. This type is used by the crawler to determine which technology to use to crawl the repository pointed to by the start addresses. MUST be one of the following values:

Value 0: Enables specifying settings that control the depth of crawl for a website based on the start address server, host hops, and page depth.

Value 1: Enables specifying settings that control the depth of crawl for a website based on either discovering everything under the host name of each start address or crawling only the site collection of each start address.

Value 2: Lotus Notes database.

Value 3: File shares.

Value 4: Exchange public folders.

Value 5: Custom.

Value 6: Legacy<2> Business Data Catalog.

Value 8: Custom search connector.

Value 9: Business Data Connectivity (BDC).
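For illustration, the numeric values above could be mapped to an enumeration as in the following sketch; only the numbers come from the list, while the member names (and the characterizations for values 0 and 1) are assumptions.

    from enum import IntEnum

    class ContentSourceType(IntEnum):
        # Member names are illustrative; only the numeric values are defined by the protocol.
        WEB_DEPTH_CONTROLLED = 0            # depth controlled by server, host hops, and page depth
        HOST_OR_SITE_COLLECTION_SCOPED = 1  # everything under the host name, or one site collection
        LOTUS_NOTES_DATABASE = 2
        FILE_SHARES = 3
        EXCHANGE_PUBLIC_FOLDERS = 4
        CUSTOM = 5
        LEGACY_BUSINESS_DATA_CATALOG = 6
        CUSTOM_SEARCH_CONNECTOR = 8
        BUSINESS_DATA_CONNECTIVITY = 9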

warningCount: The number of items crawled with warning during the most recent crawl.

wssCrawlStyle: The type of the crawl performed while crawling sites. MUST be one of the following values:

CrawlVirtualServers: The entire set of Web applications pointed to by the start addresses is crawled.

CrawlSites: Only the specific sites pointed to by the start addresses are crawled, without enumerating all sites in the Web application.
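A small, non-normative sketch of the difference between the two styles; the site and Web application objects and their methods are hypothetical.

    def sites_to_crawl(start_address_site, wss_crawl_style):
        """Hypothetical selection of sites to crawl for a single start address."""
        if wss_crawl_style == "CrawlVirtualServers":
            # Crawl every site in the Web application that hosts the start address.
            return start_address_site.web_application.all_sites()
        # "CrawlSites": crawl only the site pointed to by the start address itself.
        return [start_address_site]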

throttleBegin: This property is not interpreted by the protocol server. It can be set and retrieved by the protocol client.

throttleDuration: This property is not interpreted by the protocol server. It can be set and retrieved by the protocol client.