When an index is built, statistics are collected for use in ranking. The process of building a full-text catalog does not directly result in a single index structure. Instead, the Full-Text Engine for SQL Server creates intermediate indexes as data is indexed. The Full-Text Engine then merges these indexes into a larger index as needed. This process can be repeated many times. The Full-Text Engine then conducts a "master merge" that combines all of the intermediate indexes into one large master index.
Statistics are collected at each intermediate index level. The statistics are merged when the indexes are merged. Some statistical values can only be generated during the master merging process.
When ranking a query result set, SQL Server uses statistics from the largest intermediate index. Which index that is depends on whether the intermediate indexes have been merged. As a result, ranking statistics can vary in accuracy if the intermediate indexes have not been merged. This explains why the same query can return different rank results over time as full-text indexed data is added, modified, and deleted, and as the smaller indexes are merged.
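The effect of merging on ranking statistics can be illustrated with a small sketch. This is a toy model, not the Full-Text Engine's actual data structures: the `IntermediateIndex` class and the `merge` and `largest` helpers are hypothetical names, and the model assumes each row lives in exactly one intermediate index.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IntermediateIndex:
    """Toy stand-in for one intermediate index and its statistics."""
    indexed_row_count: int = 0
    key_row_count: Counter = field(default_factory=Counter)

    def add_row(self, text: str) -> None:
        self.indexed_row_count += 1
        # Count each distinct word once per row.
        self.key_row_count.update(set(text.lower().split()))


def merge(indexes):
    """Combine intermediate indexes; statistics are merged along with them."""
    merged = IntermediateIndex()
    for idx in indexes:
        merged.indexed_row_count += idx.indexed_row_count
        merged.key_row_count.update(idx.key_row_count)
    return merged


def largest(indexes):
    """Pick the intermediate index that holds the most rows."""
    return max(indexes, key=lambda idx: idx.indexed_row_count)


a = IntermediateIndex()
for row in ["red bicycle", "blue bicycle", "red helmet"]:
    a.add_row(row)

b = IntermediateIndex()
b.add_row("red glove")

before = largest([a, b])  # statistics available before the merge
after = merge([a, b])     # statistics available after the merge

print(before.indexed_row_count, before.key_row_count["red"])
print(after.indexed_row_count, after.key_row_count["red"])
```

In this sketch, the statistics taken from the largest unmerged index cover only part of the data, so any rank computed from them would shift once the merge completes.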
To minimize the size of the index and computational complexity, statistics are often rounded.
The list below includes some commonly used terms and statistical values that are important in calculating rank.
- Property
-
A full-text indexed column of the row.
- Document
-
The entity that is returned in queries. In SQL Server this corresponds to a row. A document can have multiple properties, just as a row can have multiple full-text indexed columns.
- Index
-
A single inverted index of one or more documents. This may be entirely in memory or on disk. Many query statistics are relative to the individual index where the match occurred.
- Full-Text Catalog
-
A collection of intermediate indexes treated as one entity for queries. Catalogs are the unit of organization visible to the SQL Server administrator.
- Word, token or item
-
The unit of matching in the Full-Text Engine. Streams of text from documents are tokenized into words, or tokens, by language-specific word breakers.
- Occurrence
-
The word offset in a document property as determined by the word breaker. The first word is at occurrence 1, the next at 2, and so on. To avoid false positives in phrase and proximity queries, end-of-sentence and end-of-paragraph boundaries introduce larger occurrence gaps.
- TermFrequency
-
The number of times the key value occurs in a row.
- IndexedRowCount
-
Total number of rows indexed. This is computed based on counts maintained in the intermediate indexes. This number can vary in accuracy.
- KeyRowCount
-
Total number of rows in the full-text catalog that contain a given key.
- MaxOccurrence
-
The largest occurrence stored in a full-text catalog for a given property in a row.
- MaxQueryRank
-
The maximum rank, 1000, returned by the Full-Text Engine.
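Several of the values defined above can be made concrete with a minimal sketch over a two-row toy table. Everything here is an assumption for illustration: the regular-expression tokenizer stands in for a real word breaker, `SENTENCE_GAP` is an arbitrary value rather than the gap SQL Server uses, each row has a single property, and the helper names are hypothetical.

```python
import re

SENTENCE_GAP = 8  # hypothetical gap size, chosen only for illustration


def occurrences(text):
    """Return (word, occurrence) pairs; the first word is at occurrence 1."""
    result = []
    position = 0
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            continue
        for word in words:
            position += 1
            result.append((word, position))
        position += SENTENCE_GAP  # widen the gap at the end of a sentence
    return result


# Each row is a "document" with one full-text indexed "property".
rows = {
    1: "Red bicycle. Red helmet.",
    2: "Blue bicycle",
}

postings = {key: occurrences(text) for key, text in rows.items()}


def term_frequency(word, row_key):
    """Number of times the word occurs in one row."""
    return sum(1 for w, _ in postings[row_key] if w == word)


def key_row_count(word):
    """Number of rows that contain the word at least once."""
    return sum(1 for occ in postings.values() if any(w == word for w, _ in occ))


indexed_row_count = len(rows)
max_occurrence = {key: max(pos for _, pos in occ) for key, occ in postings.items()}

print(postings[1])
print(term_frequency("red", 1), key_row_count("bicycle"), indexed_row_count)
print(max_occurrence)
```

In row 1, "bicycle" ends the first sentence at occurrence 2 and the following "red" lands at occurrence 11, so a phrase query for "bicycle red" would not match across the sentence boundary.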