1.1 Glossary

This document uses the following terms:

chunk: For remote differential compression (RDC), a contiguous piece of a file delimited by cut points.

collision-resistant hash function: A hash function having the property that (in practice) differing inputs do not produce the same hash (or collide).

cut points: The locations in a file where RDC has determined boundary points between blocks (or chunks). The cut points for a particular file depend on the content of the file and the parameters with which RDC is running.
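The relationship between cut points and chunks can be sketched as follows. This is an illustration only: the function name split_at_cut_points is hypothetical, and the sketch assumes that each cut point is a byte offset marking the exclusive end of a chunk and that any bytes after the last cut point form a final chunk. How RDC actually selects cut points is not shown here.

```python
from typing import List


def split_at_cut_points(data: bytes, cut_points: List[int]) -> List[bytes]:
    """Split data into chunks.

    Assumption (not taken from the specification): each cut point is the
    exclusive end offset of a chunk, and trailing bytes form a last chunk.
    """
    chunks = []
    start = 0
    for cut in sorted(cut_points):
        if not 0 < cut <= len(data):
            raise ValueError(f"cut point {cut} is outside the data")
        if cut > start:  # ignore duplicate cut points
            chunks.append(data[start:cut])
            start = cut
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Concatenating the returned chunks in order reproduces the original data, which is the property that lets a file be rebuilt from its chunks.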

hash function: A function that takes an arbitrary amount of data and produces a fixed-length result (a "hash") that depends only on the input data. A hash function is sometimes called a message digest or a digital fingerprint.

hash window: The length, in bytes, of the domain of the rolling hash function. This is the parameter n in the definition of rolling hash function.

horizon: An integer parameter of the RDC FilterMax algorithm. It refers to the number of consecutive hash values on both sides of a file offset.

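local maximum: A pair consisting of an offset i in a file and the hash value at that offset, where that hash value is maximal among the hash values within the horizon on both sides of i. Local maxima are used to find cut points by the RDC FilterMax algorithm.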
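As a rough illustration of how the horizon and local maximum terms fit together, the sketch below scans a list of per-offset hash values and reports the offsets whose value exceeds every other value within the horizon. The function name local_maxima is hypothetical. The sketch assumes strict inequality on both sides and a shortened neighborhood near the start and end of the list; the exact comparison and boundary rules are those of section 3.1.5.1 and are not reproduced here.

```python
from typing import List, Tuple


def local_maxima(hashes: List[int], horizon: int) -> List[Tuple[int, int]]:
    """Return (offset, hash value) pairs that are local maxima.

    Assumptions (not taken from the specification): strict ">" is used on
    both sides, and offsets near either end simply have fewer neighbours.
    """
    result = []
    n = len(hashes)
    for i, value in enumerate(hashes):
        lo = max(0, i - horizon)
        hi = min(n, i + horizon + 1)
        # Up to `horizon` hash values before and after offset i.
        neighbours = hashes[lo:i] + hashes[i + 1:hi]
        if all(value > other for other in neighbours):
            result.append((i, value))
    return result
```

A larger horizon makes the condition harder to satisfy, so fewer offsets qualify and the resulting chunks tend to be longer.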

MD4: Message Digest 4, as defined in [RFC1320]. MD4 is a collision-resistant, non-rolling hash function that produces a 16-byte hash. While MD4 is no longer considered to be cryptographically secure, RDC does not rely on cryptographic security in its hash function.

RDC FilterMax algorithm: The algorithm that RDC uses to determine the cut points in a file. The RDC FilterMax algorithm has the property that it will often find cut points that result in identical chunks being found in differing files, even when the files differ by insertions and deletions of bytes, not simply by length-preserving byte modifications. See section 3.1.5.1.

recursion level: For very large files, even the signature file can be large. To reduce the bandwidth required to transfer signature files, RDC can be used to reduce the number of bytes that are moved from source location to target location to transfer the signature file. This process can be repeated a number of times, producing successively smaller signature files. The recursion level is the number of times it is repeated. The first signature file is recursion level 1. The source file is referred to as being at recursion level 0.

remote differential compression (RDC): Any of a class of compression algorithms that are designed to compare two files residing on different machines without requiring one of the files to be transmitted in its entirety to the other machine. For more information, see [MS-RDC].

rolling hash function: A hash function that can be computed incrementally over a set of data. A hash function h is a rolling hash function if, given h(b1 .. bn), one can compute h(b2 .. bn+1) in time that does not depend on n.
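The incremental property can be illustrated with a simple polynomial rolling hash. This is not the hash function that RDC uses; BASE, MOD, and all function names below are hypothetical choices made for the sketch. It shows only that the hash of the next window can be derived from the previous hash, the outgoing byte, and the incoming byte in a number of steps that does not depend on the window length n.

```python
from typing import List

BASE = 257             # illustrative multiplier, not an RDC parameter
MOD = (1 << 31) - 1    # illustrative modulus, not an RDC parameter


def window_hash(window: bytes) -> int:
    """Hash a whole window from scratch; cost grows with len(window)."""
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h


def roll(prev_hash: int, outgoing: int, incoming: int, top: int) -> int:
    """Slide the window by one byte in constant time.

    `top` must equal BASE ** (n - 1) % MOD for a hash window of n bytes.
    """
    without_outgoing = (prev_hash - outgoing * top) % MOD
    return (without_outgoing * BASE + incoming) % MOD


def rolling_hashes(data: bytes, n: int) -> List[int]:
    """Return the hash of every n-byte window of data, in offset order."""
    if n <= 0 or len(data) < n:
        return []
    top = pow(BASE, n - 1, MOD)
    h = window_hash(data[:n])
    hashes = [h]
    for i in range(n, len(data)):
        h = roll(h, data[i - n], data[i], top)
        hashes.append(h)
    return hashes
```

Each call to roll performs a fixed amount of arithmetic, so hashing every window of a file costs time proportional to the file length rather than to the file length multiplied by n.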

seed file: A file or files at the target location used to supply data used in reconstructing the source file. RDC can use an arbitrary number of seed files in the process of copying a single source file. Selecting seed files can be guided by using similarity traits (see section 3.1.5.4.2).

signature: A structure containing a hash and a chunk size. The hash field is 16 bytes, and the chunk size field is a 2-byte unsigned integer.
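The field sizes above imply an 18-byte structure, which can be sketched with a fixed binary layout. The field order (hash first, then chunk size) and the little-endian byte order used here are assumptions made for illustration; the authoritative layout is the one given in section 2.2.2. The names SIGNATURE_FORMAT, pack_signature, and unpack_signature are hypothetical.

```python
import struct
from typing import Tuple

# Assumed layout: 16-byte hash, then a 2-byte unsigned chunk size,
# little-endian, no padding. See section 2.2.2 for the real layout.
SIGNATURE_FORMAT = "<16sH"
SIGNATURE_SIZE = struct.calcsize(SIGNATURE_FORMAT)


def pack_signature(chunk_hash: bytes, chunk_size: int) -> bytes:
    """Serialize one signature from a 16-byte hash and a chunk size."""
    if len(chunk_hash) != 16:
        raise ValueError("hash field must be exactly 16 bytes")
    if not 0 <= chunk_size <= 0xFFFF:
        raise ValueError("chunk size must fit in a 2-byte unsigned integer")
    return struct.pack(SIGNATURE_FORMAT, chunk_hash, chunk_size)


def unpack_signature(raw: bytes) -> Tuple[bytes, int]:
    """Parse one serialized signature back into (hash, chunk size)."""
    chunk_hash, chunk_size = struct.unpack(SIGNATURE_FORMAT, raw)
    return chunk_hash, chunk_size
```

Because the chunk size field is 2 bytes wide, a single signature can describe a chunk of at most 65,535 bytes.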

signature file: A file containing the signatures of another (source) file. There is a simple header that identifies the type of the file as a signature file, the size of the header itself, and the RDC library version number. Following the header are the signatures from the source file in the order they are generated from the chunks. See section 2.2.2.

similarity data: Information on a file that can be used to determine an appropriate seed file to select to reduce the amount of data transferred. Similarity data consists of one or more similarity traits.

similarity trait: A trait that summarizes an independent feature of a file. The features are computed by taking min-wise independent hash functions of a file's signatures. For information about how traits are computed, see section 3.1.5.4. Similarity traits are used in selecting seed files.

source file: A file on a source location that is to be copied by RDC. Sometimes referred to as source.

source files: A collection of files that are used to implement an InfoPath form. File types can include HTML, XML, XSD, XSLT, and script.

source location: The location from which a file is transferred after it has been compressed with RDC.

target file: A file on the target location that is the destination of an RDC copy.

target location: The destination location of a file that has been compressed by RDC.

MAY, SHOULD, MUST, SHOULD NOT, MUST NOT: These terms (in all caps) are used as defined in [RFC2119]. All statements of optional behavior use either MAY, SHOULD, or SHOULD NOT.