
1.1 Glossary

The following terms are defined in [MS-GLOS]:

hash function

The following terms are specific to this document:

chunks: The pieces of a file defined by the cut points.

client: For the purposes of this document only, the client is the target location machine.

collision-resistant hash function: A hash function having the property that (in practice) differing inputs do not produce the same hash (or collide).

cut points: The locations in a file where RDC has determined boundary points between blocks (or chunks). The cut points for a particular file depend on the content of the file and the parameters with which RDC is running.
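As an illustration of how cut points relate to chunks, the following is a minimal sketch in Python. The function name `split_at_cut_points` is hypothetical, and the sketch assumes that each cut point is the byte offset at which a new chunk begins; this document's normative definition of cut points is the one given by the RDC FilterMax algorithm in section 3.1.5.1.

```python
def split_at_cut_points(data: bytes, cut_points: list[int]) -> list[bytes]:
    """Split data into chunks at the given offsets.

    Assumption: a cut point is the offset of the first byte of the next chunk.
    Offsets outside the open range (0, len(data)) are ignored.
    """
    chunks = []
    start = 0
    for cut in sorted(set(cut_points)):
        if not 0 < cut < len(data):
            continue
        chunks.append(data[start:cut])
        start = cut
    # The final chunk runs from the last cut point to the end of the data.
    chunks.append(data[start:])
    return chunks


chunks = split_at_cut_points(b"abcdefgh", [3, 5])
# Concatenating the chunks in order reproduces the original data.
assert b"".join(chunks) == b"abcdefgh"
```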

file: A file is a typed data stream. For the purposes of this document only, file does not imply storage of the data stream in any particular medium or with any particular organization, or, for example, in a file system (italic is used when referring to traditional files).

hash window: The length, in bytes, of the domain of the rolling hash function. That is, the parameter n in the definition of rolling hash function.

horizon: An integer parameter of the RDC FilterMax algorithm. It refers to the number of consecutive hash values on both sides of a file offset.

local maximum: A pair consisting of an offset i in a file and the hash value h(b_{i - hash window} .. b_i) that has the property that, for all j ≥ 0 such that i - horizon ≤ j ≤ i + horizon, either j = i or h(b_{j - hash window} .. b_j) < h(b_{i - hash window} .. b_i), where b_k is defined to be 0 for all k < 0. Local maxima are used to find cut points by the RDC FilterMax algorithm.
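The local maximum condition can be sketched in Python as follows. This is an illustration only, not the RDC FilterMax algorithm of section 3.1.5.1. The function name `local_maxima` is hypothetical, the input is assumed to be a list in which element j is the already computed hash value at file offset j, and offsets past the end of the list are simply ignored (an assumption made here; the definition above does not address them).

```python
def local_maxima(hashes: list[int], horizon: int) -> list[tuple[int, int]]:
    """Return (offset, hash value) pairs that are strict local maxima.

    An offset i qualifies when every other offset j within `horizon`
    positions on either side has a strictly smaller hash value.
    """
    result = []
    count = len(hashes)
    for i in range(count):
        low = max(0, i - horizon)            # the definition requires j >= 0
        high = min(count - 1, i + horizon)   # assumption: clip at end of input
        if all(hashes[j] < hashes[i] for j in range(low, high + 1) if j != i):
            result.append((i, hashes[i]))
    return result


values = [1, 5, 2, 3, 9, 4, 9, 0]
print(local_maxima(values, horizon=1))  # [(1, 5), (4, 9), (6, 9)]
print(local_maxima(values, horizon=2))  # [(1, 5)]
```

With a horizon of 2, the two offsets holding the value 9 fall within each other's horizon. Because the comparison is strict, neither of them is a local maximum.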

MD4: Message Digest 4, as defined in [RFC1320]. MD4 is a collision-resistant, non-rolling hash function that produces a 16-byte hash. While MD4 is no longer considered to be cryptographically secure, RDC does not rely on cryptographic security in its hash function.

min-wise independent hash functions: A set H = { H_1, H_2, … , H_n } of hash functions is said to be min-wise independent if, for any input D and for any subset X of {1, … , n}, any index i of X has an equal probability of yielding the smallest value H_i(D) of the values H_j(D) for j in X.

RDC FilterMax algorithm: The algorithm that RDC uses to determine the cut points in a file. The RDC FilterMax algorithm has the property that it will often find cut points that result in identical chunks being found in differing files, even when the files differ by insertions and deletions of bytes, not simply by length-preserving byte modifications. See section 3.1.5.1.

recursion level: For very large files, even the signature file may be large. To reduce the bandwidth required to transfer signature files, RDC may be used to reduce the number of bytes that must be moved from source location to target location to transfer the signature file. This process may be repeated a number of times, producing successively smaller signature files. The recursion level is the number of times it is repeated. The first signature file is recursion level 1. The source file is referred to as being at recursion level 0.

remote differential compression (RDC): In general, the term RDC may refer to any of a class of compression algorithms designed to compare two files residing at different locations without requiring one of the files to be transferred in its entirety to the other location. In this document, unless otherwise noted, the terms remote differential compression and RDC always refer to the RDC algorithm developed by Microsoft.

rolling hash function: A hash function that can be computed incrementally over a set of data. Given an arbitrary integer n ≥ 0, some bytes b_0 .. b_{n-1}, and their hash h(b_0 .. b_{n-1}), a hash function h is a rolling hash function if one can compute h(b_1 .. b_n) in time that does not depend on n.
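The rolling property can be illustrated with a deliberately simple hash: the sum of the bytes in the window, modulo 2^32. This is not the rolling hash function that RDC uses; it is chosen here only because its update step is easy to verify. The names `window_hash` and `roll` are hypothetical.

```python
MOD = 2 ** 32


def window_hash(window: bytes) -> int:
    """Hash a whole window from scratch; the cost grows with the window length."""
    return sum(window) % MOD


def roll(previous_hash: int, outgoing: int, incoming: int) -> int:
    """Slide the window one byte to the right.

    The cost is constant: it does not depend on the window length n.
    """
    return (previous_hash - outgoing + incoming) % MOD


data = b"remote differential compression"
n = 4  # hash window, in bytes
h = window_hash(data[0:n])
for i in range(1, len(data) - n + 1):
    # Drop byte b_{i-1} from the window and take in byte b_{i+n-1}.
    h = roll(h, data[i - 1], data[i + n - 1])
    assert h == window_hash(data[i:i + n])
```

Each call to `roll` performs one subtraction and one addition regardless of n, which is the property the definition requires.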

seed data: See seed file.

seed file: A file or files at the target location that supply data for reconstructing the source file. RDC may use an arbitrary number of seed files in the process of copying a single source file. Selecting seed files can be guided by using similarity traits (see section 3.1.5.4.2).

server: For the purposes of this document only, the server is the source location.

signature: A structure containing a hash and a chunk size. The hash field is 16 bytes, and the chunk size field is a 2-byte unsigned integer. See section 2.2.2.1.
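The field sizes above can be captured in a small Python sketch. The class name `Signature` and the field names `digest` and `chunk_size` are hypothetical; the sketch models only the two size constraints stated in this definition and deliberately says nothing about field order or byte order on the wire, which are specified in section 2.2.2.1.

```python
from dataclasses import dataclass

HASH_LENGTH = 16          # the hash field is 16 bytes
MAX_CHUNK_SIZE = 0xFFFF   # largest value of a 2-byte unsigned integer


@dataclass(frozen=True)
class Signature:
    digest: bytes     # hash of one chunk
    chunk_size: int   # length of that chunk, in bytes

    def __post_init__(self) -> None:
        if len(self.digest) != HASH_LENGTH:
            raise ValueError("hash must be exactly 16 bytes")
        if not 0 <= self.chunk_size <= MAX_CHUNK_SIZE:
            raise ValueError("chunk size must fit in a 2-byte unsigned integer")


# Placeholder all-zero hash, used only to show construction.
example = Signature(digest=bytes(16), chunk_size=2048)
```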

signature file: A file containing the signatures of another (source) file. There is a simple header that identifies the type of the file as a signature file, the size of the header itself, and the RDC library version number. Following the header are the signatures from the source file in the order they are generated from the chunks. See section 2.2.2.

similarity data: Information on a file that can be used to determine an appropriate seed file to select to reduce the amount of data transferred. Similarity data consists of one or more similarity traits.

similarity trait: A trait that summarizes an independent feature of a file. The features are computed by taking min-wise independent hash functions of a file's signatures. For information about how traits are computed, see section 3.1.5.4. Similarity traits are used in selecting seed files.
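The general idea of summarizing a file by the minimum of several hash functions over its signatures can be sketched as follows. This is not the trait computation of section 3.1.5.4. The function names are hypothetical, and salted BLAKE2b from the Python standard library stands in for a min-wise independent family, which it only approximates.

```python
import hashlib


def similarity_traits(signature_hashes: list[bytes], trait_count: int = 4) -> list[bytes]:
    """Compute one trait per hash function: the smallest hashed signature.

    Assumption: hash function number `index` is BLAKE2b salted with `index`.
    """
    if not signature_hashes:
        raise ValueError("at least one signature hash is required")
    traits = []
    for index in range(trait_count):
        salt = index.to_bytes(2, "big")
        traits.append(min(
            hashlib.blake2b(sig, digest_size=8, salt=salt).digest()
            for sig in signature_hashes
        ))
    return traits


def matching_traits(a: list[bytes], b: list[bytes]) -> int:
    """Count the positions at which two trait lists agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Placeholder 16-byte signature hashes, used only for illustration.
signatures = [bytes([i]) * 16 for i in range(10)]
traits = similarity_traits(signatures)
```

Under this sketch, two files that share many signatures are likely to agree on many traits, so a candidate seed file could be ranked by the value of `matching_traits`.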

source data: See source file.

source file: A file on a source location that is to be copied by RDC. Sometimes referred to as source.

source location: The source location is the location from which a file is being transferred after it has been compressed with RDC.

target file: A file on the target location that is the destination of an RDC copy.

target location: The target location is the destination location of a file that has been compressed by RDC.

MAY, SHOULD, MUST, SHOULD NOT, MUST NOT: These terms (in all caps) are used as described in [RFC2119]. All statements of optional behavior use either MAY, SHOULD, or SHOULD NOT.

 
© 2015 Microsoft