3.1.5.4 Similarity

The effectiveness of data replication using RDC techniques depends to a large degree on the seed files that can be found at the target location. If the RDC scenario is simple, such as the synchronization of a file that has been replicated recently, the seed file can simply be the target location's existing copy of the file that is being synchronized. If the RDC scenario is more complicated, such as the replication of a directory of files, files might have been added to the source location that do not exist on the target location, in which case there is no simple choice of seed files.

To help choose seed files under the circumstances outlined in the preceding paragraph, similarity data for new files at the source location is calculated and sent to the target location. The target location uses the similarity data to find existing files that are similar to the new source location files. Those existing target location files can then become the seed files for an RDC replication of the new source location files, making the transfer of the new files more efficient.
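The seed selection step described above can be sketched as follows. This is an illustrative sketch only, not the matching procedure defined by this specification. It assumes that similarity data is a fixed-length byte string in which each byte is one "trait", and that two files are considered more similar the more positions agree. The names `matching_traits`, `choose_seed_files`, `min_matches`, and `max_seeds` are hypothetical and do not appear in the protocol.

```python
from typing import Dict, List, Tuple

# Assumption: similarity data is modeled as a fixed-length bytes value,
# one byte per trait. The actual format is defined elsewhere.
SimilarityData = bytes


def matching_traits(a: SimilarityData, b: SimilarityData) -> int:
    """Count the positions at which two similarity values hold the same byte."""
    if len(a) != len(b):
        raise ValueError("similarity data values must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


def choose_seed_files(
    new_file_similarity: SimilarityData,
    target_index: Dict[str, SimilarityData],
    min_matches: int = 4,   # hypothetical threshold
    max_seeds: int = 3,     # hypothetical cap on the number of seed files
) -> List[Tuple[str, int]]:
    """Return (path, match count) pairs for the best seed candidates.

    target_index maps each existing target-location file to the
    similarity data previously calculated for it.
    """
    scored = [
        (path, matching_traits(new_file_similarity, sim))
        for path, sim in target_index.items()
    ]
    # Keep only files that are similar enough to be worth using as seeds.
    candidates = [(path, score) for path, score in scored if score >= min_matches]
    # Best match first; ties are broken by path so the result is deterministic.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:max_seeds]


# Example: similarity data received from the source for one new file,
# compared against three files that already exist on the target.
target_index = {
    "a.doc": bytes([1, 2, 3, 4, 5, 6, 7, 8]),
    "b.doc": bytes([1, 2, 3, 4, 9, 9, 9, 9]),
    "c.doc": bytes([9] * 8),
}
new_file = bytes([1, 2, 3, 4, 5, 6, 0, 0])
seeds = choose_seed_files(new_file, target_index)
```

In this sketch the target location only needs the small similarity value for each new file, not the file contents, to shortlist candidate seed files before the RDC transfer begins.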

Similarity data can appear in protocols that use RDC, but it has no other effect on the protocol itself.<3> As described earlier in this section, similarity data is used to identify a seed file on the target location. For examples of the use of similarity data, see [MS-FRS2] sections 1.7, 2.2.1.2.1, and 2.2.1.4.4.

The following diagram illustrates the calculation of similarity data.

Figure 4: Similarity data calculation