Similarity Data Calculation

To calculate the similarity data for a file F, a 16-byte array is required to store each intermediate MD4 digest. For the purpose of this discussion, this data construct is called a digest. The calculation of similarity data for a file is expressed using the following pseudocode.

 Define Digest as unsigned char[16]
 Digest md4Results
 Digest sData[16] =
  { {0xff, 0xff, ..., 0xff},
    {0xff, 0xff, ..., 0xff},
    ...
    {0xff, 0xff, ..., 0xff} }
 unsigned char tempBuffer[17]
 FOR each RDC Signature of a File
     Copy the RDC Signature data into the first 16 bytes of tempBuffer
     FOR index = 0 to 15
         tempBuffer[16] := index + 1
         Compute an MD4 digest for the data in tempBuffer,
             put the results in md4Results
         IF md4Results < sData[index]
             Copy the MD4 digest from md4Results to sData[index]

The less-than comparison "md4Results < sData[index]" is a byte-by-byte comparison. Given two digests A and B, each digest is treated as an array of 16 unsigned 8-bit integers, where A0 is the first byte and A15 is the last byte of A. A < B is true if, at the smallest index i in 0 to 15 for which Ai differs from Bi, Ai is less than Bi. If A and B are identical, A < B is false.
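The comparison rule above can be sketched in Python as follows. This is a minimal illustration, not part of the protocol; the function name digest_less_than is hypothetical. Python's built-in ordering on bytes objects is also an unsigned byte-by-byte comparison, so for two digests of equal length it gives the same result as the explicit loop.

```python
def digest_less_than(a: bytes, b: bytes) -> bool:
    """Return True if digest a is less than digest b, byte by byte.

    Hypothetical helper: both arguments are assumed to be 16-byte digests.
    """
    for x, y in zip(a, b):
        if x != y:
            # The first differing byte decides the comparison.
            return x < y
    # All 16 bytes are equal, so a is not less than b.
    return False


# Example: the digests differ first at byte index 1.
a = bytes([0x10, 0x20] + [0x00] * 14)
b = bytes([0x10, 0x21] + [0x00] * 14)
print(digest_less_than(a, b))  # True
print(digest_less_than(b, a))  # False
```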

Once all RDC signatures for a file have been processed, the similarity data is extracted from the array of temporary digest constructs (named sData in the preceding pseudocode). The similarity data is constructed by taking the 8th byte (index 7) from each of the 16 digests computed in the preceding pseudocode and clearing the top 2 bits. This is shown in the following pseudocode.

 unsigned char similarity[16]
 FOR index = 0 to 15
     similarity[index] := sData[index][7] & 0x3F

The data in similarity, shown in the preceding pseudocode, is the similarity data for the file.
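The two pseudocode fragments can be combined into one runnable Python sketch. The names md4 and similarity_data are hypothetical, and each RDC signature is assumed to be supplied as a 16-byte value, as implied by the 17-byte tempBuffer. MD4 is written out in pure Python (following RFC 1320) only because many hashlib builds no longer expose it; any conforming MD4 implementation could be substituted.

```python
import struct


def md4(data: bytes) -> bytes:
    """Pure-Python MD4 per RFC 1320, included so the sketch is self-contained."""
    mask = 0xFFFFFFFF

    def rol(x, n):
        return ((x << n) | (x >> (32 - n))) & mask

    state = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
    # Pad to 56 mod 64 bytes, then append the bit length (little-endian).
    msg = data + b"\x80" + b"\x00" * ((55 - len(data)) % 64)
    msg += struct.pack("<Q", (len(data) * 8) & 0xFFFFFFFFFFFFFFFF)
    # Each round: (auxiliary function, constant, word order, shift amounts).
    rounds = (
        (lambda b, c, d: (b & c) | (~b & d), 0x00000000,
         tuple(range(16)), (3, 7, 11, 19)),
        (lambda b, c, d: (b & c) | (b & d) | (c & d), 0x5A827999,
         (0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15), (3, 5, 9, 13)),
        (lambda b, c, d: b ^ c ^ d, 0x6ED9EBA1,
         (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15), (3, 9, 11, 15)),
    )
    for off in range(0, len(msg), 64):
        x = struct.unpack("<16I", msg[off:off + 64])
        a, b, c, d = state
        for f, const, order, shifts in rounds:
            for i, k in enumerate(order):
                a = rol((a + f(b, c, d) + x[k] + const) & mask, shifts[i % 4])
                a, b, c, d = d, a, b, c  # rotate register roles
        state = [(s + v) & mask for s, v in zip(state, (a, b, c, d))]
    return struct.pack("<4I", *state)


def similarity_data(signatures) -> bytes:
    """Compute the 16 traits from an iterable of 16-byte RDC signatures."""
    # sData: 16 digests, each initialized to all 0xFF bytes.
    s_data = [bytes([0xFF] * 16) for _ in range(16)]
    for sig in signatures:
        if len(sig) != 16:
            raise ValueError("each RDC signature is assumed to be 16 bytes")
        for index in range(16):
            # tempBuffer: the signature followed by the byte index + 1.
            digest = md4(sig + bytes([index + 1]))
            # bytes ordering is the unsigned byte-by-byte comparison.
            if digest < s_data[index]:
                s_data[index] = digest
    # Take byte 7 of each digest and clear the top 2 bits.
    return bytes(d[7] & 0x3F for d in s_data)


# Made-up signatures, for illustration only.
example = [bytes(range(16)), bytes(range(16, 32))]
print(similarity_data(example).hex())
```

Because each sData entry keeps the minimum digest seen, the result does not depend on the order in which signatures are processed. With no signatures at all, this sketch leaves every sData entry at 0xFF, so every trait comes out as 0x3F.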

Each trait of similarity data MUST be a value from 0 to 63 inclusive.

An implementation that does not compute similarity data MUST set all traits to zero.