2.1.4.1 LZ77 Phase

This phase processes each byte of the input data and produces two outputs: the intermediate LZ77 ([UASDC]) encoding of flags, literals, and matches; and the frequency of each symbol in the Huffman alphabet.

The following flowchart shows how the LZ77 phase works.

LZ77 phase

Figure 1: LZ77 phase

The hash table is an array of pointers to previous positions in the input buffer. It is used to find matches, as follows:

 HashValue = HashThreeBytes(InputBuffer[CurrentPosition],
                            InputBuffer[CurrentPosition+1],
                            InputBuffer[CurrentPosition+2]);
 PotentialMatch = HashTable[HashValue];
 HashTable[HashValue] = CurrentPosition;

The HashThreeBytes function SHOULD be quick to compute and provide a small number of collisions.

If the additional CPU cost is justified, the algorithm SHOULD be extended to search for longer matches than those provided by the basic hash table. This can be achieved with more hash tables, trees, or a chained hash table. Finding longer matches generally results in smaller compressed data but requires more time for the compression method to execute.

The intermediate compression format that is produced in this phase SHOULD be designed for quick encoding and decoding, and it SHOULD be small enough to guarantee its fit in a temporary buffer that is only slightly larger than the input buffer. The algorithm will be more efficient if it is not necessary to check whether the temporary buffer has sufficient space.

The intermediate compression format SHOULD use bitmasks grouped in 32-bit values to represent the literal or match flags. Also, literal values SHOULD be stored as simple bytes in the intermediate stream. Matches SHOULD be encoded in sizes that are guaranteed to be less than or equal to their lengths.

For example, a 3-byte match could use 1 byte for its length and 2 bytes for its distance. Much longer matches SHOULD be encoded with a 2-byte distance and a special length value (such as 0xFF) indicating that the full length is encoded in the next 2 or 4 bytes.

During the LZ77 phase, the algorithm SHOULD count the frequencies of the Huffman symbols it will later encode. The Huffman symbol for each literal or match is computed in the following way.

  • For literals, the Huffman symbol index is the value of the literal (ranging from 0 to 255, inclusive).

  • For matches, the Huffman symbol is computed from the length and distance by using the following code, in which GetHighBit(Distance) is defined as the bit index of the highest set bit in the binary representation of the distance.

     If (Length – 3) < 15
         HuffmanSymbol = 256 + (Length – 3) + (16 * GetHighBit(Distance))
     Else
         HuffmanSymbol = 256 + 15 + (16 * GetHighBit(Distance))
    

Note that this definition assumes that Distance is greater than 0, and this is a valid assumption in this context.

The following table provides examples of GetHighBit calculations.

Distance

Binary representation

GetHighBit(Distance)

1

…0001

0

2

…0010

1

5

…0101

2

7

…0111

2

The GetHighBit function SHOULD be efficiently computed with a precomputed 256-byte table.

 If Distance < 256
    DistanceHighBit = PrecomputedHighBitTable[Distance]
 Else (assuming Distance < (1 << 16))
    DistanceHighBit = 8 + PrecomputedHighBitTable[Distance >> 8]