2.1.4.3 Final Encoding Phase

Article
01/09/2024

In the final encoding phase, the algorithm processes the intermediate encoding of literals and matches generated by the LZ77 ([UASDC]) phase. It re-encodes each literal and match using the canonical Huffman codes, but first it encodes the Huffman symbol bit lengths.

Each symbol bit length is encoded with 4 bits. Bit lengths for even-valued symbols are stored in the lower 4 bits of the bytes, whereas bit lengths for odd-valued symbols are stored in the higher 4 bits.

For example, if the bit lengths of symbols 0, 1, 2, and 3 were 5, 6, 7, and 8, respectively, the first 2 bytes of the output buffer would be 0x65 0x87. The Huffman [IEEE-MRC] construction process guarantees that each bit length fits in 4 bits. Symbols that are never used, and therefore have no Huffman code, have the special value of zero.

Because there are 512 Huffman symbols, and the format stores two lengths per byte, this part of the output data will always be exactly 256 bytes.

Following the 256-byte table, the format encodes the sequence of literals and matches. Literals are distinguished from matches by the value of the Huffman symbol: symbol values less than 256 are literals, whereas symbols greater than 255 are matches. Most matches require more bits to fully encode the distance and the length.

As explained in section 2.1.4.1, the match symbol value encodes the length of the match (up to 17) and the bit index of the highest set bit in the distance. If this bit index is, for example, 3, the decompression function can determine that the distance is at least 1000 (1000 binary, or 8 decimal) and at most 1111 (1111 binary, or 15 decimal). It can also compute that 3 more bits of information are required to determine the exact distance. Therefore, the encoder encodes the lower 3 bits of the distance directly in the output bit stream (which is also used to encode the variable-length Huffman codes). In general, the encoder explicitly encodes the lower <GetHighBit(Distance)> bits immediately following the match's Huffman symbol.

The encoder is required to process match lengths longer than 17. If the length is less than 18, the decoder can determine it directly from the match symbol by taking the lower 4 bits and adding 3. A lower-four-bits value of 15 is a special case that means the length is at least 18, and the full length is encoded with more bits. Unlike the extra-distance bits, the extra-length bits are not encoded seamlessly in the Huffman bit stream. Longer lengths are encoded with an extra byte in the output, and if that is not enough, an additional 2 bytes. The location of these extra bytes is such that, if the decompression function reads the Huffman bit stream in 2-byte chunks, these extra bytes are the next bytes that the decompression function will read.

Implementations of the decompression algorithm may expect an extra symbol to mark the end of the data. For example, certain implementations fail during decompression if the Huffman symbol 256 is not found after the actual data. For this reason, the encoding algorithm SHOULD append this symbol and increment the count of symbol 256 before the Huffman codes are constructed.

The following pseudocode demonstrates the encoding method.

 Write the 256-byte table of symbol bit lengths
 While there are more literals or matches to encode
     If the next thing is a literal
         WriteBits(SymbolLength[LiteralValue], SymbolCode[LiteralValue])
     Else      // the next thing is a match
         Extract the length and distance of the match
         MatchSymbolValue = 256 + min(Length, 15) + (16 * GetHighBit(Distance))
         WriteBits(SymbolLength[MatchSymbolValue], SymbolCode[MatchSymbolValue])
         Length = Length - 3
         If (Length) >= 15
             Length = Length – 15
             If (Length < 255)
                 WriteByte(Length)
             Else
                 WriteByte(255)
                 Length = Length + 15
                 If (Length < 65536)
                     WriteUint16(Length)
                 Else
                     WriteUint16(0)
                     WriteUint32(Length)
     WriteBits(GetHighBit(Distance), Distance – (1 << GetHighBit(Distance)))
 WriteBits(SymbolLength[256], SymbolCode[256])
 FlushBits()

The WriteBits, WriteByte, WriteUint16, WriteUint32, and FlushBits functions implicitly use five variables, which are initialized as follows:

 FreeBits = 16
 NextWord = 0
 OutputPosition1 = OutputBufferPointer + 256
 OutputPosition2 = OutputBufferPointer + 258 
 OutputPosition  = OutputBufferPointer + 260

The following pseudocode shows the implementation of the functions. Note that a complete implementation must also include bounds checks to ensure that nothing is written beyond the output buffer.

 WriteBits (NumberOfBitsToWrite, BitsToWrite)
     If FreeBits >= NumberOfBitsToWrite
         FreeBits = FreeBits – NumberOfBitsToWrite
         NextWord = (NextWord << NumberOfBitsToWrite) + BitsToWrite
     Else
         NextWord = (NextWord << FreeBits)
         NextWord = NextWord + (BitsToWrite >> (NumberOfBitsToWrite – FreeBits))
         FreeBits = FreeBits – NumberOfBitsToWrite
         Write (NextWord & 0xFF) to OutputPosition1
         Write (NextWord >> 8) to OutputPosition1 + 1
         OutputPosition1 = OutputPosition2
         OutputPosition2 = OutputPosition
         Advance OutputPosition by 2 bytes
         FreeBits = FreeBits + 16
         NextWord = BitsToWrite
 End
 WriteByte (ByteToWrite)
     Write ByteToWrite to OutputPosition
     Advance OutputPosition by 1 byte
 End
 WriteUint16 (Value)
     Write bytes for Value to OutputPosition
     Advance OutputPosition by 2 bytes
 End
 WriteUint32 (Value)
     Write bytes for Value to OutputPosition
     Advance OutputPosition by 4 bytes
 End
 FlushBits ()
     NextWord <<= FreeBits
     Write (NextWord & 0xFF) to OutputPosition1
     Write (NextWord >> 8) to OutputPosition1 + 1
        Write a 16-bit value of zero to OutputPosition2
     The final compressed size is the value of OutputPosition
 End

2.1.4.3 Final Encoding Phase

Additional resources