2.1.4.3 Final Encoding Phase
In the final encoding phase, the algorithm processes the intermediate encoding of literals and matches generated by the LZ77 ([UASDC]) phase. It re-encodes each literal and match using the canonical Huffman codes, but first it encodes the Huffman symbol bit lengths.
Each symbol bit length is encoded with 4 bits. Bit lengths for even-valued symbols are stored in the lower 4 bits of the bytes, whereas bit lengths for odd-valued symbols are stored in the higher 4 bits.
For example, if the bit lengths of symbols 0, 1, 2, and 3
were 5, 6, 7, and 8, respectively, the first 2 bytes of the output buffer would
be 0x65 0x87
. The Huffman [IEEE-MRC]
construction process guarantees that each bit length fits in 4 bits. Symbols
that are never used, and therefore have no Huffman code, have the
special value of zero.
Because there are 512 Huffman symbols, and the format stores two lengths per byte, this part of the output data will always be exactly 256 bytes.
Following the 256-byte table, the format encodes the sequence of literals and matches. Literals are distinguished from matches by the value of the Huffman symbol: symbol values less than 256 are literals, whereas symbols greater than 255 are matches. Most matches require more bits to fully encode the distance and the length.
As explained in section 2.1.4.1, the match symbol
value encodes the length of the match (up to 17) and the bit index of the
highest set bit in the distance. If this bit index is, for example, 3, the
decompression function can determine that the distance is at least 1000 (1000
binary, or 8 decimal) and at most 1111 (1111 binary, or 15 decimal). It can
also compute that 3 more bits of information are required to determine the
exact distance. Therefore, the encoder encodes the lower 3 bits of the distance
directly in the output bit stream (which is also used to encode the
variable-length Huffman codes). In general, the encoder explicitly encodes the
lower <GetHighBit(Distance)>
bits
immediately following the match's Huffman symbol.
The encoder is required to process match lengths longer than 17. If the length is less than 18, the decoder can determine it directly from the match symbol by taking the lower 4 bits and adding 3. A lower-four-bits value of 15 is a special case that means the length is at least 18, and the full length is encoded with more bits. Unlike the extra-distance bits, the extra-length bits are not encoded seamlessly in the Huffman bit stream. Longer lengths are encoded with an extra byte in the output, and if that is not enough, an additional 2 bytes. The location of these extra bytes is such that, if the decompression function reads the Huffman bit stream in 2-byte chunks, these extra bytes are the next bytes that the decompression function will read.
Implementations of the decompression algorithm may expect an extra symbol to mark the end of the data. For example, certain implementations fail during decompression if the Huffman symbol 256 is not found after the actual data. For this reason, the encoding algorithm SHOULD append this symbol and increment the count of symbol 256 before the Huffman codes are constructed.
The following pseudocode demonstrates the encoding method.
-
Write the 256-byte table of symbol bit lengths While there are more literals or matches to encode If the next thing is a literal WriteBits(SymbolLength[LiteralValue], SymbolCode[LiteralValue]) Else // the next thing is a match Extract the length and distance of the match MatchSymbolValue = 256 + min(Length, 15) + (16 * GetHighBit(Distance)) WriteBits(SymbolLength[MatchSymbolValue], SymbolCode[MatchSymbolValue]) Length = Length - 3 If (Length) >= 15 Length = Length – 15 If (Length < 255) WriteByte(Length) Else WriteByte(255) Length = Length + 15 If (Length < 65536) WriteUint16(Length) Else WriteUint16(0) WriteUint32(Length) WriteBits(GetHighBit(Distance), Distance – (1 << GetHighBit(Distance))) WriteBits(SymbolLength[256], SymbolCode[256]) FlushBits()
The WriteBits, WriteByte, WriteUint16, WriteUint32, and FlushBits functions implicitly use five variables, which are initialized as follows:
-
FreeBits = 16 NextWord = 0 OutputPosition1 = OutputBufferPointer + 256 OutputPosition2 = OutputBufferPointer + 258 OutputPosition = OutputBufferPointer + 260
The following pseudocode shows the implementation of the functions. Note that a complete implementation must also include bounds checks to ensure that nothing is written beyond the output buffer.
-
WriteBits (NumberOfBitsToWrite, BitsToWrite) If FreeBits >= NumberOfBitsToWrite FreeBits = FreeBits – NumberOfBitsToWrite NextWord = (NextWord << NumberOfBitsToWrite) + BitsToWrite Else NextWord = (NextWord << FreeBits) NextWord = NextWord + (BitsToWrite >> (NumberOfBitsToWrite – FreeBits)) FreeBits = FreeBits – NumberOfBitsToWrite Write (NextWord & 0xFF) to OutputPosition1 Write (NextWord >> 8) to OutputPosition1 + 1 OutputPosition1 = OutputPosition2 OutputPosition2 = OutputPosition Advance OutputPosition by 2 bytes FreeBits = FreeBits + 16 NextWord = BitsToWrite End WriteByte (ByteToWrite) Write ByteToWrite to OutputPosition Advance OutputPosition by 1 byte End WriteUint16 (Value) Write bytes for Value to OutputPosition Advance OutputPosition by 2 bytes End WriteUint32 (Value) Write bytes for Value to OutputPosition Advance OutputPosition by 4 bytes End FlushBits () NextWord <<= FreeBits Write (NextWord & 0xFF) to OutputPosition1 Write (NextWord >> 8) to OutputPosition1 + 1 Write a 16-bit value of zero to OutputPosition2 The final compressed size is the value of OutputPosition End