2.7.3 XMHybridRLE Compression Algorithms

XMHybridRLE compression algorithms use two forms of compression in combination: run length encoding (RLE) and bit packing. As a result, these compression algorithms use two segments to represent all the values. The segments are referred to here as the RLE segment, or primary segment, and the subsegment, or bit-packing subsegment. The RLE segment contains a possible mix of RLE entries and bit-packing entries, the latter of which refer to bit-packed values that follow in the subsegment.

The first type of compression that is used in this hybrid compression is RLE. RLE is used only on appropriate data items—those that repeat often enough to make RLE an efficient compression choice. In RLE compression, two 4-byte values are used that together comprise the first type of entry in the RLE segment—the RLE entry. The first value of the RLE entry is the data value. The second value is the repeat count, which is the number of times that the data value repeats in a continuous sequence. The repeat count MUST be equal to or greater than 64. For an RLE run to be generated by the system, at least 64 consecutive items (that is, 64 identical records) MUST exist in that run of data. Otherwise, the individual items will be bit packed by using the chosen bit-packing algorithm in the related subsegment.

The second type of entry in the RLE segment is the bit-packing entry. The bit-packing entry also contains two 4-byte values. The first value is a negated 1-based offset into the subsegment data, and the second value is the count of the number of values that follow in the subsegment. The offset value is represented as the negative of itself to clearly distinguish it from the RLE entries. The bit-packed values exist in a separate bit-packing subsegment that follows the RLE segment.

The first bit-packing entry uses –1 as the first value (to indicate that it is the first data value in the subsegment) and then has the count of bit-packing items as the second value. Any subsequent bit-packing entries also have a negative offset value as the first value. This negative offset value is calculated by taking the previous negative bit-packing entry offset value, and subtracting from it the bit-packing entry count of that same entry.

Any number of RLE entries and bit-packing entries can exist in the RLE segment, and in any order. It is also possible for the RLE segment to have no RLE entries or no bit-packing entries in the RLE segment. In very rare cases, it is possible for both the segment and subsegment to be empty. For more information about segment minimum and maximum row sizes, see section 2.3.1.1.3.

The bit-packed compression values follow the RLE segment. The bit-packed values are in the same order that they are referred to here but within their own, bit-packed subsegment. All the bit-packed values that are referenced (as bit-packing entries) within the RLE segment use the same type of bit-packing compression because only one type of bit-packing compression is allowed per segment. In other words, after the RLE segment, the bit-packing subsegment that is associated with the RLE segment uses a single bit-packing algorithm, such as that for XMHybridRLECompressionInfo<class XMRENoSplitCompressionInfo<3>> in which each bit-packed entry is compressed by using 3 bits. For more information about the column data storage file format, see section 2.3.1.

For a diagrammed example of RLE entries mixed with bit-packing entries, see section 2.7.3.1.

After the end of all the data entries (both the RLE entries and the bit-packing entries) in the RLE segment, any remaining storage area allocated for the RLE segment is zeroed out. The size of this padding depends on the amount of storage allocated and the amount used. These amounts can be found in the metadata information for this compression (see section 2.5.2.38.1). 

Following the end of this segment (including any padding at the end of the segment), the next segment begins. This segment is called the subsegment of the XMHybridRLE compression and contains the selected bit-packing compression sequence. For more information about the column data storage file format, which is the file format that uses this hybrid compression dual segment layout, see section 2.3.1.

The second part of the hybrid compression, the subsegment, involves using a bit-packing compression algorithm. This part is used for data items that are not suitable for RLE compression, and it uses a form of range encoding bit-packing compression in which the data to be stored (after that data has been offset by a minimum value) is compacted into a specified number of bits. The bit size that is chosen determines the bit-packing algorithm that is used to compress the data. The values that are bit packed in this segment conform to the sequence that is specified in the RLE segment.

Despite any name similarity, the range encoding bit-packing algorithm that is used in the hybrid compression is not necessarily the same as the range encoding bit-packing algorithm that is used in the nonhybrid case.

The storage area for the compressed value, whether RLE or bit packed, MUST be zeroed out prior to compressing the value and storing it in that area.