The decompressor: part 2

Your decompressor is already able to handle a subset of the Zstandard file format. However, it does not yet take advantage of the most efficient compression techniques used by the standard.

Zstandard utilizes two efficient compressors: Huffman codes and FSE.

  • Huffman codes are used to encode byte-oriented sequences with \(2^8\) possible output symbols. These codes are employed in the literals section, which consists of portions of data to be copied to the output stream during decompression.
  • FSE codes are used to encode data with a smaller number of output symbols. They are used for instructions that guide the overall decoding process, known as sequences, as well as coefficients used to initialize the Huffman decoder.

💡 Ensure you understand fully the Huffman and FSE examples, and that you obtain the same result while decoding the bit strings by hand.

Huffman code

The Huffman code is a prefix code. Bits are read one-by-one. Either the current sequence of bits designates a symbol to add to the output (for example a byte), or it is incomplete and more bits need to be read. Most frequent symbols get assigned shorter bit sequences, while less frequent symbols get assigned longer bit sequences. Once a symbol has been decoded, the decoder restarts from scratch and handles the next input bits.

Of course, no bit sequence for a symbol may correspond to the prefix of another bit sequence.

Practical Huffman code example

For example, if we only had 3 symbols A, B and C in our ouput, and B is more frequent than A and C, Huffman coding could use a dictionary such as:

Input bitsOutput symbol
00A
01C
1B

The input bit sequence 10010111 would get parsed as 1_00_1_01_1_1 and decoded as BABCBB. 8 input bits decode as 6 output symbols, taking approximately 1.33 bit by symbol.

Note that while B is represented as a single 1, all other bit sequences start with 0. No ambiguity can arise while adding bits as needed. The table can be represented as a prefix tree:

Huffman tree representation

In Zstandard, longer prefixes start with zeroes and shorter prefixes with ones. Also, for the same prefix width, symbols are ordered, which is why A uses 00 because it comes before C at 01.

FSE code

While the Huffman decoders starts from scratch once a symbol has been emitted, a FSE (finite-state entropy) decoder keeps an internal state. Upon entering a state, the decoder emits an output symbol, then reads a number of input bits to switch to another state, and so on. States are consecutively numbered starting from 0. Each state contains three pieces of information:

  • the output symbol to emit
  • the number of bits to read to determine the next state
  • the base line (a number) to add to the number read from the new bits to get the new state number

A FSE code has an accuracy log (AL) which represents the base 2 logarithm of the number of states. For example, a FSE decoder with 32 states will have an AL of 5, because \(2^5\) is 32. The AL is also the number of bits to read to determine the initial state.

Practical FSE example

Let us assume that we have a FSE table with 4 states. The accuracy log (AL) is 2.

StateOutput symbolBase line (BL)Number of bits to read
0A10
1D21
2B01
3A21

Let us decode the 1110000101 bit sequence (11_1_0_0_0_0_1_0_1 for clarity of this example). AL bits are read from the string, giving 11. The initial state is 3 (0b11):

  • Entering state 3 outputs symbol A. 1 bit (1) is added to the BL 2: new state is (0b1 + 2)
  • Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
  • Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: new state is 0 (0b0 + 0).
  • Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL 1: new state is 1.
  • Entering state 1 outputs symbol D. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
  • Entering state 2 outputs symbol B. 1 bit (0) is added to the BL 0: new state is 0 (0b0 + 0).
  • Entering state 0 outputs symbol A. 0 bits are read, the new state is the BL: new state is 1.
  • Entering state 1 outputs symbol D. 1 bit (1) is added to the BL 2: new state is 3 (0b1 + 2).
  • Entering state 3 outputs symbol A. 1 bit (0) is added to the BL 2: new state is 2 (0b0 + 2).
  • Entering state 2 outputs symbol B. 1 bit (1) is added to the BL 0: new state is 1 (0b1 + 0).
  • Entering state 1 outputs symbol D. No 1 bit is available, stopping

The decoded symbol sequence is AABADBADABD. One may note that:

  • One state may encode more than one output symbols, since the number of bits to read to get to the following state may be 0.
  • Contrary to a Huffman code, some symbols need less than one bit in the compressed representation. Indeed, here we have a 10 bits input string decoding into a 11 symbols output, that is approximately 0.9 bit per symbol.
  • Several states may encode the same symbol (states 0 and 3 both emit A).