Forward bit parser

FSE decoder distribution table

The FSE decoder is used at two places in the Zstandard decoding process:

  • to decode the Huffman decoder tables: two similar FSE decoders produce output symbols alternatively using the same input bitstream;
  • to decode the sequences of instructions used to rebuild the output in a compressed block: three FSE decoders will be used simultaneously.

The principles of the FSE decoding process have been explained here.

The FSE decoder states are built from1:

  • An accuracy log \(AL\)
  • A distribution table \(\mathcal{D}\) which describes how symbols are distributed amongst the states. Every entry in the table is a number in the \([-1, R]\) range.

Since all the states will be filled, and since \(-1\) is just a special value which means that a symbol occupies only one state, we have \[R = \sum_{d \in \mathcal{D}}|d| = 2^{AL}\].

The various arguments will be read from a bitstream coming from the file being decoded:

  • \(AL\) is read first and computed from 4 bits, which also gives us the states table size \(R\)
  • The distribution table is read, one symbol after the other. Since the sum of absolute values in the distribution table cannot exceed \(R\), you always know the maximum value that can be read. This lets you determine the number of bits you need to read from the stream in order to decode this value.
  • You will know when you have a full table, as the sum of absolute values will be equal to \(R\). It means that you know when to stop reading the bitstream.

For reasons which will become apparent later, the data pertaining to the FSE table description are parsed with a forward bit parser. This is a parser similar to the backward bit parser you have written already, but bits are consumed forward.

Parsing a bitstream forwards

What does it mean to parse a bitstream forwards?

  • You must start from the first byte and progress in regular order.
  • In every byte, you start reading bits from the least significant bit to the most significant bit.
  • If you read several bits at a time to get an integer value, the first bit you get is the least significant one.

For example, let us start from the [0x05, 0x74] byte sequence found in a Zstandard file. In binary, this is written as [0b00000101, 0b01110100].

The bit string will be consumed starting from the byte marked α then the byte marked β, following the arrow direction inside each byte:

00000101 01110100
<---α--| <---β--|
       ^

Now, let us read a 2 bits value. We get 0b01 (remember that the first bit read is the least significant one in the result) and the caret advances 2 positions:

000001-- 01110100
<---α--| <---β--|
     ^

Similarly, reading 9 bits yields 0b100000001 (or 0b100_000001 to make it clearer which part comes from each input byte), and the caret advances 9 positions:

-------- 01110---
<---α--| <---β--|
             ^

Reading four more bits yields 0b1110:

-------- 0-------
<---α--| <---β--|
         ^

If we don't want to read any further, the remaining bits are discarded. Note that contrary to the backwards bit parser case, we did not have to skip an initial header in order to start decoding data.

✅ Implement a ForwardBitParser type in the parsing module.

The following methods would be useful for such a parser:

impl<'a> ForwardBitParser<'a> {
    /// Create a new backward bit parser. The header is skipped automatically or
    /// an error is returned if the initial 1 cannot be found in the first 8 bits.
    pub fn new(data: &'a [u8]) -> Result<Self> { todo!() }

    /// True if there are no bits left to read
    pub fn is_empty(&self) -> bool { todo!() }

    /// Get the given number of bits, or return an error.
    pub fn take(&mut self, len: usize) -> Result<u64> { todo!() }
}

Using the forward bit parser

You will now read a FSE decoder table

✅ In a decoders::fse module, add a parse_fse_table() method taking a mutable reference to a ForwardBitParser and returning either a (accuracy_log, distribution) table or an error if decoding the table failed.

The signature would be:

pub fn parse_fse_table(parser: &mut ForwardBitParser) -> Result<(u8, Vec<i16>)> {
    todo!()
}

The process to decode a FSE table is described with many details in the Zstandard RFC:

  • Read a 4 bits value and add 5. This gives you the accuracy log \(AL\), as well as the table size \(R = 2^{AL}\). You may reject large accuracy log (greater than 9 for example) as you will not encounter them in the real life.
  • As long as the sum of the absolute values of the distribution table entries is strictly lower than \(R\), you must:
    • Compute the maximum value \(m = R-\sum_{d\in\mathcal{D}}|d|\) that the next entry may reach.
    • You need to read a value between 0 and \(m+1\) from the stream – you will later subtract 1 from this value to get a number between -1 and \m\). Compute the number of bits \(b\) you need to read such a value. ⚠️ How many values are between 0 and \(m+1\), inclusive?
    • You may need to read either \(b-1\) bits or \(b\) bits to get the value \(v\) from the bitstream. You can do this in several ways:
      • Add a mechanism to your forward bit parser to reinject a bit that you haven't needed (hard)
      • Add a peek() method which returns the next bit in the bitstream without consuming it (easy).
      • Determine from the \(b-1\) value that you have read if you need an extra bit or not (easiest).
    • \(p = v-1\) is the new table coefficient, add it to the table. If the total exceeds \(R\), return an error.
    • If \(p = 0\), you must read 2 extra bits in the bitstream to get the number of extra zeroes to insert in the table. If you get 3 (0b11), you must repeat the process for extra zeroes.

Does your implementation work?

Add some tests for your implementation, such as:

#[test]
fn parse_fse_table() {
    let mut parser = ForwardBitParser::new(&[0x30, 0x6f, 0x9b, 0x03]);
    let (accuracy_log, table) = parse_fse_table(&mut parser).unwrap();
    assert_eq!(5, accuracy_log);
    assert_eq!(&[18, 6, 2, 2, 2, 1, 1][..], &table);
    assert_eq!(parser.len(), 6);
}

This test shows that:

  • the table which is described has \(32 = 2^5\) states;
  • those 32 states can outputs symbols 0 to 6;
  • the approximate respective probabilities of the symbols in the output stream will be \(\frac{18}{32}\) for symbol 0, \(\frac{6}{32}\) for symbol 1, … (meaning that 18 states will output symbol 0, 6 states will output symbol 1, …);
  • 6 bits were unused in the provided bistream.
1

Note that the way adopted by the Zstandard RFC is not the only way to represent a FSE table. In particular, the way chosen here, through a distribution of probabilities, assumes that any symbol (present in the uncompressed stream) can follow any symbol, while more specialized tables could allow the full scope of FSE decoders as shown in the introductory FSE example.