Backward bit parser

The Huffman decoder you have just built needs to be fed some bits. In the same way that you have built a forward byte parser, you will now have to build a backward bit parser.

Why backward? Because of the way Zstandard encoders work, starting from the end of the stream to compress and writing the various bits to the compressed stream.

However, writing a stream of bits does not necessarily mean that the encoder will fill an integral number of bytes. To signal the end of the encoded bitstream, the encoder writes an extra 1 then completes the byte with 0 bits if needed.

Parsing a bitstream backwards

What does it mean to parse a bitstream backwards?

  • You must start from the last byte and progress in reverse order.
  • In every byte, you start reading bits from the most significant bit to the least significant bit.
  • When you start producing the bitstream, you must skip all initial zeroes and the first one you encounter.

For example, let us start from the [0x05, 0x74] byte sequence found in a Zstandard file. In binary, this is written as [0b00000101, 0b01110100].

The bit string will be consumed starting from the byte marked α then the byte marked β, following the arrow direction inside each byte:

00000101 01110100
|---β--> |---α-->
         ^

The caret (^) represents the next bit to read. First, the initial zeroes must be zapped until the first one is found and discarded. This happens necessarily within the initial byte. After discarding the header, the caret as advanced two positions (one initial zero and the initial one):

00000101 --110100
|---β--> |---α-->
           ^

Now, let us read a 3 bits value. We get 110 and the caret advances 3 positions.

00000101 -----100
|---β--> |---α-->
              ^

Reading 7 bits will give us 0b1000000 (3 bits from the current byte, 4 bits from the byte just before).

----0101 --------
|---β--> |---α-->
    ^

Reading 4 bits yields 0b0101: we have reached the end of the stream.

-------- --------
|---β--> |---α-->

As another example, here is how the two bitstreams from the examples would be encoded in the Zstandard file:

  • Huffman code: 10010111 would be [0b10010111, 0b00000001], or [0x97, 0x01]. Seven 0 from the last byte will be skipped before the initial 1 is found and skipped.
  • FSE code: 1110000101 would be [10000101, 00000111], or [0x85, 0x07]. Five 0 and the initial 1 will be skipped.

✅ Implement a BackwardBitParser type in the parsing module.

The following methods would be useful for such a parser:

impl<'a> BackwardBitParser<'a> {
    /// Create a new backward bit parser. The header is skipped automatically or
    /// an error is returned if the initial 1 cannot be found in the first 8 bits.
    pub fn new(data: &'a [u8]) -> Result<Self> { todo!() }

    /// True if there are no bits left to read
    pub fn is_empty(&self) -> bool { todo!() }

    /// Get the given number of bits, or return an error.
    pub fn take(&mut self, len: usize) -> Result<u64> { todo!() }
}

Using the backward bit parser

✅ Add a method HuffmanDecoder::decode() which takes a mutable reference to a BackwardBitParser and returns the decoded symbol or an error.

You will probably have to add a new variant to HuffmanDecoder::Error in order to propagate a parsing error signaled by the BackwardBitParser.

Testing our parser and decoder together

Why not test our backward bit parser and our Huffman decoder together using the example given here?

#[test]
fn huffman_project_example() {
    // 0 repeated 65 times, 1, 2
    let weights: Vec<_> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
    let decoder = HuffmanDecoder::from_weights(weights).unwrap();
    let mut parser = BackwardBitParser::new(&[0x97, 0x01]).unwrap();
    let mut result = String::new();
    while !parser.is_empty() {
        let decoded = decoder.decode(&mut parser).unwrap();
        result.push(decoded as char);  // We know they are valid A, B, or C char
    }
    assert_eq!(result, "BABCBB");
}