Huffman decoder

You will now implement a Huffman decoder able to decode a bit stream into a set of bytes. Your implementation will be implemented as a tree as this is a direct mapping of the Huffman decoder concepts explained previously.

Huffman tree structure

Your Huffman tree structure will reside in the decoders/huffman module.

✅ Create a decoders module.

✅ Create a huffman module in the decoders module.

✅ Create a HuffmanDecoder enumerated type in the decoders/huffman module.

The HuffmanDecoder represents a node in the tree. It is either Absent, a Symbol (with a u8 payload) or a subtree represented by a Tree variant and two HuffmanDecoder values representing the left (0) and right (1) alternatives.

Note that you cannot put an object of type HuffmanDecoder inside a HuffmanDecoder itself, as the container object would have to have a greater size than the contained one, which is not possible as they have the same type. Also, you cannot use references either to represent the left and right alternatives: references do not represent ownership. Who would be the owner of the left and right alternatives?

You will have to use a type which acts like a reference (in that you don't have to store the whole object) and owns a subtree.

Inserting a symbol in a tree

✅ Write a HuffmanDecoder::insert() method which inserts a symbol in the tree at the leftmost possible free position for a given width, and returns true if it has succeded.

impl HuffmanDecoder {
    pub fn insert(&mut self, symbol: u8, width: u8) -> bool {
        todo!()
    }
}

Its behaviour will be:

  • If width is 0, the symbol must be inserted right at the top if the top is currently Absent. Otherwise this is a failure, and false must be returned.
  • Otherwise, the symbol must be inserted with a decreased width on the left branch, and if it fails on the right branch. Branches must be created (as absent) if the node doesn't contain them yet. If the top of the tree is a Symbol, just panic!() as this can should never happen. It would mean that it is impossible to insert the node without resolving to a symbol before the prefix length is reached.

For example, the tree shown in the Huffman code example can be built by running the following function:

fn build_example_tree() -> HuffmanDecoder {
    let mut tree = HuffmanDecoder::Absent;
    assert!(tree.insert(b'A', 2));
    assert!(tree.insert(b'C', 2));
    assert!(tree.insert(b'B', 1));
    tree
}

Printing a tree (optional)

You may want to implement Debug on your HuffmanDecoder type to list its content. While you may derive Debug automatically, you may want a better representation, showing the prefixes.

To do so, you can build an iterator which iterates over the nodes of your tree and their respective prefix. This iterator will keep a list of nodes to visit and their prefix internally, and yield pairs of (String, u8) which represent a prefix and the associated symbol.

✅ Build a HuffmanDecoderIterator<'a> structure which contains a list of nodes yet to visit, for example as a Vec<(&'a HuffmanDecoder, String)>.

✅ Implement Iterator for this structure: take the first node to visit by popping it from the vector. If it is a symbol, return it along with its prefix, if it is a subtree, note that you have to visit both branches (choose the order right) and try again with the node in your list, if there is nothing just go to the next node in your list. ✅ Implement Debug for HuffmanDecoder, taking advantage of the std::fmt::Formatter::debug_struct() method to structure your output.

If you do so, executing the following code:

fn main() {
   println!("{:?}", build_example_tree());
}

should print something like:

HuffmanDecoder { 00: 65, 01: 67, 1: 66 }

(because 65, 67, 66 are the ASCII codes for A, C and B respectively)

Building a tree from prefix widths

In a Zstandard file, a Huffman table will be described by succession of prefix widths (number of bits) for each symbol. After some decoding process that will be shown later, you will end up with a list of up to \(2^8\) integers, each representing the number of bits needed to represent each symbol. Missing ones mean, or prefix width set to 0, mean that this particular symbol will not be present in the output stream and thus will not appear in the Huffman table.

✅ Build a HuffmanDecoder::from_number_of_bits() constructor which takes a vector containing the number of bits for each symbol and returns a newly built HuffmanDecoder.

What you must do is:

  • Build a list of every symbol and its width.
  • Filter out zero width entries as those symbols will never appear in the output. They are present only because some later symbols in the list have a non-zero width.
  • In a newly created tree, insert every symbol at the right width, starting from the largest width and smallest symbol values within the same width. Make sure that you read the note on sort stability in Rust documentation.

Once you have done so, you should check that the following code builds the same tree as before:

fn main() {
    // 0 repeated 65 times, 2, 1, 2
    let widths: Vec<_> = std::iter::repeat(0).take(65).chain([2, 1, 2]).collect();
    println!("{:?}", HuffmanDecoder::from_number_of_bits(widths));
}

Representation through weights

Inside a Zstandard file, prefix widths are represented as weights:

  • A weight of 0 represents a symbol absent from the output.
  • If the longest prefix width is \(L\), a weight of \(w\) represents a prefix width of \(L+1-w\).

Moreover, three small additional optimizations are done to the list of weights present in a Zstandard file:

  • All 0 weights at the end of the list are absent. For example, a list of [2, 1, 0, 1, 0, 0…] will not contain the zeroes after the second 1.
  • The latest non-0 weight in the list is also absent. It can be computed by looking at the existing weights. For example, a list of [2, 1, 0, 1] will be represented as [2, 1, 0].
  • The longest prefix width is not indicated in the file. It must be deduced from the list of weights.

✅ Build a HuffmanDecoder::from_weights constructor which takes a vector of weights, represented as u8, and returns a result with the newly built HuffmanDecoder or an error. You must create a new local Error enum as you have done in other modules, as well as the Result type alias.

To compute the missing weight, which must be added to the list of weights received by the method, you can use the following properties:

  • If you sum \( 2^{w-1} \) for all weights \(w\) different from 0, you will get a power of 2.
  • No \( 2^{w-1} \) expression can be strictly greater than the sum of all others.

This should let you determine the missing last weight by ensuring that when added, the sum of \( 2^{w-1} \) rounds to the smallest strictly greater power of 2.

Once you have determined the missing weight, you can also deduce the number of bits needed to represent the weights. This can be easily done by taking the base 2 logarithm of the power of 2 you have reached.

You can now compute the prefix widths for all weights, and call your previously written HuffmanDecoder::from_number_of_bits() to build the Huffman decoder.

The following test code should build the same decoder as before:

fn main() {
    // 0 repeated 65 times, 1, 2
    let weights: Vec<_> = std::iter::repeat(0).take(65).chain([1, 2]).collect();
    println!("{:?}", HuffmanDecoder::from_weights(weights));
}

💡 You may have noticed that in our example above the vector of weights above contains a lot of 0 at the beginning. In fact, in a real weights array, many weights will be similar to each other, as they will all take a value between 0 and 8, inclusive. Does this ring a bell? Indeed, as we will see later, the weights are often themselves encoded using a FSE encoder which thrives in the presence of a small number of symbols.

In the next section, we will learn how to feed bits to our Huffman decoder and make it decode them.