Literals

Compressed blocks are made of two parts:

  • the literals section
  • the sequences to act on those literals

The literals section

The literals section can be of three different types: a raw (as-is) representation, an RLE encoding (a single byte repeated many times), or a compressed representation.

βœ… In a literals module, create a LiteralsSection enumeration with three variants: RawLiteralsBlock, RLELiteralsBlock and CompressedLiteralsBlock.

Each variant will be a struct-like one with some fields:

RawLiteralsBlock

A raw literals block encapsulates a reference to a slice of bytes. This means that the LiteralsSection type itself will have to take a lifetime as a generic parameter.

RLELiteralsBlock

A run-length-encoded literals block contains a byte and a number of repetitions of this byte.

CompressedLiteralsBlock

A compressed literals block contains more information:

  • huffman: an optional Huffman decoder. If it is present, the decoder stored in the decoding context will be replaced by this one and then used to decode the current block. Otherwise, the decoder already present in the context will be reused.
  • regenerated_size: the size of the regenerated data once decoded.
  • jump_table: the literals section is made of either one stream of compressed data, or four interleaved streams; in the latter case, the sizes of the 1st to 3rd streams are stored in this field, while the size of the 4th stream can be deduced by subtracting those sizes from the total compressed size.
  • data: the compressed data as a reference to a slice of bytes.

Note that we do not create a separate TreelessLiteralsBlock variant to represent the tree-less literals blocks described in the standard: we use a CompressedLiteralsBlock with its huffman field set to None.
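As a rough sketch, the enumeration could look like the following. The exact field names (beyond those described above) and the HuffmanDecoder stub are placeholders of my own; your real HuffmanDecoder will come from its own module.

```rust
// Placeholder stub so this sketch stands alone; use your real type instead.
#[allow(dead_code)]
pub struct HuffmanDecoder;

#[allow(dead_code)]
pub enum LiteralsSection<'a> {
    RawLiteralsBlock {
        data: &'a [u8],
    },
    RLELiteralsBlock {
        byte: u8,
        repeat: usize,
    },
    CompressedLiteralsBlock {
        huffman: Option<HuffmanDecoder>,
        regenerated_size: usize,
        // None when there is a single stream, sizes of streams 1-3 otherwise.
        jump_table: Option<[usize; 3]>,
        data: &'a [u8],
    },
}
```

The lifetime parameter comes from the RawLiteralsBlock and CompressedLiteralsBlock variants, which both borrow their data from the input.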

Parsing the literals section

βœ… Implement LiteralsSection::parse() with the following signature:

impl<'a> LiteralsSection<'a> {
    pub fn parse(input: &mut ForwardByteParser<'a>) -> Result<Self> {
        todo!()
    }
}

Parsing the section involves several steps:

  • The literals section header is one byte.
  • From this byte, several pieces of information can be extracted:
    • The block type
    • The format in which the regenerated size is stored
    • The regenerated size itself (most formats require reading more bytes from the stream)
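For raw and RLE blocks, the header decoding boils down to a few bit manipulations. Here is a hedged, self-contained sketch (the function name and return shape are my own; the field layout follows the Zstandard specification, RFC 8878):

```rust
/// Decode the literals section header for a raw or RLE block.
/// Returns (block_type, regenerated_size). The two low bits of the
/// first byte give the block type, the next two bits the size format.
fn parse_raw_rle_header(bytes: &[u8]) -> (u8, usize) {
    let h = bytes[0] as usize;
    let block_type = (h & 0b11) as u8; // 0 = raw, 1 = RLE
    let regenerated_size = match (h >> 2) & 0b11 {
        // Size format 00 or 10: 5-bit size, 1-byte header.
        0b00 | 0b10 => h >> 3,
        // Size format 01: 12-bit size, 2-byte header.
        0b01 => (h >> 4) | (bytes[1] as usize) << 4,
        // Size format 11: 20-bit size, 3-byte header.
        _ => (h >> 4) | (bytes[1] as usize) << 4 | (bytes[2] as usize) << 12,
    };
    (block_type, regenerated_size)
}
```

In your own code this logic would live inside LiteralsSection::parse() and read bytes through the ForwardByteParser rather than indexing a slice directly.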

In case of a compressed literals block, more information must be extracted:

  • The compressed data size. Each size (regenerated and compressed) can be encoded either on 10, 14, or 18 bits.
  • A Huffman table, unless the block is a tree-less one which reuses the Huffman decoder from the previous block. The Huffman coefficients are compressed using FSE.
  • The jump table, as explained above.
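For compressed blocks, the header bytes form one little-endian integer from which both sizes are extracted after the 4 low bits (block type and size format). This is a hedged sketch based on my reading of RFC 8878 (the function name and return tuple are my own):

```rust
/// Extract (regenerated_size, compressed_size, header_length) from the
/// header of a compressed literals block. Sizes use 10, 14, or 18 bits
/// each depending on the size format (bits 2-3 of the first byte).
fn parse_compressed_sizes(bytes: &[u8]) -> (usize, usize, usize) {
    let (bits, header_len) = match (bytes[0] >> 2) & 0b11 {
        0b00 | 0b01 => (10, 3), // 4 + 10 + 10 = 24 bits
        0b10 => (14, 4),        // 4 + 14 + 14 = 32 bits
        _ => (18, 5),           // 4 + 18 + 18 = 40 bits
    };
    // Assemble the header bytes into one little-endian integer.
    let mut v: usize = 0;
    for (i, b) in bytes[..header_len].iter().enumerate() {
        v |= (*b as usize) << (8 * i);
    }
    let mask = (1 << bits) - 1;
    let regenerated_size = (v >> 4) & mask;
    let compressed_size = (v >> (4 + bits)) & mask;
    (regenerated_size, compressed_size, header_len)
}
```

Size format 0b00 indicates a single stream, while 0b01, 0b10 and 0b11 indicate four streams preceded by a jump table.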

⚠️ To be able to parse a compressed literals block, you need to parse the Huffman table coefficients from FSE-compressed data, as explained below. Use a todo!() or unimplemented!() placeholder while you implement Huffman table parsing.

Decoding the Huffman table

The description of a Huffman table in Zstandard compressed data can take two forms. First, a byte \(N\) is read from the stream.

Direct encoded Huffman weights

If \(N \geq 128\), it means that \(N-127\) coefficients follow, each one of them using 4 bits. If the number of coefficients is odd, 4 bits are lost at the end. The rest of the stream is the Huffman-encoded data.
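Reading the direct-encoded weights amounts to splitting bytes into nibbles. A minimal sketch, assuming (as in the specification) that the first weight occupies the high nibble of the first byte:

```rust
/// Read `n` 4-bit weights, high nibble first. When `n` is odd, the
/// trailing low nibble of the last byte is discarded.
fn parse_direct_weights(bytes: &[u8], n: usize) -> Vec<u8> {
    (0..n)
        .map(|i| {
            let b = bytes[i / 2];
            if i % 2 == 0 { b >> 4 } else { b & 0x0f }
        })
        .collect()
}
```

Note that this consumes `n/2` bytes rounded up, which is what your parse_direct() will need to request from the ForwardByteParser.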

Compressed Huffman weights

If \(N \lt 128\), coefficients are described using an FSE encoding. \(N\) gives the number of bytes used to describe both the FSE table and the Huffman table coefficients. They are organized as follows in memory:

     FSE coefficients    Huffman table coefficients
     --------------->    <-------------------------
         F bytes                 N-F bytes
     <-------------------------------------------->
                    N bytes in total

The FSE table will be decoded and will consume \(F\) bytes in the process (the FSE table knows when it has reached the end of its coefficients). Then the Huffman table will be decoded from two alternating FSE decoders (see below) using the same FSE table, using the remaining \(N-F\) bytes in a backward bit parser.

The rest of the original stream, after those \(N\) bytes, is the Huffman-encoded data.

Alternating FSE decoders

To decode the Huffman table coefficients, two FSE decoders are used in succession. They are built from the same FSE table, which has just been decoded, but will hold separate states. Both decoders will consume bits from the same stream.

The first decoder will get initialized from the stream, then the second one. The first decoder will emit its initial state and consume some bits to get the next state, then the second decoder will do so as well. Back to the first decoder, which will emit a state and consume some bits, then the second one, and so on.

When a decoder tries to consume more bits than is available, it will complete them with zeroes to get to its last state, and indicate to the caller that the stream has been exhausted.
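The alternation bookkeeping can be illustrated with the following self-contained toy. Here FseDecoder is stubbed as a simple counter so the example compiles on its own; the real one comes from your FSE module and consumes bits from the shared stream in update():

```rust
// Toy stand-in for the real FSE decoder: emits its state, then "updates".
#[derive(Clone)]
struct FseDecoder {
    state: u32,
}

impl FseDecoder {
    fn symbol(&self) -> u32 {
        self.state
    }
    fn update(&mut self) {
        self.state += 1; // the real decoder consumes bits here
    }
}

struct AlternatingDecoder {
    decoders: [FseDecoder; 2],
    use_second: bool, // which decoder emits next
}

impl AlternatingDecoder {
    fn new(decoder: FseDecoder) -> Self {
        Self { decoders: [decoder.clone(), decoder], use_second: false }
    }

    /// Emit a symbol from the current decoder, update it, then switch
    /// to the other decoder for the next call.
    fn next_symbol(&mut self) -> u32 {
        let idx = self.use_second as usize;
        let s = self.decoders[idx].symbol();
        self.decoders[idx].update();
        self.use_second = !self.use_second;
        s
    }
}
```

The real implementation differs in that both decoders are first initialized from the stream (first decoder, then second) and update() reads bits; only the alternation pattern is shown here.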

βœ… Implement an AlternatingDecoder structure in a decoders::alternating module which stores two FseDecoder objects as well as a boolean indicating which decoder was updated last.

βœ… Implement AlternatingDecoder::new() which takes an FseTable. You might want to make FseTable clonable, or find another way to share it.

βœ… Implement trait BitDecoder for AlternatingDecoder.

βœ… Write some tests for AlternatingDecoder.

Decode and build a Huffman table

βœ… Write a HuffmanDecoder::parse() method with the signature shown below. You might also want to build parse_direct() and parse_fse() helper functions, but feel free to organize your code as you wish.

impl HuffmanDecoder {
    /// Build a Huffman table from the given stream. Only the bytes needed to
    /// build the table are consumed from the stream.
    pub fn parse(input: &mut ForwardByteParser) -> Result<Self> {
        let header = input.u8()?;
        let weights = if header < 128 {
            Self::parse_fse(input, header)?
        } else {
            Self::parse_direct(input, header as usize - 127)?
        };
        Self::from_weights(weights)
    }

    /// Parse the Huffman table weights directly from the stream, 4
    /// bits per weights. If there are an odd number of weights, the
    /// last four bits are lost. `number_of_weights/2` bytes (rounded
    /// up) will be consumed from the `input` stream.
    fn parse_direct(input: &mut ForwardByteParser, number_of_weights: usize)
            -> Result<Vec<u8>>
    {
        todo!()
    }

    /// Decode a FSE table and use an alternating FSE decoder to parse
    /// the Huffman table weights. `compressed_size` bytes will be
    /// consumed from the `input` stream.
    fn parse_fse(input: &mut ForwardByteParser, compressed_size: u8)
            -> Result<Vec<u8>>
    {
        todo!()
    }
}

Back to parsing the literals section

You should now be able to parse the literals section and remove the todo!() markers you have set earlier in your code.

βœ… Complete the LiteralsSection::parse() code and test it.

Decoding the literals section

βœ… Implement LiteralsSection::decode().

impl<'a> LiteralsSection<'a> {
    /// Decompress the literals section. Update the Huffman decoder in
    /// `context` if appropriate (compressed literals block with a
    /// Huffman table inside).
    pub fn decode(self, context: &mut DecodingContext) -> Result<Vec<u8>> {
        todo!()
    }
}

Remember that you might have four streams to decode with the same Huffman decoder if the jump_table is not empty. Don't bother doing it in parallel right now; you can decode them successively.
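Splitting the data into four streams from the jump table is straightforward slice arithmetic, since the 4th stream is simply whatever remains after the first three. A self-contained sketch (the function name is my own):

```rust
/// Split the compressed literals data into four streams: the jump table
/// gives the sizes of streams 1-3, and stream 4 takes the remainder.
fn split_streams(data: &[u8], jump_table: [usize; 3]) -> [&[u8]; 4] {
    let (s1, rest) = data.split_at(jump_table[0]);
    let (s2, rest) = rest.split_at(jump_table[1]);
    let (s3, s4) = rest.split_at(jump_table[2]);
    [s1, s2, s3, s4]
}
```

Each of the four streams is then decoded with the same Huffman decoder, and the outputs are concatenated in order.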

βœ… Add a CompressedBlock variant to Block, containing for now just a literals_section field.

βœ… When the block is a compressed block, add a call to LiteralsSection::parse() to Blocks::parse(). Discard the rest of the block for now, as we haven't implemented sequences parsing yet; this will be the next step.

βœ… When the block is a compressed block, add a call to LiteralsSection::decode() to Blocks::decode(), and display the decoded literals. Try it on compressed text files (most of them will use compressed literals blocks). Even though you will not be able to rebuild the decompressed data yet, you should be able to make sense of the decoded literals.