Sequences

Sequences are the last pieces needed to decode Zstandard compressed frames. They will contain instructions on to how to combine data coming from the literals section of the current block.

Each sequence is made of three parts:

  1. The length in byte of the data to copy from the literals section to the output.
  2. How far behind in the already output data you can find data to emit, and
  3. how many bytes from this already output data to emit again.

Decoding example

Let's assume that your literals section decodes as "abcdefgh", and that your sequences section contains two sequences which, when fully decoded, contain: (3, 2, 3) and (2, 8, 1).

The first sequence (once decoded) is (3, 2, 3):

  • You copy 3 bytes of data from the literals section to the output: "abc"

  • The data to copy starts two bytes behind (one byte would be at "c", two bytes is at "b")

  • You repeatedly copy 3 bytes starting from this position:

    Current output: abc        Bytes repeated: 0
                     ^
    Current output: abcb       Bytes repeated: 1
                      ^
    Current output: abcbc      Bytes repeated: 2
                       ^
    Current output: abcbcb     Bytes repeated: 3
                        ^
    

Now, if the second sequence is (2, 8, 1):

  • You copy 2 bytes of data from the literals section to the output: "de"
  • You copy 1 byte of data, starting 8 bytes behind the current end of output:
    Current output: abcbcbde        Bytes repeated: 0
                    ^
    Current output: abcbcbdea       Bytes repeated: 1
                     ^
    

If there are no more sequences, you end up by copying the unused literal data ("fgh") to the output, giving the final output "abcbcbdeafgh".

Sequence decoding

Sequence decoding is described in section 3.1.1.3.2 of the RFC. You will first find a header describing the number of sequences with a variable size encoding. Then you will encounter a byte describing the three decoders:

  • 2 bits will describe the way literal lengths (the number of bytes to copy from the literals section) are encoded
  • 2 bits will describe the way offsets (how much to look back for data to repeat) are encoded
  • 2 bits will describe the way the match lengths (how much to repeat) are encoded

There are four ways of encoding each tables:

  • Predefined mode: a predefined FSE table, described in the RFC, must be used. No extra data is to be read from the stream for this table. Each table kind (literal lengths, offsets, and match lengths) get its own specific predefined table.
  • RLE (run length encoding) mode: the next byte in the stream is the value that must be repeatedly produced by this decoder.
  • FSE compressed mode: the FSE table to use must be read from the stream, using a forward bit parser.
  • Repeat mode: the last decoder used for this table kind must be used again. No extra data is needed from the stream.

Once the three decoders have been set, the data stream containing the sequences themselves must be decoded using a backwards bit parser on the rest of the block.

Decoding process

All three decoders are initialized by reading the number of bits they need for their respective initialization (the accuracy log for a FSE decoder, nothing for a RLE decoder) in this order:

  • the literals lengths decoder
  • the offsets decoder
  • the match lengths decoder

Each decoder emit a symbol in the order described below. Those symbols are not the value of the literals length, offset and match lengths: they use lookup tables, described in the RFC, that may require reading some more bits from the stream to determine the value. The order in which symbols are decoded is:

  • the offset
  • the match length
  • the literals length

Then the decoders are updated from the bit stream in this order:

  • the literals lengths decoder
  • the match lengths decoder
  • the offsets decoder

The final offset decoding process

The offset value must undergo one final transformation before it can be used, as described in the RFC: three repeat offsets (Offset1, Offset2, and Offset3) variables must be kept in the decoding context and used throughout the whole frame decoding. They are initialized to 1, 4, and 8 at the beginning of the frame.

The following process updates the three repeat offset values depending on the decoded Offset value and the literals length LL value. The real value to use for the offset is the one of Offset1 after this transformation:

Decoded Offset valueLLOffset1Offset2Offset3
30Offset1 - 1Offset1Offset2
3!= 0Offset3Offset1Offset2
20Offset3Offset1Offset2
2!= 0Offset2Offset1Offset3
10Offset2Offset1Offset3
1!= 0Offset1Offset2Offset3
> 3anyOffset - 3Offset1Offset2

Suggestions on code organization

For sequence execution

✅ Execute decoded sequences with a decoded literals section from the decoding context.

In addition to the three repeat offset fields and the three literals lengths, offsets, and match lengths decoder, the following methods can be added to DecodingContext:

impl DecodingContext {
   /// Decode an offset and properly maintain the three repeat offsets
    pub fn decode_offset(&mut self, offset: usize, literals_length: usize)
        -> Result<usize>
    {
        todo!()
    }

    /// Execute the sequences while updating the offsets
    pub fn execute_sequences(
        &mut self,
        sequences: Vec<(usize, usize, usize)>,
        literals: &[u8],
    ) -> Result<()> {
        todo!()   // Using the `Self::decode_offset()` method
    }
}}

For low-level sequence decoding

✅ Create a decoders::sequence module and a decoders::sequence::SequenceDecoder type. This type will contain references to the three bit decoders to use when decoding sequences.

A suggested implementation could be (in addition to the classical module-specific Error type):

pub struct SequenceDecoder<'d> {
    literals_lengths_decoder: &'d mut dyn BitDecoder<u16>,
    offsets_decoder: &'d mut dyn BitDecoder<u16>,
    match_lengths_decoder: &'d mut dyn BitDecoder<u16>,
}

impl BitDecoder<(u16, u16, u16), Error> for SequenceDecoder<'_> {
    …
}

A SequenceDecoder would produce triples of literals length, offset, and match lengths. The offset at this stage would not have been transformed using the repeat offsets described above, this would be the job of the DecodingContext::execute_sequences() method.

✅ Also, since sequences can use a RLE decoder instead of a FSE one, implement a RLEDecoder type in a decoders::rle module. This decoder will implement BitDecoder, and never require any bit from the bit stream, and always produce the same value.

For high-level sequence decoding

✅ Create a sequences module to parse and decode sequences from the following partial skeleton, and implement the methods (and other methods and types as needed):

pub struct Sequences<'a> {
    pub number_of_sequences: usize,
    pub literal_lengths_mode: SymbolCompressionMode,
    pub offsets_mode: SymbolCompressionMode,
    pub match_lengths_mode: SymbolCompressionMode,
    pub bitstream: &'a [u8],
}

impl<'a> Sequences<'a> {
    /// Parse the sequences data from the stream
    pub fn parse(mut input: ForwardByteParser<'a>) -> Result<Self> {
        todo!()
    }

    /// Return vector of (literals length, offset value, match length) and update the
    /// decoding context with the tables if appropriate.
    pub fn decode(self, context: &mut DecodingContext)
        -> Result<Vec<(usize, usize, usize)>>
    {
        todo!()
    }
}

pub enum SymbolCompressionMode {
    PredefinedMode,
    RLEMode(u8),
    FseCompressedMode(FseTable),
    RepeatMode,
}

At the block level

✅ Add a sequences field to the CompressedBlock variant of Block. Parse and decode the sequence when parsing and decoding a compressed block.

✅ Call the execute_sequences() method of the DecodingContext when decoding a Block to apply the decoded sequences to the decoded literals.

👍 Congratulations, you are now able to decode any Zstandard compressed file!