General principles

You are free to organize your project as your wish, but some general principles will be respected (unless you convince us by discussing it on the mailing-list).

💡 Do not hesitate to come back to this page frequently to ensure you respect those design principles.

⚠️ You are not supposed to implement everything described in the current page at the beginning of the project. Various parts will be described in more details in subsequent sections.

Library vs. program

While we ultimately want to produce an executable program that will take arguments from the command line, we also want all capabilities we offer to be available as a library for at least two reasons:

  1. We want other programs to be able to include our decoder as a dependency.
  2. We want to be able to test various functions using unit tests.

Your project will contain, in the same crate:1

  • a library
  • an executable

The executable should be reduced to handling the command line arguments and calling the right library functions. Its job should be kept to a minimum.

Error handling

Every module whose functions return a Result should have its own Error type, and possibly its own Result shortcut. Using the thiserror crate in the library allows easy embedding and conversion of one error type to another:

// File "mymod.rs"

#[derive(Debug, thiserror::Error)]
pub enum Error {
    // Encapsulate an I/O error without adding any more context
    // and add a `impl From<std::io::Error> for Error` implementation.
    #[error(transparent)]
    Io(#[from] std::io::Error)
    
    // Custom error originating from this module
    #[error("bad magic found {found:#06x} instead of expected {expected:#06x}")]
    BadMagic { expected: u32, found: u32 }
    
    // Encapsulate an error from another module while adding context
    #[error("corrupted Huffman table weights")]
    CorruptedWeights(#[source] fse::Error)
}

// Define a local `Result` type alias which can be used with one argument,
// the default for the second type being defined as the local `Error` type.
pub type Result<T, E = Error> = std::result::Result<T, E>;

// Example function returning a `Result` without repeating the `Error` type.
// The return type can be written as:
// - `Result<Vec<u8>>` in this module, or `Result<Vec<u8>, Error>`
// - `mymod::Result<Vec<u8>>` or `Result<Vec<u8>, mymod::Error>` from outside
fn read_file(filename: &str) -> Result<Vec<u8>> {
    Ok(std::file::read(filename)?)
}

The executable can implement generic error handling by using, for example, anyhow, or the combination of eyre and color-eyre for a more pleasant and colorful error output2.

Consuming the input

We will need to use several parser to handle the compressed content:

  • A main forward byte-oriented one, which delivers the bytes in the input in order.
  • A forward bit-oriented sub-parser to decode the coefficients of a FSE table.
  • A backward bit-oriented sub-parser to decode FSE-encoded or Huffman-encoded data.

The main parser should have the following properties:

  • It can preload data on-demand, for example if all input bytes have been consumed, or if the decoder needs to access bytes located later in the input.
  • It can lend a reference to the data to a sub-parser without copying it.
  • It may discard data that has been either consumed directly or through a sub-parser.

💡 At the beginning of the project, having the data to decode in memory as a slice of bytes might be sufficient to satisfy the requirements of the main parser.

Producing the output

Producing the output is simpler than parsing the input, as all transactions are byte-oriented and going forward. Our output producer, dedicated to a frame, should have the following properties:

  • It should emit data as soon as possible but not sooner: decoding a frame should only output its uncompressed data when previous frames have finished outputing theirs.
  • It should discard data as soon as possible, but not sooner: later blocks in the same frame may need to copy already-produced data.

Decoding a frame

Within a frame, later blocks might refer to decoding tables or data structures used in previous ones. For example, if two successive blocks use the same Huffman weights to encode the literals in both blocks, the second block will only contain a "use the previous Huffman decoder" marker.

You will need to maintain a shared state between successive blocks of the same frame. This state must contain:3

  • The three repeat offsets.
  • The latest decoders used to decode the constituents of a sequence (literals lengths, offsets, and match lenghts).
  • The latest Huffman decoder used to decode literals.
  • Some high-level information about the frame such as the window size.
  • The current checksum of the data being decoded in this frame.
1

Do not worry if you do not yet understand the Rust-specific parts of the project specification at first (e.g., crates, Result, thiserror, etc.), you will learn about all of them later during the course.

2

If your program depends on the color-eyre crate, you may omit the eyre dependency if you wish, as color-eyre depends on eyre and reexports it. You can use color_eyre::eyre; in your code then use eyre::Result.

3

Do not worry if some concepts seem obscure at the beginning of the project, you will discover and understand them later while reading the zstd standard.