The Zstandard file format

What does data compression mean?

Data compression is a technique which takes advantage of the redundancy present in the data to represent it in less bytes than the original data. Compressed data takes less space in storage and takes less time to transmit on a communication channel.

Some compression techniques voluntarily lose less significant data to obtain better compression ratio, such as JPEG which compress images but may discard some colors nuances or other details. Some others are said to be lossless, and can reconstruct the original uncompressed data from the compressed one.

You probably already know some of those lossless common compression formats: zip, gzip, rar, gif, png, 7z or zstd. The one we are interested in is Zstandard, commonly abbreviated as zstd.

Zstandard is only a compression format. It does not take care of organizing files into archives as some utilities such as zip or rar do. It most ressembles gzip or 7z in that it only cares about one input and one output.

Zstandard in a nutshell: frames, blocks, literals and sequences

The Zstandard format (zstd) is described in RFC 8878. As for any standardized compressed format, only what goes in the compressed file and how it must be decompressed belong to the standard. A compressor is free to do whatever it needs as long as its output conforms to the standardized format.

A compressed zstd file is made of one or more frames, written after one another. Each frame can be decoded independently of the others. For example, you could take two compressed zstd files and concatenate them: it would make a valid zstd file, whose output would be the concatenation of the two original files.

Let's check this:

# Create two files f1.txt and f2.txt
$ echo "This is the content of the first file" > f1.txt
$ echo "This is the content of the second file" > f2.txt

# Compress each file with zstd
$ zstd f1.txt
f1.txt               :134.21%   (    38 B =>     51 B, f1.txt.zst)
$ zstd f2.txt
f2.txt               :133.33%   (    39 B =>     52 B, f2.txt.zst)
$ ls -l
-rw-r--r-- 1 sam users 38 Aug 11 12:36 f1.txt
-rw-r--r-- 1 sam users 51 Aug 11 12:36 f1.txt.zst
-rw-r--r-- 1 sam users 39 Aug 11 12:36 f2.txt
-rw-r--r-- 1 sam users 52 Aug 11 12:36 f2.txt.zst

# Concatenate f1.txt.zst and f2.txt.zst into f3.txt.fzst
$ cat f1.txt.zst f2.txt.zst > f3.txt.zst
$ ls -l f3.txt.zst
-rw-r--r-- 1 sam users 103 Aug 11 12:37 f3.txt.zst

# Uncompress f3.txt.zst and display the result
$ unzstd f3.txt.zst    # or, equivalently: zstd -d f3.txt.zst
f3.txt.zst          : 77 bytes
$ ls -l f3.txt
-rw-r--r-- 1 sam users 77 Aug 11 12:37 f3.txt
$ cat f3.txt
This is the content of the first file
This is the content of the second file

You may have noted that the compression has not been very efficient: f1.txt was expanded (instead of shrinking) from 38 bytes to 51 bytes, and f2.txt from 39 bytes to 52 bytes. Also, f3.txt.zst took 103 bytes on disk, while the cleartext (decompressed) version is only 77 bytes long. This can be easily explained:

The contents of f1.txt and f2.txt did not contain obvious repetitions that could have been compressed. As a consequence, the compressed files include the uncompressed content and add some overhead pertaining to the zstd format.
The content of f3.txt.zst has been made by concatenating two inefficiently compressed files.

But what if we compress f3.txt on its own? After all, it contains some repetitions ("This is the content of the " and " file\n"):

# Compress f3.txt. Add -f to overwrite the existing f3.txt.zst without confirmation
$ zstd -f f3.txt
f3.txt               : 88.31%   (    77 B =>     68 B, f3.txt.zst)
$ ls -l f3.txt.zst
-rw-r--r-- 1 sam users 68 Aug 11 12:37 f3.txt

Interesting! This time, the file has really been compressed from 77 bytes down to 68 bytes.

If we look inside the compressed file f3.txt.zst, we can get some idea of what zstd did to compress the file:

$ strings f3.txt.zst
This is the content offirst file
second file

We can get an idea of what zstd will be doing to decompress this file:

Extract "This is the content of" from the compressed file
Copy " the " from the already decoded output
Extract "first file\n" from the compressed file
Copy "This is the content of the " from the already decoded output
Extract "second file\n" from the compressed file

How does the zstd program know what to extract and what to copy? A zstd frame is made of blocks. Each block contains literals, that is text that will be extracted into the decompressed file. Those literals are followed by a list of sequences, which are basic instructions of the form "extract the first M unused bytes from the literals into the decompressed file, then copy N bytes from the already decoded output starting P bytes before the current point.". If after executing all the sequences of the block some literals have not been extracted into the decompressed file, they are extracted as a final step.

In our case, here is the state of the literals and of the output after every sequence. The following notations will be used:

The next byte to use in the literals section is shown with a caret (^).
Spaces are represented by an underscore (_) to make them more visible in the output.
Minus signs (-) show the part from the already decoded output which is copied again, while plus signs (+) show its destination.
"\n" (line feed character) is shown with a dot (.).

                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ |                                         |
     | ^                            |                                         |
     | file.second_file.            |                                         |
     |                              |                                         |

Executing the sequence (22, 5, 15) (extract 22 bytes from literal, then copy 5 bytes from 15 positions behind):

                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ | This_is_the_content_of_the_             |
     |                       ^      |        -----          +++++             |
     | file.second_file.            |                                         |
     |                              |                                         |

Executing the sequence (11, 27, 27):

                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ | This_is_the_content_of_the_first_file.T |
     |                              | ---------------------------           + |
     | file.second_file.            | his_is_the_content_of_the_              |
     |      ^                       | ++++++++++++++++++++++++++              |

We have reached the end of the sequence, the rest of the literals are extracted into the output:

                 LITERALS                             OUTPUT
     | This_is_the_content_offirst_ | This_is_the_content_of_the_first_file.T |
     |                              |                                         |
     | file.second_file.            | his_is_the_content_of_the_second_file.  |
     |                  ^           |                                         |

👏🏾 If you undestand how frames, blocks, literals and sequences work together, you already know 50% of what you need to understand for completing this project.

Of course, for the compression algorithm to be effective, some work will be needed to generate or parse a compressed zstd file:

Literals can be present as-is (as in this example), or encoded more tightly using Huffman coding.
The Huffman trees are described using weight tables, a concise representation.
The Huffman weight tables may be themselves encoded more tightly using finite state entropy (FSE) coding.
Sequences parameters may be encoded tightly using FSE followed by an additional expansion step.
FSE tables coefficients use a concise bit-level representation.
Data decoded through FSE or Huffman coding is read backwards, a bit after another.
An optional checksum may be used to verify the integrity of the decoded data.

NET7212 project

The Zstandard file format

What does data compression mean?

Zstandard in a nutshell: frames, blocks, literals and sequences