Early on the article mentions that xz have zstd have gotten more popular than bz...

0cf8612b2e1e · 2026-03-14T20:05:06 1773518706

There was an analysis which argued that zstd is pareto optimal.

https://insanity.industries/post/pareto-optimal-compression/

lucb1e · 2026-03-15T05:02:49 1773550969

> > Pareto optimality is a situation where no […] preference criterion can be better off without making at least one […] preference criterion worse off […].

(Omissions theirs.)

Wasn't that zstandard's stated goal? Not very surprising that it has this property, also considering it's much newer (2015) than the established tools like gzip (1992), bzip2 (1996), LZMA as used by xz utils (1999)

Edit: https://github.com/facebook/zstd/blob/4856a00164c1d7b947bd38... initial commit indeed states it's meant to have good ratios and good (de)compression speeds as compared to other tools, without sacrificing one for the other (»"Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).«). So Pareto by design, just not using that name

bonzini · 2026-03-15T07:24:37 1773559477

> meant to have good ratios and good (de)compression speeds as compared to other tools

That does not mean it's Pareto optimal; Pareto-optimality forms a curve and while zstd, LZMA, LZ4, ZPAQ all want to be as close as possible to the curve, they focus on different parts of it. In particular zstd tries to stay on the middle part of the curve, while LZMA and LZ4 focus on opposite sides

          ___.--- higher throughput
        /     LZ4
       / zStd
      |
      ; LZMA
     |
     | ZPAQ
    lower size

Also, the Pareto curve is not necessarily known in advance. All you can do is add more and more algorithms or tweaks to understand what it looks like. For example, this blog post [https://insanity.industries/post/pareto-optimal-compression/] shows that prior to zstd, bzip2 and gzip2 were both pretty much Pareto optimal in the same area for ratio vs. compression speed. LZMA at low settings was a bit better but much slower. There was a huge gap between LZMA and LZ4, and bzip2/gzip filled it as best as they could.

The same blog post shows that zstd is an absolute speed demon at decompression; while not all zstd settings are Pareto optimal when looking at size vs compression speed (in particular LZMA wins at higher compression ratios, and even considering zstd only there's hardly a reason to use levels 11-15), zstd is pretty much Pareto optimal at all settings when looking at size vs. decompression speed. On the other hand at intermediate settings zstd is faster and produces smaller files than gzip, which therefore is not Pareto optimal (anymore).

rurban · 2026-03-15T15:11:54 1773587514

This misses the very best compressors by Fabrice Bellard. https://bellard.org/nncp/ and for text tm_zip

lucb1e · 2026-03-16T01:35:31 1773624931

Interesting approach. The fastest of the 4 presented compressors ("LSTM (small)") is 24 times slower than xz, and their best compressor ("LSTM (large1)") is 429 times slower than xz. Let alone gzip or, presumably, zstandard (not shown in paper). They also ran the models on different CPUs (a Core i5 and a Xeon E5) so the results are not even comparable within the same paper. A linked webpage lists the author's decompression times, which are even worse: xz decompresses twelve thousand times faster (50MB/s vs. 4kB/s) when nncp has an Nvidia RTX 3090 and 24GB RAM available to it, which apparently speeds it up by 3x compared to the original CPU implementation.

At half the size of xz's output, there can be applications for this, but you need to:

- not care about compression time

- not be constrained on hardware requirements (7.6GB RAM, ideally let it run on a GPU)

- not care about decompression time either

- and the data must be text (I can't find benchmarks other than from English Wikipedia text, but various sources emphasize it's a text compressor so presumably this is no good on e.g. a spacecraft needing to transmit sensor/research data over a weak connection, even if the power budget trade-off of running a GPU instead of pumping power into the antenna were the optimal thing to do)

usefulcat · 2026-03-15T01:57:29 1773539849

Bzip is slower than zstd and doesn’t compress as well as xz. There’s no place for it.

jgalt212 · 2026-03-15T12:30:30 1773577830

10 years ago we look at replacing gzip archives with bzip. It was just too slow for the incremental space saving gains offered. Creating bzip archives took forever. I know xz is "better", but we still just gzip everywhere.

Sesse__ · 2026-03-15T15:10:35 1773587435

xz is generally slower-but-more-dense; it's not meant to be a full gzip replacement. Zstd, on the other hand, aims to be a “better gzip”, in that it's compresses about as fast as gzip but packs somewhat denser (and decompresses faster).

darkwater · 2026-03-15T08:53:06 1773564786

You can pry bzip from my dead cold hands.

int_19h · 2026-03-15T03:49:08 1773546548

Article has the numbers for their input:

uncompressed: 327005

(gzip) zopfli --i100: 75882

zstd -22 --long --ultra: 69018

xz -9: 67940

brotli -Z: 67859

lzip -9: 67651

bzip2 -9: 63727

bzip3: 61067

lucb1e · 2026-03-15T03:59:45 1773547185

This matches my experience compressing structured text btw. Bzip2 will beat every other tool out there, both on compression ratio and, sadly, decompression time

OP says decompression time is so high because it has similar properties to a memory-hard password hash: it's bandwidth-bound due to the random access requirement. Even xz decompresses 2.5x faster, and I don't find it particularly fast

This is why I switched away, also for text compression; searching for anything that isn't near the beginning of a large file is tedious. My use-case for compression is generally not like OP's, that is, compressing 100KB so that it can fit into Minecraft (if I understood their purpose correctly); I compress files because they take too much disk space (gigabytes). But if I never wanted to access them, I'd not store them, so decompression speed matters. So I kinda do agree with GP that Bzip2 has limited purposes when Zstd is just a few % more bytes to store for over an order of magnitude more speed (1GB/s instead of 45MB/s)

Edit: And all that ignores non-json/xml/code/text compression tasks, where Bzip2/LZMA doesn't give you the best compression ratio. I'd argue it is premature optimization to use Bzip2 without a very specific use-case like OP has for very good code compression ratios and a simple decoder

necovek · 2026-03-15T10:14:19 1773569659

I wonder what the combined speed would be for small to mid-sized text files if they were fully loaded into memory first? That swaps multiple random-accesses for single sequential read (even if sequential is not really with flash memory, it should still perform better than totally random access), and memory random access which should not be a bottleneck at these speeds.

Or perhaps this is already being done this way and it is a bottleneck?

This might not work for multi-gigabyte files, but most of text content is well under 1MiB.

lucb1e · 2026-03-15T23:47:09 1773618429

This is essentially already being done. The kernel will keep in RAM as much of the file that the system is operating on as will fit

Code can care about cache locality and improve upon this default behavior, like if you find a heuristic where you can do another part of the decompression already while some of the data is still in a CPU cache and you don't have to hit main RAM, but whether that's possible will depend on the algorithm

Being in RAM sadly doesn't mean that random access doesn't matter anymore. Just like how hard drives (compare to, say, tape drives) are random access, you still have swapping times to load the data into the CPU (or arm-moving times in the case of an HDD)

ac29 · 2026-03-15T04:15:19 1773548119

I tested bzip3 vs xz compressing enwik8: bzip3 was 7x faster and had ~20% better compression

adrian_b · 2026-03-15T11:52:24 1773575544

I have also tested bzip3 and initially I encountered a great number of cases where bzip3 was much better than zstd from all points of view, i.e. compression ratio and execution times, for any zstd options.

Unfortunately, later I have also found other cases where the bzip3 performance was not so good.

I spend a lot of time on compression/decompression, but sadly there is no algorithm that can be said to be superior to all others in all circumstances, so for optimum results you must still be prepared to use any of them.

Even the ancient gzip is preferable in certain cases, when speed is more important than compression ratio, because sometimes it happens to provide a better compromise than newer algorithms.