


Bzip2 is slow. That’s the main issue. Gzip is good enough and much faster. Also, the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files is not a deal breaker, but it is annoying.

Gzip is woefully old. Its only redeeming value is that it's already built into some old tools. Otherwise, use zstd, which is better and faster, both at compression and decompression. There's no reason to use gzip in anything new, except for backwards compatibility with something old.

> Otherwise, use zstd, which is better and faster

Yes, I do. Zstd is my preferred solution nowadays. But gzip is not going anywhere as a fallback because there is a surprisingly high number of computers without a working libzstd.


One other redeeming quality that gzip/deflate does have is its low memory requirements (~32 KB per stream). If you're running on an embedded device, or if you're serving a ton of compressed streams at the same time, this can be a meaningful benefit.
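For anyone who wants to poke at the memory knob: the ~32 KB figure corresponds to DEFLATE's maximum history window, and zlib lets you pick a smaller one through `wbits`. A minimal sketch using Python's standard `zlib` module; the helper name `deflate_with_window` and the sample data are made up for illustration, and the real per-stream footprint also includes some decoder state beyond the window.

```python
import zlib


def deflate_with_window(data: bytes, wbits: int) -> bytes:
    """Compress `data` as a zlib stream using a 2**wbits-byte history window."""
    compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return compressor.compress(data) + compressor.flush()


# Hypothetical sample input, chosen only so there is something to compress.
sample = b"the quick brown fox jumps over the lazy dog\n" * 2000

small = deflate_with_window(sample, 9)   # 512-byte window
full = deflate_with_window(sample, 15)   # 32 KiB window, the DEFLATE maximum

# A decoder set up for the maximum window can read both streams.
restored_small = zlib.decompress(small)
restored_full = zlib.decompress(full)
print(len(sample), len(small), len(full))
```

A smaller window trades compression ratio for memory, which is the knob that matters on the embedded side.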

The claim that zstd is "better and faster", without additional qualifications, is false and misleading.

Indeed for many simple test cases zstd is both better and faster.

Despite that, it is possible to find input files for which zstd is either worse or slower than gzip, for any zstd options.

I have tested zstd a lot, but eventually I chose to use lrzip+gzip for certain very large files (tens of GB each) that I work with, because zstd was always either worse or slower. (For those files, at the same compression ratio and on a 6-year-old PC, lrzip+gzip always compresses at more than 100 MB/s, while zstd manages only 30 to 40 MB/s.)

There are also other applications, where I do use zstd, but no compressing program is better than all the others in ALL applications.
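The practical upshot is to measure on your own data. A rough sketch of such a measurement, limited to codecs in Python's standard library (`zlib`, `bz2`, `lzma`); zstd is left out only because this sketch sticks to the standard library, not because it shouldn't be in the comparison. The function name `measure` and the sample input are assumptions for illustration, and a single timing run like this is noisy.

```python
import bz2
import lzma
import time
import zlib

# name -> (compress, decompress), all at library default settings
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def measure(data: bytes) -> dict:
    """Return {codec: (compression_ratio, compress_seconds)} for `data`."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        if decompress(packed) != data:
            raise RuntimeError(f"{name} failed to round-trip")
        results[name] = (len(data) / len(packed), elapsed)
    return results


# Hypothetical input; substitute a representative slice of your real files.
results = measure(b"timestamp=1700000000 level=INFO msg=ok\n" * 5000)
for name, (ratio, seconds) in results.items():
    print(f"{name}: ratio {ratio:.1f}, {seconds * 1000:.1f} ms")
```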


bzip2 is particularly slow because the transform it depends on (the Burrows-Wheeler transform, BWT) is "intrinsically slow" - it depends on cache-unfriendly operations with long dependency chains, preventing the CPU from extracting any parallelism:

https://cbloomrants.blogspot.com/2021/03/faster-inverse-bwt....
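To make the "long dependency chain" concrete, here is a toy BWT and its inverse in Python. This is a textbook sketch, not how bzip2 itself implements the transform; the function names are made up. The thing to look at is the loop in `inverse_bwt`: every iteration needs the `row` produced by the previous one, and `order[row]` jumps to an essentially random position in the table.

```python
def bwt(data: bytes) -> tuple:
    """Naive forward BWT: sort all rotations, keep the last column.

    Returns (last_column, index_of_the_original_string_among_sorted_rotations).
    """
    if not data:
        raise ValueError("input must be non-empty")
    n = len(data)
    rotations = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last = bytes(data[(i - 1) % n] for i in rotations)
    return last, rotations.index(0)


def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Invert the transform by walking a permutation of the last column."""
    n = len(last)
    # Stable sort of positions by byte value: order[j] is the position in
    # `last` that holds the first byte of sorted rotation j.
    order = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = order[row]        # serial: depends on the previous iteration
        out.append(last[row])   # scattered read, unfriendly to caches
    return bytes(out)


last, primary = bwt(b"banana")
restored = inverse_bwt(last, primary)
print(last, primary, restored)
```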


> the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files

TIL. Now that's why gzip has a file header! But, tar.gz compresses even better, that's probably why it hasn't caught on.


tar packs multiple files into one. If you concatenate two gzipped files and unzip them, you just get a concatenated file.

Ah okay, I thought gzip would support decompressing multiple files that way.

How it works is, if you have two files foo.gz and bar.gz, and cat foo.gz bar.gz > foobar.gz, then foobar.gz is a valid gzip file and uncompresses to a single file with the contents of foo and bar.

It’s handy because it is very easy to just append stuff at the end of a compressed file without having to uncompress-append-recompress. It is a bit niche but I have a couple of use cases where it makes everything simpler.
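The same behaviour can be seen from Python's standard `gzip` module, which reads multi-member files. A small sketch; the file name `log.gz` and the payloads are placeholders.

```python
import gzip
import os
import tempfile

# Equivalent of: cat foo.gz bar.gz > foobar.gz
foo_gz = gzip.compress(b"contents of foo\n")
bar_gz = gzip.compress(b"contents of bar\n")
foobar_gz = foo_gz + bar_gz
combined = gzip.decompress(foobar_gz)

# Appending without uncompress-append-recompress: mode "ab" adds a new
# gzip member at the end of the existing file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.gz")
    with gzip.open(path, "wb") as f:
        f.write(b"first entry\n")
    with gzip.open(path, "ab") as f:
        f.write(b"second entry\n")
    with gzip.open(path, "rb") as f:
        log = f.read()

print(combined, log)
```

Each appended member is compressed on its own, so lots of tiny appends will compress worse than one big stream.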


tar supports that type of concatenation, so you can concatenate tar.gz files and unpack them all into separate files
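One caveat worth knowing: each tar archive ends with blocks of zeros, so a reader normally stops at the end of the first archive unless told to skip them (GNU tar has `--ignore-zeros` for this). A sketch with Python's `tarfile`, where `ignore_zeros=True` plays the same role; the helper `make_tar_gz` and the file names are made up.

```python
import io
import tarfile


def make_tar_gz(name: str, payload: bytes) -> bytes:
    """Build an in-memory .tar.gz holding a single file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Equivalent of: cat foo.tar.gz bar.tar.gz > both.tar.gz
joined = make_tar_gz("foo.txt", b"foo\n") + make_tar_gz("bar.txt", b"bar\n")

# ignore_zeros=True keeps reading past the end-of-archive marker of the
# first archive, so members of the second one are found too.
with tarfile.open(fileobj=io.BytesIO(joined), mode="r:gz", ignore_zeros=True) as tar:
    members = {m.name: tar.extractfile(m).read() for m in tar.getmembers()}

print(members)
```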

I know, but I've always been confused why a gzip file would have a filename field in its header if it's supposed to contain only one file. Obviously it's good to keep a backup of the original filename somewhere, but it's confusing nonetheless.
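For the curious, the field is optional and easy to read by hand. The sketch below follows the header layout in RFC 1952 (10 fixed bytes, then an optional extra field, then an optional zero-terminated name); the function name `gzip_original_name` and the file name `notes.txt` are made up for illustration.

```python
import gzip
import io
from typing import Optional

FEXTRA = 0x04  # header flag: an "extra" field precedes the name
FNAME = 0x08   # header flag: a zero-terminated original file name is present


def gzip_original_name(blob: bytes) -> Optional[str]:
    """Return the original file name stored in the first gzip member, if any."""
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = blob[3]
    pos = 10  # fixed part: magic (2), CM, FLG, MTIME (4), XFL, OS
    if flags & FEXTRA:
        xlen = int.from_bytes(blob[pos:pos + 2], "little")
        pos += 2 + xlen
    if not flags & FNAME:
        return None
    end = blob.index(b"\x00", pos)
    return blob[pos:end].decode("latin-1")


buf = io.BytesIO()
with gzip.GzipFile(filename="notes.txt", mode="wb", fileobj=buf) as f:
    f.write(b"hello\n")

name = gzip_original_name(buf.getvalue())
anonymous = gzip_original_name(gzip.compress(b"hello\n"))
print(name, anonymous)
```

So the name is per-member metadata about where the bytes came from, not an index of files inside the archive.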

the catting issue might be more a problem with the bzip program's implementation than with the algorithm (it could expect an array of compressed files). that would only be impossible if the program cannot reason about the length of data from the file header, which again is technically not something about the compression algo but rather the file format it's carried through.

that being said, speed is important for compression, so for systems like webservers etc. it's an easy sell ofc. very strong point (and smarter implementation in programs) for gzip


Bzip2 is great for files that are compressed once, get decompressed many times, and the size is important. A good example is a software release.

So is xz, or zstd, and the files are smaller. bzip2 disappeared from software releases when xz was widely available. gzip often remains, as the most compatible option, the FAT32 of compression algorithms.

Huh? Only if it gets decompressed few times I would say, because it's so extremely slow at it

Like a software installation that you do one time. I'd not even want it for updates if I need to update large data files. The only purpose I'd see is the first-time install where users are okay waiting a while, and small code patches that are distributed to a lot of people

(Or indeed if size is important, but then again bzip2 only shines if it's text-like. I don't really find this niche knowledge for a few % optimization worth teaching tbh. Better teach general principles so people can find a fitting solution if they ever find themselves under specific constraints like OP)


> the catting issue might be more a problem with the bzip program's implementation than with the algorithm (it could expect an array of compressed files). that would only be impossible if the program cannot reason about the length of data from the file header, which again is technically not something about the compression algo but rather the file format it's carried through.

Long comment to just say: ‘I have no idea about what I’m writing about’

These compression algorithms do not have anything to do with filesystem structure. Anyway the reason you can’t cat together parts of bzip2 but you can with zstd (and gzip) is because zstd does everything in frames and everything in those frames can be decompressed separately (so you can seek and decompress parts). Bzip2 doesn’t do that.

So like, another place bzip2 sucks ass is working with large archives because you need to seek the entire archive before you can decompress it and it makes situations without parity data way more likely to cause data loss of the whole archive. Really, don’t use it unless you have a super specific use case and know the tradeoffs. For the average person it was great when we would spend the time compressing to save the time sending over dialup.


> zstd does everything in frames and everything in those frames can be decompressed separately (so you can seek and decompress parts). Bzip2 doesn’t do that.

This isn't accurate.

1) Most zstd streams consist of a single frame. The compressor only creates multiple frames if specifically directed to do so.

2) bzip2 blocks, by contrast, are fully independent - by default, the compressor works on 900 kB blocks of input, and each one is stored with no interdependencies between blocks. (However, software support for seeking within the archive is practically nonexistent.)


So... it's actually a reasonable objection over bzip2? I mean, you explained why it does not work with bzip2.

I think their argument is sound and it makes bzip2 less useful in certain situations. I was once saved in resolving a problem we had when I figured out that concatenating gzipped files just works out of the box. If not, it would have meant a bit more code, lots of additional testing, etc.
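On the concatenation point in this subthread, it is easy to check how a given decoder behaves rather than argue about it. As one data point, Python's standard `bz2` module documents support for multi-stream input in `bz2.decompress`, while a single `BZ2Decompressor` object stops at the end of the first stream. This sketch only shows that library's behaviour; the payloads are placeholders, and other tools need to be checked separately.

```python
import bz2

part_a = bz2.compress(b"first half, ")
part_b = bz2.compress(b"second half")

# One-shot helper: walks every stream in the concatenated input.
joined = bz2.decompress(part_a + part_b)

# Single-stream decoder: stops after the first stream and hands back
# whatever followed it, untouched, in `unused_data`.
single = bz2.BZ2Decompressor()
first_only = single.decompress(part_a + part_b)
leftover = single.unused_data

print(joined, first_only, len(leftover))
```

So whether concatenation "works" is a property of the reader looping over streams, which fits the earlier comment that this is about the tool and the container format more than the compression algorithm.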


bzip and gzip are both horrible, terribly slow. Wherever I see "gz" or "bz" I immediately rip that nonsense out for zstd. There is such a thing as a right choice, and zstd is it every time.

> Wherever I see "gz" or "bz"

That should not happen too often, considering that IIRC bzip lasted only a couple of months before being replaced by bzip2.


lz4 can still be the right choice when decompression speed matters. It's almost twice as fast at decompression with similar compression ratios to zstd's fast setting.

https://github.com/facebook/zstd?tab=readme-ov-file#benchmar...


pigz is damn fast at compressing. Also, a Vax with NetBSD can run gzip. So there it is. Go try these new fancy formats on a Vax, I dare you.

And, yes, I prefer LZMA over the obsolete Bzip2 any day, but GZIP is like the ZIP of free formats modulo packaging, which is the job of TAR.


Neither has been good enough for years.


