This seems very cool. Was going to suggest submitting it, but I see there was a fairly popular thread 5 months ago for anyone interested: https://news.ycombinator.com/item?id=45492803
It's good, but is it "the future" when it's extra work?
Consider that you could hand-code an algorithm to recognize cats in images but we would rather let the machine just figure it out for itself. We're kind of averse to manual work and complexity where we can brute force or heuristic our way out of the problem. For the 80% of situations where piping it into zstd gets you to stay within budget (bandwidth, storage, cpu time, whatever your constraint is), it's not really worth doing about 5000% more effort to squeeze out thrice the speed and a third less size
It really is considerably better, but I wonder how many people will do it, which means less implicit marketing by seeing it everywhere like we do the other tools, which means even fewer people will know to do it, etc.
The biggest savings for a service like GMail are going to be based around deduplication - e.g. if you can recognize that a newsletter went out to a thousand subscribers and store those all as deltas from a "canonical" copy - congratulations, that's >1000:1 compression, better than you could achieve with any general-purpose compression. Similarly, if you can recognize that an email is an Amazon shipping confirmation or a Facebook message notification or some other commonly repeated "form letter", you can achieve huge savings by factoring out all the common elements in them, like images or stylesheets.
These and other email specific redundancies ought to be covered by any specialized compression scheme. Also note, a lot of standard compression is deduplication. Fundamentally they are not that different.
Given that one needs to support deletes, this will end up looking like a garbage collected deduplication file system.
I kind of doubt they would do this to be honest. Every near-copy of a message is going to have small differences in at least the envelope (not sure if encoding differences are also possible depending on the server), and possibly going to be under different guarantees or jurisdictions. And it would just take one mistake to screw things up and leak data from one person to another. All for saving a few gigabytes over an account's lifetime. Doesn't really seem worth it, does it?
That's why a base and a delta. Whereas PP was talking about general compression algorithm, my question was different.
In line with the original comment, I was asking about specialized "codecs" for gmail.
Humans do not read the same email many times. That makes it a good target for compression. I believe machines do read the same email many times, but that could be architected around.