
> And compression will never be able to eliminate brackets and commas.

Isn't that the whole point of compression: to minimise repetition of common features in data?



This'd be worth testing. Isn't the way gzip works to build up a dictionary of characters/phrases and then store pointers to the dictionary? I'm not sure you can compress { or } but }); is another story.


I just ran that experiment: https://gist.github.com/ggreer/4eb3ad61e97926e559f6

ASCII is extremely repetitive. 1MB of random braces gzips down to 163KB. That's a compression ratio of 6.1:1. Optimal compression would be close to 8:1 (since each byte really contains only one bit of information).


let buf = ""; for (let i = 0; i < 1024 * 1024; i++) { buf += Math.random() > 0.5 ? "{" : "}"; }

Your sample data is bunk for determining the compression characteristics of braces in a realistic JSON file.


The point was to show that compression works at the level of bits, not characters. Realistic JSON is mostly content, which means your compression ratio depends more on content redundancies than braces or quotes. So too for binary serialization formats.


Gzip uses DEFLATE, which combines LZ77 dictionary matching with Huffman coding.

Huffman coding works by replacing common symbols with shorter codes and uncommon symbols with longer codes, so the codes are variable-length.

The codes are constructed so that no code is the prefix of any other code. That way, the decoder knows when it has reached the end of a code without needing any extra information beyond the Huffman tree, which tells it which codes belong to which symbols.

The Huffman tree can be pre-agreed or, more commonly, included with the compressed data.
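The construction above can be sketched in a few lines (the symbols and frequencies here are illustrative, not DEFLATE's actual alphabet):

```javascript
// Minimal Huffman code construction: repeatedly merge the two
// lightest nodes, then read codes off the tree (left=0, right=1).
function huffmanCodes(freqs) {
  // One leaf node per symbol.
  let nodes = Object.entries(freqs).map(([sym, weight]) => ({ sym, weight }));
  // Merge until a single tree remains.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const [left, right] = nodes.splice(0, 2);
    nodes.push({ weight: left.weight + right.weight, left, right });
  }
  // Walk the tree to assign prefix-free bit strings.
  const codes = {};
  (function walk(node, prefix) {
    if (node.sym !== undefined) { codes[node.sym] = prefix || "0"; return; }
    walk(node.left, prefix + "0");
    walk(node.right, prefix + "1");
  })(nodes[0], "");
  return codes;
}

const codes = huffmanCodes({ e: 120, t: 90, a: 80, q: 2, z: 1 });
console.log(codes);
```

Frequent symbols (`e`) end up with short codes and rare ones (`z`) with long codes, and because every symbol sits at a leaf, no code can be a prefix of another.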



