I work with complex JSON every day. Even painfully complex structures like Faceb...

nostrademons · on Nov 9, 2019

Think of error cases and malicious input instead of well-written APIs.

When I was testing Gumbo [1] on all the HTML files in Google's index, I ran into one that had 48,000 levels of nested HTML tags. (Ironically, Gumbo itself handled this fine, but the test script I used to verify the output died.) I posted it to Memegen with a link to the original source URL, and then got called out for crashing Chrome. Apparently I wasn't the only one who didn't think about recursion limits in a parser. (Both bugs have since been fixed.)

What was the wayward HTML file? It was an XML file served with the wrong content type. The file contained 48,000 self-closing tags. When you parse an unknown element with a self-closing tag in HTML, it ignores the trailing />, treats it as an ordinary unknown element, and happily keeps generating a deeper and deeper DOM.

A stack overflow that results in a segfault is a pretty serious DOS vulnerability in a JSON parser. You probably could take down a good portion of the Internet by sending JSON requests to their AJAX endpoints that consist of 1M of {.

[1] https://github.com/google/gumbo-parser

markpapadakis · on Nov 9, 2019

Was Gumbo used for parsing crawled pages for indexing reasons ? How was it used internally ? I presume they use some headless chrome like technology for most of that nowadays.

nostrademons · on Nov 9, 2019

No, the indexer used a custom HTML parser that was ancient (IIRC, when I was there it wasn't even fully HTML5-compliant - it doesn't have to be, because if you're tokenizing pages for indexing the only really important parts are text, anchors, meta tags, formatting, and headers, and as long as you basically get those right it's not important that you have the exact parse tree that a browser would.) It was very tightly optimized for speed, eg. it was SAX-style, with callbacks instead of heap allocations; the data that it returned to the caller was indexes into the original source data rather than case-normalizing and copying buffers; and the liberties it took with the HTML spec were largely so that it could minimize branching and state during the parse.

Headless Chrome existed within the indexing system while I was there but it was turned on for only a very small section of the index for CPU reasons. Their public statements since have indicated it's used for everything now.

Gumbo grew out of a templating language that I was working on with a larger team within Search Infrastructure and Maps. AFAIU the templating language is still being used, though I think Gumbo isn't being used for it anymore (its syntax diverged from pure HTML5 at some point). It was really intended for tools like refactoring & static analysis: that's why it's slower than the indexing HTML parser; why it was done in C rather than C++ (easier binding to scripting languages); why the API is intended to be round-trippable, with access to the original text that data structures refer to; and why it put a premium on generating accurate parses over speed.

axman6 · on Nov 9, 2019

Stick any structure in DynamoDB and you’ve already halved your nesting depth due to its encoding a superset off JSON in JSON. Then start encoding more complex structures like the ones the others have mentioned and these limitations start to bite pretty quickly.

Bnshsysjab · on Nov 9, 2019

Python handles function stack depth by limiting it to 1024, I’ve run into that limitation with recursive functions a few times but I can override it by setting a new maximum in code. Couldn’t similar be done with JSON?

ken · on Nov 8, 2019

> I can't even conceive of a use case for 100

A naive serialization of cons cells?

lovasoa · on Nov 8, 2019

> I can't even conceive of a use case for 100

A serialized trie ?