
The JSON spec [1] clearly states, in section 9:

  An implementation may set limits on the maximum depth of
  nesting.

So this is kind of a pointless exercise. The author would do well to inform themselves better before slinging epithets such as "bad" at a library.

[1]: https://tools.ietf.org/html/rfc7159



I would be perfectly OK with that answer if they documented that they use the C stack and therefore have a small finite depth limit, and/or threw some sort of TooMuchNesting exception before they hit that limit.

AFAICT, many/most of these libraries do neither of the above. There's nothing in the documentation to suggest they'll have trouble with deeply nested inputs, and their response to such a situation is to simply crash the process. I'd call that a bug.

I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons". If you've got limits, you've got to document them, or at least make them recoverable.


It's hard for a library to portably know when the C stack will exhaust or give a concrete estimate that makes sense for documentation purposes.

Firstly, that concept is not exposed at all to standard-compliant C code. The standard is vague enough to allow for many possible variations of the program stack abstraction, so you can't do this in portable, standard C.

Second, the library author doesn't know ahead of time how much stack space you, the caller, have consumed before calling the library. Do you call it deep inside some UI framework with a big stack? With a huge I/O buffer on the stack? They don't know.

Translating all of this into "this is how deeply nested your JSON can be" is not a sane exercise; at best you can give a machine-specific guess that will vary with your own application's stack needs.

But perhaps one can side-step all of that and get most of the way with an artificial, arbitrary-constant runtime depth check that is safely below likely actual stack limits.
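
A minimal sketch of that idea in Python for brevity (not any real library's API; MAX_DEPTH and TooMuchNesting are made-up names): thread an explicit depth counter through the recursive descent and raise a clear, documented error well before the native stack could overflow. It only handles nested arrays, which is enough to show the shape of the check.

    # MAX_DEPTH is an arbitrary constant chosen to sit safely under
    # typical stack limits; the exact value is a judgment call.
    MAX_DEPTH = 512

    class TooMuchNesting(ValueError):
        pass

    def parse_array(text, pos=0, depth=0):
        if depth > MAX_DEPTH:
            raise TooMuchNesting("nesting exceeds %d levels" % MAX_DEPTH)
        assert text[pos] == "["
        items, pos = [], pos + 1
        while text[pos] != "]":
            item, pos = parse_array(text, pos, depth + 1)
            items.append(item)
        return items, pos + 1

    print(parse_array("[[[]]]")[0])            # [[[]]]
    # parse_array("[" * 10**6 + "]" * 10**6)   # raises TooMuchNesting, not a crash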


Sure, that’s all true. So don’t use the C stack for parsing. It’s not difficult.
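
For concreteness, here is one way that can look, sketched in Python with an explicit heap-allocated stack instead of native recursion (again restricted to nested arrays, and not modeled on any particular library):

    def parse_nested_arrays(text):
        stack = []            # explicit stack of partially built arrays
        result = None
        for ch in text:
            if ch == "[":
                stack.append([])
            elif ch == "]":
                if not stack:
                    raise ValueError("unbalanced ']'")
                result = stack.pop()
                if stack:
                    stack[-1].append(result)
            else:
                raise ValueError("unexpected character %r" % ch)
        if stack:
            raise ValueError("unterminated array")
        return result

    # Depth is now bounded by available memory, not by the call stack.
    print(parse_nested_arrays("[" * 50000 + "]" * 50000) is not None)  # True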


> I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons".

Welcome to C, enjoy your stay.


Most of these libraries are for HLLs.


I just wrote a standard-conforming JSON parser:

    cat standard-still-bad.sh

    #!/bin/sh
    echo "ERROR: JSON parser implementation limit." 1>&2
    exit 1 # terminate and indicate error

--- https://tools.ietf.org/html/rfc7159#section-9

> An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.


"Conforming" is not identical to "useful." A car that doesn't start is still a car.


It's funny: the spec they reference[0] makes no mention of parsers or nesting, but RFC 8259[1] clearly does. Moreover, I notice they were both published in December 2017. Why two specs for the same standard?

[0] http://www.ecma-international.org/publications/files/ECMA-ST...

[1] https://tools.ietf.org/html/rfc8259


As is almost always the case when there are multiple specs purporting to describe the same thing: politics.

The Ecma spec is meant to define the format and its semantics; the RFC places requirements on implementations of that format as well as describing its semantics.


There was a rumor that the IETF was going to specify a bunch of new breaking changes in RFC 7159 (March 2014), and because of that (irrational) fear ECMA published the 1st edition of ECMA-404 very quickly in October 2013.

RFC 8259 and the 2nd edition of ECMA-404 contain the same change, a single sentence, specifying that JSON must use UTF-8.


Maybe the headline is bad, but making a table of implementation limits of various JSON parsers seems useful?

Maybe we focus on headlines too much?


I dunno, the author claims they're sorting the table from worst to best, instead of by minimum maximum depth to maximum maximum depth.


Maybe I shouldn't have called this repository "bad json parsers". I just added the following to the README:

> The name of this repository is intentionally provocative. It's goal is to document limitations of existing json parsers, not to denigrate the work of the many contributors who worked on the various libraries presented here.


The word you're looking for is clickbait ;)

"We Ranked The Top JSON Parsers By Call Stack Depth... Number 6 Will Shock You!"


First, I appreciate the change to the readme. I think that goes a long way.

Second, "bad" here is highly subjective. I've never ever hit the recursive depth limit in python with JSON, and if I do, I'll probably think about doing it a different way, in a way that won't require python to do deep recursion.


Its*. "It is goal" makes no sense.


fixed


Yeah, this is dumb.*

I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec) than not allowing thousands of nested arrays.

---

* Except that this is a potential DoS vector. From that angle, it has some interest.


> I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec)

The spec has something specifically to say about this[0]:

> Note that when such software is used, numbers that are integers and are in the range [-(2^53)+1, (2^53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

Basically, allow up to 2^53, and you ought to be fine.

[0] https://tools.ietf.org/html/rfc8259#section-6
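
A quick Python illustration of why 2^53 is the boundary: parsers that map JSON numbers onto IEEE 754 doubles cannot distinguish adjacent integers beyond it. Python's own json module keeps integers exact, so the float() call below stands in for such a parser.

    import json

    n = 2**53 + 1
    print(json.loads(json.dumps(n)) == n)  # True: Python round-trips the exact int
    print(float(n) == float(2**53))        # True: a double silently collapses n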


Agreed, such a parser would be "to spec". It says "An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers."

On a practical level, parsers not handling 64-bit ints have bitten me more than parsers not handling 1000+ levels of nesting.


As a side note, the Haskell implementation also supports numbers of arbitrary size, again limited by memory, for both parsing and encoding.


From TFA

>The two most recent JSON standards RFC 8259 and RFC 7159 both say "An implementation may set limits on the maximum depth of nesting". However, the ECMA-404 specification doesn't contain any limit on how deeply nested JSON structures can be.


To be fair, that text was added after the above comment was posted.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...


Not stating a limit doesn't mean there's NO limit. It means the limit is implementation defined. If the way you interpret a spec implies a requirement of infinite memory or infinite stack, then the way you interpret it is wrong.


Reading "MUST" when specification says "may" is also wrong.


Nevertheless, the parser failing at only 100 levels of nesting is shockingly bad.


I work with complex JSON every day. Even painfully complex structures like Facebook's Ad Insights API or AWS Lambda Events only get around 5 levels deep, so 10 deep is already exotic, and I can't even conceive of a use case for 100. The method of failure is important too--would it really be better to just keep going forever, or die when maximum stack depth was reached, or the machine runs out of RAM (or swap space...)? Treating absurdly nested JSON the same way you would any invalid JSON input seems like a much friendlier and less dangerous approach overall.


Think of error cases and malicious input instead of well-written APIs.

When I was testing Gumbo [1] on all the HTML files in Google's index, I ran into one that had 48,000 levels of nested HTML tags. (Ironically, Gumbo itself handled this fine, but the test script I used to verify the output died.) I posted it to Memegen with a link to the original source URL, and then got called out for crashing Chrome. Apparently I wasn't the only one who didn't think about recursion limits in a parser. (Both bugs have since been fixed.)

What was the wayward HTML file? It was an XML file served with the wrong content type. The file contained 48,000 self-closing tags. When you parse an unknown element with a self-closing tag in HTML, it ignores the trailing />, treats it as an ordinary unknown element, and happily keeps generating a deeper and deeper DOM.

A stack overflow that results in a segfault is a pretty serious DoS vulnerability in a JSON parser. You could probably take down a good portion of the Internet by sending sites' AJAX endpoints JSON requests that consist of 1 MB of '{'.

[1] https://github.com/google/gumbo-parser
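
A rough sketch of the payload side in Python, for scale (note that a run of bare '{' is rejected as soon as the parser expects a key, so a run of '[' is the simpler way to force deep recursion; CPython's json turns it into a RecursionError rather than a segfault):

    import json

    payload = "[" * (1 << 20)   # ~1 MiB of '[', trivially cheap to generate
    try:
        json.loads(payload)
    except RecursionError:
        print("recursion limit hit long before the end of a 1 MiB payload")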


Was Gumbo used for parsing crawled pages for indexing purposes? How was it used internally? I presume they use some headless-Chrome-like technology for most of that nowadays.


No, the indexer used a custom HTML parser that was ancient (IIRC, when I was there it wasn't even fully HTML5-compliant - it doesn't have to be, because if you're tokenizing pages for indexing the only really important parts are text, anchors, meta tags, formatting, and headers, and as long as you basically get those right it's not important that you have the exact parse tree that a browser would). It was very tightly optimized for speed, e.g. it was SAX-style, with callbacks instead of heap allocations; the data that it returned to the caller were indexes into the original source data rather than case-normalized, copied buffers; and the liberties it took with the HTML spec were largely so that it could minimize branching and state during the parse.

Headless Chrome existed within the indexing system while I was there but it was turned on for only a very small section of the index for CPU reasons. Their public statements since have indicated it's used for everything now.

Gumbo grew out of a templating language that I was working on with a larger team within Search Infrastructure and Maps. AFAIU the templating language is still being used, though I think Gumbo isn't being used for it anymore (its syntax diverged from pure HTML5 at some point). It was really intended for tools like refactoring & static analysis: that's why it's slower than the indexing HTML parser; why it was done in C rather than C++ (easier binding to scripting languages); why the API is intended to be round-trippable, with access to the original text that data structures refer to; and why it put a premium on generating accurate parses over speed.


Stick any structure in DynamoDB and you’ve already halved your nesting depth due to its encoding a superset of JSON in JSON. Then start encoding more complex structures like the ones the others have mentioned and these limitations start to bite pretty quickly.


Python handles function stack depth by limiting it to 1000 by default. I’ve run into that limitation with recursive functions a few times, but I can override it by setting a new maximum in code. Couldn’t something similar be done with JSON?
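
As an illustration, CPython's bundled json module is subject to exactly that limit, and raising it is possible (within reason):

    import json
    import sys

    deep = "[" * 5000 + "]" * 5000
    try:
        json.loads(deep)               # recurses once per nesting level
    except RecursionError:
        print("deeper than the current recursion limit")

    # sys.setrecursionlimit() raises the ceiling, but how much extra depth that
    # actually buys json.loads() depends on the CPython version and platform,
    # and a very high limit just trades a clean error for a real stack overflow.
    sys.setrecursionlimit(20000)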


> I can't even conceive of a use case for 100

A naive serialization of cons cells?
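
For example (a hypothetical encoding, sketched in Python): a linked list of n items serialized as nested [head, tail] pairs nests n levels deep, so even a modest list blows past a 100-level parser limit.

    import json

    def to_cons(items):
        tail = None
        for item in reversed(items):
            tail = [item, tail]        # each cons cell adds one nesting level
        return tail

    print(json.dumps(to_cons([1, 2, 3])))   # [1, [2, [3, null]]]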


> I can't even conceive of a use case for 100

A serialized trie?


Lots of recursive data structures (trees/graphs/*) lend themselves nicely to lots of nesting. Such documents aren't really intended to be read/written by humans.


If they aren't meant to be read/written by humans, JSON is probably a bad choice.


I couldn't disagree more.

If you need to serialize and deserialize data between tools written in different languages -- e.g. write data to a file and read it in another tool -- then what would be better than JSON?

Nothing about JSON is inherently good or bad for deeply hierarchical/recursive data.


Deeply hierarchical/recursive data is equally well served by XML and Protobuf.

However, JSON itself can be... painful for some subsets of data that you may wish to move between applications: binary data, well-typed data. JSON implementations tend [0] to use floats, so you may need to adjust your expectations about how to store numbers if precision matters. You can work around a lot of these, such as through various encodings, but they are workarounds.

A lot of those workarounds depend on re-encoding, which unfortunately means using strings, which can run afoul of "An implementation may set limits on the length and character contents of strings." and silently drop important bytes, or simply cut off the end of the payload.

[0] The grammar is well-defined, the behaviour is not. "This specification allows implementations to set limits on the range and precision of numbers accepted." https://tools.ietf.org/html/rfc8259#page-7
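
One of those re-encoding workarounds, sketched in Python: binary data carried as a base64 string in an ordinary JSON field. It works, but it's a convention layered on top of JSON, and it inherits whatever limits the receiving parser places on string length.

    import base64
    import json

    blob = bytes(range(256))                    # arbitrary binary payload
    doc = json.dumps({"data": base64.b64encode(blob).decode("ascii")})
    restored = base64.b64decode(json.loads(doc)["data"])
    assert restored == blob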


Wouldn’t XML increase file size even with compression applied? Obviously there are better machine-readable formats out there, but XML is the middle ground between efficiency and support.


I would lean toward Protobuf wherever you can use it. It is designed to deal with issues like this. However, you can't always fulfill some of the requirements for it.

In those areas, XML strikes a practical balance. Especially if you can use a significant subset of XML. (If you don't need DOM, then use a parser that doesn't produce it, etc.)


I would say protobufs.


No, and if they can't even be read by a machine, that's bad.


This works out well in practice -- the Ruby parser that ships with the language (the 100-levels one) works for most purposes and will even fall back to an all-Ruby implementation if it can't load its C extension. It's a reasonable default parser for almost all cases.

However, if you need the all-singing, all-dancing, handles-the-most-complex-use-cases-super-fast parser, which is Oj at the bottom of the table (which handles effectively unlimited nesting), then you have the option to install it and use it.


"Bad" is a value judgment; they didn't say "not to spec", "invalid", or similar.


I think you’re missing a link.


I don't like it when programs use physical limits to trigger physical errors based on the input. "JSON nesting depth of 23456 levels exceeds maximum memory of 23456MB" is a very different thing from being kill -9'd by the OOM killer (after of course it gets your ssh session).

If you decide your limit in advance and bail out of parsing, that's one thing. If you just use up all the stack and cross your fingers, that's a pain.


I know this is too late to be useful, but it only takes 1 bit per nesting level to track the nesting depth in O(1) time, or just a 64-bit integer to track the nesting depth in O(n) time. I wonder if there is a sublinear algorithm? At best, right now, it seems to me it could be made to look like bitstream compression.
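
A sketch of the single-counter version in Python: one linear pass bounds the nesting depth before any recursive parser touches the document. Brackets inside strings have to be skipped for the count to mean anything.

    def max_nesting(text):
        depth = deepest = 0
        in_string = escaped = False
        for ch in text:
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in "[{":
                depth += 1
                deepest = max(deepest, depth)
            elif ch in "]}":
                depth -= 1
        return deepest

    print(max_nesting('{"a": [1, [2, {"b": "][ not real brackets"}]]}'))  # 4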


The author has now added this as an aside: https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...


It is pointless; also, nobody ever cares much how well a text editor can handle obscure text files (file too large, broken encoding, ...). What matters is how it handles well-formatted files; the others aren't worth opening in an editor anyways :)


The commenter would do well to RTFA before slinging complaints that have already been addressed in the article.

The article is cool-headed and factual, so don't get hung up on the repo name.


To be fair, the article was updated to address those complaints after the comment you're responding to was made.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...



It is better to not set limits.


There is always a limit. The question is whether you want to use up all available memory reading bad input, or reject it sooner. If you're worried about denial-of-service attacks, it makes sense to be explicit about how big an input you'll accept.



