
The JSON spec [1] clearly states, in section 9:

  An implementation may set limits on the maximum depth of
  nesting.

So this is kind of a pointless exercise. The author would do well to inform themselves better before slinging epithets such as "bad" at a library.

[1]: https://tools.ietf.org/html/rfc7159



I would be perfectly OK with that answer if they documented that they use the C stack and therefore have a small finite depth limit, and/or threw some sort of TooMuchNesting exception before they hit that limit.

AFAICT, many/most of these libraries do neither of the above. There's nothing in the documentation to suggest they'll have trouble with deeply nested inputs, and their response to such a situation is to simply crash the process. I'd call that a bug.

I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons". If you've got limits, you've got to document them, or at least make them recoverable.


It's hard for a library to portably know when the C stack will exhaust or give a concrete estimate that makes sense for documentation purposes.

Firstly, that concept is not exposed at all to standard-compliant C code. The standard is vague enough to allow for many possible variations of the program stack abstraction, so you can't do this in portable, standard C.

Second, the library author doesn't know ahead of time how much stack space you, the caller, have consumed before calling the library. Do you call it deep inside some UI framework with a big stack? With a huge I/O buffer on the stack? They don't know.

Translating all of this into "this is how deeply nested your JSON can be" is not a sane exercise; at best you can give a machine-specific guess that will vary with your own application's stack needs.

But perhaps one can side-step all of that and get most of the way with an artificial, arbitrary-constant runtime depth check that is safely below likely actual stack limits.
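
A minimal sketch of that idea in Python for brevity (not any real library's API; MAX_DEPTH and TooMuchNesting are made-up names): thread an explicit depth counter through the recursive descent and raise a clear, documented error well before the native stack could overflow. It only handles nested arrays, which is enough to show the shape of the check.

    # MAX_DEPTH is an arbitrary constant chosen to sit safely under
    # typical stack limits; the exact value is a judgment call.
    MAX_DEPTH = 512

    class TooMuchNesting(ValueError):
        pass

    def parse_array(text, pos=0, depth=0):
        if depth > MAX_DEPTH:
            raise TooMuchNesting("nesting exceeds %d levels" % MAX_DEPTH)
        assert text[pos] == "["
        items, pos = [], pos + 1
        while text[pos] != "]":
            item, pos = parse_array(text, pos, depth + 1)
            items.append(item)
        return items, pos + 1

    print(parse_array("[[[]]]")[0])            # [[[]]]
    # parse_array("[" * 10**6 + "]" * 10**6)   # raises TooMuchNesting, not a crash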


Sure, that’s all true. So don’t use the C stack for parsing. It’s not difficult.
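
For concreteness, here is one way that can look, sketched in Python with an explicit heap-allocated stack instead of native recursion (again restricted to nested arrays, and not modeled on any particular library):

    def parse_nested_arrays(text):
        stack = []            # explicit stack of partially built arrays
        result = None
        for ch in text:
            if ch == "[":
                stack.append([])
            elif ch == "]":
                if not stack:
                    raise ValueError("unbalanced ']'")
                result = stack.pop()
                if stack:
                    stack[-1].append(result)
            else:
                raise ValueError("unexpected character %r" % ch)
        if stack:
            raise ValueError("unterminated array")
        return result

    # Depth is now bounded by available memory, not by the call stack.
    print(parse_nested_arrays("[" * 50000 + "]" * 50000) is not None)  # True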


> I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons".

Welcome to C, enjoy your stay.


Most of these libraries are for HLLs.


I just wrote a standard-conforming JSON parser:

    cat standard-still-bad.sh

    #!/bin/sh
    echo "ERROR: JSON parser implementation limit." 1>&2
    exit 1 # terminate and indicate error

--- https://tools.ietf.org/html/rfc7159#section-9

> An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.


"Conforming" is not identical to "useful." A car that doesn't start is still a car.


It's funny: the spec they reference[0] makes no mention of parsers or nesting, but RFC 8259[1] clearly does. Moreover, I notice they were both published in December 2017. Why two specs for the same standard?

[0] http://www.ecma-international.org/publications/files/ECMA-ST...

[1] https://tools.ietf.org/html/rfc8259


As is almost always the case when there are multiple specs purporting to describe the same thing: politics.

The Ecma spec is meant to define the format and its semantics; the RFC places requirements on implementations of that format as well as describing its semantics.


There was a rumor that the IETF was going to specify a bunch of new breaking changes in RFC 7159 (March 2014), and because of that (irrational) fear ECMA published the 1st edition of ECMA-404 very quickly in October 2013.

RFC 8259 and the 2nd edition of ECMA-404 contain the same change, a single sentence, specifying that JSON must use UTF-8.


Maybe the headline is bad, but making a table of implementation limits of various JSON parsers seems useful?

Maybe we focus on headlines too much?


I dunno, the author claims they're sorting the table from worst to best, instead of by minimum maximum depth to maximum maximum depth.


Maybe I shouldn't have called this repository "bad json parsers". I just added the following to the README:

> The name of this repository is intentionally provocative. It's goal is to document limitations of existing json parsers, not to denigrate the work of the many contributors who worked on the various libraries presented here.


The word you're looking for is clickbait ;)

"We Ranked The Top JSON Parsers By Call Stack Depth... Number 6 Will Shock You!"


First, I appreciate the change to the readme. I think that goes a long way.

Second, "bad" here is highly subjective. I've never ever hit the recursive depth limit in python with JSON, and if I do, I'll probably think about doing it a different way, in a way that won't require python to do deep recursion.


Its*. "It is goal" makes no sense.


fixed


Yeah, this is dumb.*

I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec) than not allowing thousands of nested arrays.

---

* Except that this is a potential DoS vector. From that angle, it has some interest.


> I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec)

The spec has something specifically to say about this[0]:

> Note that when such software is used, numbers that are integers and are in the range [-(2^53)+1, (2^53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

Basically, allow up to 2^53, and you ought to be fine.

[0] https://tools.ietf.org/html/rfc8259#section-6
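
A quick Python illustration of why 2^53 is the boundary: parsers that map JSON numbers onto IEEE 754 doubles cannot distinguish adjacent integers beyond it. Python's own json module keeps integers exact, so the float() call below stands in for such a parser.

    import json

    n = 2**53 + 1
    print(json.loads(json.dumps(n)) == n)  # True: Python round-trips the exact int
    print(float(n) == float(2**53))        # True: a double silently collapses n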


Agreed, such a parser would be "to spec". It says "An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers."

On a practical level, parsers not handling 64-bit ints have bitten me more than parsers not handling 1000+ levels of nesting.


As a side note, the Haskell implementation also supports numbers of arbitrary size, again limited by memory, for both parsing and encoding.


From TFA

>The two most recent JSON standards RFC 8259 and RFC 7159 both say "An implementation may set limits on the maximum depth of nesting". However, the ECMA-404 specification doesn't contain any limit on how deeply nested JSON structures can be.


To be fair, that text was added after the above comment was posted.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...


Not stating a limit doesn't mean there's NO limit. It means the limit is implementation defined. If the way you interpret a spec implies a requirement of infinite memory or infinite stack, then the way you interpret it is wrong.


Reading "MUST" when specification says "may" is also wrong.


Nevertheless, the parser failing at only 100 levels of nesting is shockingly bad.


I work with complex JSON every day. Even painfully complex structures like Facebook's Ad Insights API or AWS Lambda Events only get around 5 levels deep, so 10 deep is already exotic, and I can't even conceive of a use case for 100. The method of failure is important too--would it really be better to just keep going forever, or die when maximum stack depth was reached, or the machine runs out of RAM (or swap space...)? Treating absurdly nested JSON the same way you would any invalid JSON input seems like a much friendlier and less dangerous approach overall.


Think of error cases and malicious input instead of well-written APIs.

When I was testing Gumbo [1] on all the HTML files in Google's index, I ran into one that had 48,000 levels of nested HTML tags. (Ironically, Gumbo itself handled this fine, but the test script I used to verify the output died.) I posted it to Memegen with a link to the original source URL, and then got called out for crashing Chrome. Apparently I wasn't the only one who didn't think about recursion limits in a parser. (Both bugs have since been fixed.)

What was the wayward HTML file? It was an XML file served with the wrong content type. The file contained 48,000 self-closing tags. When you parse an unknown element with a self-closing tag in HTML, it ignores the trailing />, treats it as an ordinary unknown element, and happily keeps generating a deeper and deeper DOM.

A stack overflow that results in a segfault is a pretty serious DoS vulnerability in a JSON parser. You could probably take down a good portion of the Internet by sending sites' AJAX endpoints JSON requests that consist of 1 MB of '{'.

[1] https://github.com/google/gumbo-parser
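
A rough sketch of the payload side in Python, for scale (note that a run of bare '{' is rejected as soon as the parser expects a key, so a run of '[' is the simpler way to force deep recursion; CPython's json turns it into a RecursionError rather than a segfault):

    import json

    payload = "[" * (1 << 20)   # ~1 MiB of '[', trivially cheap to generate
    try:
        json.loads(payload)
    except RecursionError:
        print("recursion limit hit long before the end of a 1 MiB payload")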


Was Gumbo used for parsing crawled pages for indexing purposes? How was it used internally? I presume they use some headless-Chrome-like technology for most of that nowadays.


No, the indexer used a custom HTML parser that was ancient (IIRC, when I was there it wasn't even fully HTML5-compliant - it doesn't have to be, because if you're tokenizing pages for indexing the only really important parts are text, anchors, meta tags, formatting, and headers, and as long as you basically get those right it's not important that you have the exact parse tree that a browser would). It was very tightly optimized for speed, e.g. it was SAX-style, with callbacks instead of heap allocations; the data that it returned to the caller were indexes into the original source data rather than case-normalized, copied buffers; and the liberties it took with the HTML spec were largely so that it could minimize branching and state during the parse.

Headless Chrome existed within the indexing system while I was there but it was turned on for only a very small section of the index for CPU reasons. Their public statements since have indicated it's used for everything now.

Gumbo grew out of a templating language that I was working on with a larger team within Search Infrastructure and Maps. AFAIU the templating language is still being used, though I think Gumbo isn't being used for it anymore (its syntax diverged from pure HTML5 at some point). It was really intended for tools like refactoring & static analysis: that's why it's slower than the indexing HTML parser; why it was done in C rather than C++ (easier binding to scripting languages); why the API is intended to be round-trippable, with access to the original text that data structures refer to; and why it put a premium on generating accurate parses over speed.


Stick any structure in DynamoDB and you’ve already halved your nesting depth due to its encoding a superset of JSON in JSON. Then start encoding more complex structures like the ones the others have mentioned and these limitations start to bite pretty quickly.


Python handles function stack depth by limiting it to 1000 by default. I’ve run into that limitation with recursive functions a few times, but I can override it by setting a new maximum in code. Couldn’t something similar be done with JSON?
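
As an illustration, CPython's bundled json module is subject to exactly that limit, and raising it is possible (within reason):

    import json
    import sys

    deep = "[" * 5000 + "]" * 5000
    try:
        json.loads(deep)               # recurses once per nesting level
    except RecursionError:
        print("deeper than the current recursion limit")

    # sys.setrecursionlimit() raises the ceiling, but how much extra depth that
    # actually buys json.loads() depends on the CPython version and platform,
    # and a very high limit just trades a clean error for a real stack overflow.
    sys.setrecursionlimit(20000)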


> I can't even conceive of a use case for 100

A naive serialization of cons cells?
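
For example (a hypothetical encoding, sketched in Python): a linked list of n items serialized as nested [head, tail] pairs nests n levels deep, so even a modest list blows past a 100-level parser limit.

    import json

    def to_cons(items):
        tail = None
        for item in reversed(items):
            tail = [item, tail]        # each cons cell adds one nesting level
        return tail

    print(json.dumps(to_cons([1, 2, 3])))   # [1, [2, [3, null]]]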


> I can't even conceive of a use case for 100

A serialized trie?


Lots of recursive data structures (trees/graphs/*) lend themselves nicely to lots of nesting. Such documents aren't really intended to be read/written by humans.


If they aren't meant to be read/written by humans, JSON is probably a bad choice.


I couldn't disagree more.

If you need to serialize and deserialize data between tools written in different languages -- e.g. write data to a file and read it in another tool -- then what would be better than JSON?

Nothing about JSON is inherently good or bad for deeply hierarchical/recursive data.


Deeply hierarchical/recursive data is equally well served by XML and Protobuf.

However, JSON itself can be... painful for some subsets of data that you may wish to move between applications: binary data, well-typed data. JSON implementations tend [0] to use floats, so you may need to adjust your expectations about how to store numbers if precision matters. You can work around a lot of these, such as through various encodings, but they are workarounds.

A lot of those workarounds depend on re-encoding, which unfortunately means using strings, which can run afoul of "An implementation may set limits on the length and character contents of strings." and silently drop important bytes, or simply cut off the end of the payload.

[0] The grammar is well-defined, the behaviour is not. "This specification allows implementations to set limits on the range and precision of numbers accepted." https://tools.ietf.org/html/rfc8259#page-7
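
One of those re-encoding workarounds, sketched in Python: binary data carried as a base64 string in an ordinary JSON field. It works, but it's a convention layered on top of JSON, and it inherits whatever limits the receiving parser places on string length.

    import base64
    import json

    blob = bytes(range(256))                    # arbitrary binary payload
    doc = json.dumps({"data": base64.b64encode(blob).decode("ascii")})
    restored = base64.b64decode(json.loads(doc)["data"])
    assert restored == blob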


Wouldn’t XML increase file size even with compression applied? Obviously there are better machine-readable formats out there, but XML is the middle ground between efficiency and support.


I would lean toward Protobuf wherever you can use it. It is designed to deal with issues like this. However, you can't always fulfill some of the requirements for it.

In those areas, XML strikes a practical balance. Especially if you can use a significant subset of XML. (If you don't need DOM, then use a parser that doesn't produce it, etc.)


I would say protobufs.


No, and if they can't even be read by a machine, that's bad.


This works out well in practice -- the Ruby parser that ships with the language (the 100-levels one) works for most purposes and will even fall back to an all-Ruby implementation if it can't load its C extension. It's a reasonable default parser for almost all cases.

However, if you need the all-singing, all-dancing, handles-the-most-complex-use-cases-super-fast parser, which is Oj at the bottom of the table (which handles effectively unlimited nesting), then you have the option to install it and use it.


"Bad" is a value judgment; they didn't say "not to spec", "invalid", or similar.


I think you’re missing a link.


I don't like it when programs use physical limits to trigger physical errors based on the input. "JSON nesting depth of 23456 levels exceeds maximum memory of 23456MB" is a very different thing from being kill -9'd by the OOM killer (after of course it gets your ssh session).

If you decide your limit in advance and bail out of parsing, that's one thing. If you just use up all the stack and cross your fingers, that's a pain.


I know this is too late to be useful, but it only takes 1 bit per nesting level to track the nesting depth in O(1) time, or just a 64-bit integer to track the nesting depth in O(n) time. I wonder if there is a sublinear algorithm? At best, right now, it seems to me it could be made to look like bitstream compression.
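
A sketch of the single-counter version in Python: one linear pass bounds the nesting depth before any recursive parser touches the document. Brackets inside strings have to be skipped for the count to mean anything.

    def max_nesting(text):
        depth = deepest = 0
        in_string = escaped = False
        for ch in text:
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in "[{":
                depth += 1
                deepest = max(deepest, depth)
            elif ch in "]}":
                depth -= 1
        return deepest

    print(max_nesting('{"a": [1, [2, {"b": "][ not real brackets"}]]}'))  # 4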


The author has now added this as an aside: https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...


It is pointless; also, nobody ever cares much how well a text editor can handle obscure text files (file too large, broken encoding, ...). What matters is how it handles well-formatted files; the others aren't worth opening in an editor anyways :)


The commenter would do well to RTFA before slinging complaints that have already been addressed in the article.

The article is cool-headed and factual, so don't get hung up on the repo name.


To be fair, the article was updated to address those complaints after the comment you're responding to was made.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...



It is better to not set limits.


There is always a limit. The question is whether you want to use up all available memory reading bad input, or reject it sooner. If you're worried about denial-of-service attacks, it makes sense to be explicit about how big an input you'll accept.



