
Parsing Office docs is parsing XML. We have lots of tools for parsing XML, and escaping, line continuation, text encoding etc. are all well-defined and don't need to be reimplemented specifically to support Office.
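To make the "it's just XML" point concrete, here is a minimal sketch using only the Python standard library. It assumes the modern .docx container (a ZIP archive whose main part is `word/document.xml`). The helper names `make_docx_like` and `extract_text` are made up for illustration, and the archive built here is deliberately incomplete (no content-types or relationship parts), so it is not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace used by document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def make_docx_like(text):
    """Build an in-memory ZIP holding only word/document.xml (hypothetical helper)."""
    xml = (
        f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
        f"<w:t>{escape(text)}</w:t>"
        f"</w:r></w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()

def extract_text(data):
    """Pull the text runs out with a stock XML parser; no custom unescaping needed."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))
```

Entity escaping and text decoding are handled entirely by the generic XML tooling, which is the point being made above.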

Whether parsing binary is easier or harder than parsing text depends almost completely on the grammar of the language being parsed; and let's not forget, text is, of course, a type of binary format.

If I have to do ad-hoc parsing or generation, I prefer a text format, because I have lots of tools that understand text. If I need to do production-quality work, I prefer a binary format, because I need to be complete. But if I'm integrating multiple heterogeneous systems, I want a format that is trivial to inspect and test; that may mean a well-specified text format, like JSON or XML.

I'm fairly sanguine about HTTP/2 because it's at a lower level. If I were in the business of writing HTTP clients or servers on a regular basis (rather than using existing libraries), I'd be more concerned. I only do a telnet HTTP/1.0 session every 4 months or so.



Regarding missing docs, I meant the original .doc. That was all undocumented, proprietary binary.

But again, I have to disagree that parsing text is ever easier than parsing binary. For the same protocol, carrying the same data and implemented in a sane way, the text version is the binary version plus variable-length metadata, data escaping, value conversion, and text encoding of the metadata. I'm happy to challenge anyone with the following: it's not possible to create a simpler text protocol than a well designed binary protocol (looking only at encoding/decoding, not at the debugging side).

By simpler I mean: less likely to get exploited, less ambiguous, and shorter to document (counting the docs of every encoding you depend on, like JSON or XML).
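A rough sketch of that comparison, with both wire formats invented purely for illustration: the same (name, value) record encoded once as a length-prefixed binary record and once as an escaped `name=value` text line. The function names and layouts are assumptions, not any real protocol.

```python
import struct

def encode_binary(name: bytes, value: int) -> bytes:
    # 2-byte big-endian length, raw name bytes, then a fixed 4-byte unsigned value.
    return struct.pack(">H", len(name)) + name + struct.pack(">I", value)

def decode_binary(buf: bytes):
    (n,) = struct.unpack_from(">H", buf, 0)
    name = buf[2:2 + n]
    (value,) = struct.unpack_from(">I", buf, 2 + n)
    return name, value

def encode_text(name: str, value: int) -> bytes:
    # Escape the escape character first, then the delimiter and the terminator.
    escaped = name.replace("\\", "\\\\").replace("=", "\\=").replace("\n", "\\n")
    return f"{escaped}={value}\n".encode("utf-8")

def decode_text(line: bytes):
    text = line.decode("utf-8")              # text encoding
    name, i = [], 0
    while text[i] != "=":                    # scan for the unescaped delimiter
        if text[i] == "\\":                  # data unescaping
            i += 1
            name.append("\n" if text[i] == "n" else text[i])
        else:
            name.append(text[i])
        i += 1
    value = int(text[i + 1:].rstrip("\n"))   # value conversion
    return "".join(name), value
```

The binary decoder never inspects the payload bytes, so a name containing `=` or a newline needs no special handling. The text decoder has to get the escaping, the delimiter scan, and the integer conversion all right, and this sketch still does no error handling for malformed input.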


> it's not possible to create a simpler text protocol than a well designed binary protocol

Huh? That's irrelevant, surely? It doesn't speak to your assertion. The best binary formats don't necessarily need "parsing" at all; it could be a simple matter of mapping into memory and adjusting offsets, like an OS loader. I don't think there's any debate that binary formats can be designed so that they are far easier to load than text. We're not talking about the design of protocols here (in this subthread). We're talking about writing parsers.
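As a sketch of the "map it and adjust offsets" style of loading, assume a made-up fixed layout: a 12-byte header (magic, record count, offset of the record table) followed by fixed-size records. The `DEMO` magic, the field layout, and the function names are all hypothetical.

```python
import struct

# Hypothetical layout: magic (4 bytes), record count (u32), record table offset (u32).
HEADER = struct.Struct("<4sII")
RECORD = struct.Struct("<II")   # each record: id (u32), value (u32)

def build_image(records):
    body = b"".join(RECORD.pack(i, v) for i, v in records)
    return HEADER.pack(b"DEMO", len(records), HEADER.size) + body

def load(buf):
    view = memoryview(buf)      # no copy; a real loader might use mmap here
    magic, count, table_off = HEADER.unpack_from(view, 0)
    if magic != b"DEMO":
        raise ValueError("bad magic")
    if table_off + count * RECORD.size > len(view):
        raise ValueError("record table runs past end of buffer")
    # Every record sits at a computable offset; there is nothing to tokenize.
    return [RECORD.unpack_from(view, table_off + i * RECORD.size)
            for i in range(count)]
```

The only "parsing" is reading fields at known offsets plus a bounds check, which is the sense in which a well designed binary format barely needs a parser at all.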

Parsing an obscure binary format is harder than parsing a simple text format.


I find the response confusing. First, I was responding to "Be it binary or text based, one has to write a parser anyway. Only with text based protocol one could also use Regex or string match during, which is quite useful for non-production development/testing.", which we both seem to agree is false. A well designed binary format sometimes doesn't even need a parser. And once you add stream multiplexing, regexes won't help you anyway.

I'm not sure where obscure binary formats come in. HTTP/2 had a choice between a slightly complicated binary format and a more complicated text one. Office had lots of programmers, even more money, and simply didn't care. That's a completely different situation from HTTP/2.

So finally: why bring up obscure binary formats? HTTP/2's choice is really between good text and good binary.
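To illustrate the multiplexing point, here is a simplified sketch built on the 9-byte HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, one reserved bit plus a 31-bit stream id). It handles only unpadded DATA frames (type 0x0) and ignores flags, settings, and flow control; `make_frame` and `demux` are illustrative names, not a real library API.

```python
import struct
from collections import defaultdict

DATA = 0x0  # HTTP/2 DATA frame type

def make_frame(frame_type, flags, stream_id, payload):
    length = len(payload)
    # The 24-bit length is packed as one byte plus two bytes.
    header = struct.pack(">BHBBI", length >> 16, length & 0xFFFF,
                         frame_type, flags, stream_id & 0x7FFFFFFF)
    return header + payload

def demux(buf):
    """Reassemble DATA payloads per stream (assumes no PADDED flag)."""
    streams = defaultdict(bytes)
    pos = 0
    while pos + 9 <= len(buf):
        hi, lo, ftype, _flags, sid = struct.unpack_from(">BHBBI", buf, pos)
        length = (hi << 16) | lo
        sid &= 0x7FFFFFFF                    # drop the reserved bit
        if ftype == DATA:
            streams[sid] += buf[pos + 9:pos + 9 + length]
        pos += 9 + length
    return dict(streams)

# Two bodies interleaved on one connection.
wire = (make_frame(DATA, 0, 1, b"hel") + make_frame(DATA, 0, 3, b"wor") +
        make_frame(DATA, 0, 1, b"lo") + make_frame(DATA, 0, 3, b"ld"))
```

Neither `b"hello"` nor `b"world"` appears contiguously in `wire`, so a string match over the raw bytes finds nothing, while `demux(wire)` recovers both bodies. The same would be true of a text-framed protocol once streams are interleaved.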



