Parsing Office docs is parsing XML. We have lots of tools for parsing XML, and e...

viraptor · on June 2, 2015

Regarding missing docs, I meant the original .doc. That was all undocumented, proprietary binary.

But again, I have to disagree that parsing text is ever easier than parsing binary. For the same protocol, carrying the same data and implemented in a sane way, the text version is the binary version plus variable-length metadata, data escaping, value conversion, and text encoding of the metadata. I'm happy to challenge anyone with the following: it's not possible to create a simpler text protocol than a well designed binary protocol. (I'm looking only at encoding/decoding, not at the debugging side.)

By simpler I mean: less likely to get exploited, less ambiguous, and shorter to document (once you count the docs of every encoding you depend on, like JSON or XML).
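To make the "binary + escaping + conversion" comparison concrete, here is a minimal sketch of one string field encoded both ways. The wire formats are made up for illustration (a 4-byte length prefix versus a double-quoted string with backslash escapes); neither is taken from a real protocol.

```python
import struct


def encode_binary(value: bytes) -> bytes:
    # 4-byte big-endian length, then the raw bytes. Nothing to escape.
    return struct.pack(">I", len(value)) + value


def decode_binary(buf: bytes) -> bytes:
    (length,) = struct.unpack_from(">I", buf, 0)
    payload = buf[4:4 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return payload


def encode_text(value: str) -> str:
    # The delimiter and the escape character both need escaping.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def decode_text(buf: str) -> str:
    if len(buf) < 2 or buf[0] != '"' or buf[-1] != '"':
        raise ValueError("missing quotes")
    out = []
    i, end = 1, len(buf) - 1
    while i < end:
        ch = buf[i]
        if ch == "\\":
            # Escape: take the next character literally.
            i += 1
            if i >= end:
                raise ValueError("dangling escape")
            ch = buf[i]
        elif ch == '"':
            raise ValueError("unescaped quote")
        out.append(ch)
        i += 1
    return "".join(out)
```

The binary decoder has one failure mode (short payload); the text decoder already has three, and that is before numbers, character encodings, or whitespace rules enter the picture.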

barrkel · on June 7, 2015

it's not possible to create a simpler text protocol than a well designed binary protocol

Huh? That's irrelevant, surely? It doesn't speak to your assertion. The best binary formats don't necessarily need "parsing" at all; it could be a simple matter of mapping into memory and adjusting offsets, like an OS loader. I don't think there's any debate that binary formats can be designed so that they are far easier to load than text. We're not talking about the design of protocols here (in this subthread). We're talking about writing parsers.
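As a rough illustration of the "map it and adjust offsets" style of loading, here is a sketch with a hypothetical fixed layout: a header holding a magic value, a record count, and the offset of a record table. The `DEMO` magic and both layouts are invented for the example.

```python
import struct

# Hypothetical layout, little-endian, no padding.
HEADER = struct.Struct("<4sII")  # magic, record count, offset of record table
RECORD = struct.Struct("<IH")    # record id, flags


def build_image(records):
    # Lay the record table out directly after the header.
    table = b"".join(RECORD.pack(rid, flags) for rid, flags in records)
    header = HEADER.pack(b"DEMO", len(records), HEADER.size)
    return header + table


def read_record(image, index):
    # No scanning or tokenizing: compute the offset and read the fields in place.
    view = memoryview(image)
    magic, count, table_offset = HEADER.unpack_from(view, 0)
    if magic != b"DEMO":
        raise ValueError("bad magic")
    if not 0 <= index < count:
        raise IndexError(index)
    return RECORD.unpack_from(view, table_offset + index * RECORD.size)
```

Reading record `i` costs one multiplication and one bounds check, regardless of what the other records contain.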

Parsing an obscure binary format is harder than parsing a simple text format.

viraptor · on June 11, 2015

I find the response confusing. First, I was responding to "Be it binary or text based, one has to write a parser anyway. Only with text based protocol one could also use Regex or string match during, which is quite useful for non-production development/testing." which we both seem to agree is false. A well designed binary format sometimes doesn't even need a parser. And once you add stream multiplexing, regexes won't help you anyway.
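On the multiplexing point, here is a sketch of why a regex over the raw bytes stops being useful once streams are interleaved. It uses the 9-byte HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, one reserved bit, 31-bit stream identifier) and handles only DATA frames; padding, flow control, and every other frame type are ignored, and the helper names are invented for the example.

```python
import struct
from collections import defaultdict

FRAME_HEADER = struct.Struct(">BHBBI")  # length (8 + 16 bits), type, flags, R + stream id
DATA = 0x0


def pack_frame(frame_type, flags, stream_id, payload):
    length = len(payload)
    header = FRAME_HEADER.pack(length >> 16, length & 0xFFFF,
                               frame_type, flags, stream_id & 0x7FFFFFFF)
    return header + payload


def iter_frames(buf):
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < FRAME_HEADER.size:
            raise ValueError("truncated frame header")
        hi, lo, frame_type, flags, stream = FRAME_HEADER.unpack_from(buf, pos)
        length = (hi << 16) | lo
        stream &= 0x7FFFFFFF  # drop the reserved bit
        start = pos + FRAME_HEADER.size
        end = start + length
        if end > len(buf):
            raise ValueError("truncated frame payload")
        yield frame_type, flags, stream, buf[start:end]
        pos = end


def reassemble(buf):
    # Concatenate DATA payloads per stream; other frame types are skipped.
    streams = defaultdict(bytes)
    for frame_type, _flags, stream, payload in iter_frames(buf):
        if frame_type == DATA:
            streams[stream] += payload
    return dict(streams)
```

With two streams interleaved on the wire, neither message appears contiguously in the byte stream, so string matching on the raw connection finds nothing until the frames have been demultiplexed.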

I'm not sure where obscure binary formats come in. Http2 had a choice between a slightly complicated binary format and a more complicated text one. Office had lots of programmers, even more money, and simply didn't care. That's a completely different situation from Http2.

So finally: why bring up the obscure binary format? Http2's choice is really between good text and good binary.