Comment by camgunz
5 months ago
Disclaimer: I wrote and maintain a MessagePack implementation.
Hey that's me!
Yeah they fixed that, but there are other parts of the spec that are basically unworkable, like indefinite-length values, "canonicalization", and tags; strip those out and it's essentially MP. (MP does have extension types, I should say; the virtue of tossing out CBOR's tags is that you then don't have to implement things like datetimes/timezones, bignums, etc.) And indeed at least FIDO tosses all of this out: https://fidoalliance.org/specs/fido-v2.0-ps-20190130/fido-cl...
Beyond that, CBOR is MessagePack. The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway. All this design by committee stuff is mostly wrong--though IETF has modified it by committee since.
There's no reason an MP implementation has to be slower than a CBOR implementation. If a given library wanted to be very fast it could be. If anything, the fact that CBOR more or less requires you to allocate should put a ceiling on how fast it can really be. Or, put another way, benchmarks of dynamic language implementations of a serialization format aren't a high signal indication of its speed ceiling. If you use a dynamic language and speed is a concern to this degree, you'd write an adapter yourself, probably building on one of the low level implementations.
That said, people are usually disappointed by MP's speed over JSON. A lot of engineering hours have gone into making JSON fast, to the point where I don't think it ever made sense to choose MP over it for speed reasons (there are other good reasons). Other posters here have pointed out that your metrics are usually dominated by something else.
But finally, CBOR is fine! The implementations are good and it's widely used. Users of CBOR and MP alike will probably have very similar experiences unless you have a niche use case (on an embedded device that can't allocate, you really need bignums, etc).
> there's other parts of the spec that are basically unworkable like indefinite length values, "canonicalization", and tags, making it essentially MP [...].
I'm not sure, but your wording suggests that CBOR is just as unworkable as MP because they implement the same feature set...?
But anyway, those features are not always required, but they're useful from time to time, and any complete serialization format ought to include them in some way. Canonicalization, for example, is an absolute requirement for cryptographic applications; you know JWT got so cursed due to JSON's lack of this property. The tag facilities are well thought out in my opinion, while specific tags are less so, but implementations can choose to ignore them anyway---thankfully, after the aforementioned revisions.
> The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway.
To me it looks more like the Markdown vs. CommonMark disputes; John Gruber repeatedly refused to standardize Markdown or even any subset of it, despite huge demand, because he somehow believes that standardization ruins simplicity. I don't really agree---a simple but correct standard is possible, albeit with effort. So people did their own standardizations, including CommonMark, which subtly differ from each other, but any further efforts would be inadvertently blocked by Gruber. MessagePack seems no different to me.
> To me it looks more like Markdown vs. CommonMark disputes; John Gruber repeatedly refused to standardize Markdown or even any subset in spite of huge needs because he somehow believes that standardization ruins simplicity. I don't really agree---simple but correct standard is possible, albeit with efforts.
Right, that was my take after reading about it for a while. The way MessagePack and CBOR frame the problem is fairly different, with CBOR intentionally opting for an open tagging system.
Plus it feels a bit childish bringing up the circumstances of the fork (correct or not) when they've clearly diverged a bit in purpose and scope. Markdown vs. CommonMark is an apt comparison.
I've used both and both work very well. They're both stable, and they can be parsed into native objects at a speed nearing that of a memory copy with the right implementations.
> CBOR intentionally opting for an open tagging system
CBOR's tags are MP's extension types
> I'm not sure but your wording suggests that CBOR is equally unworkable as MP because they implement the same feature set...?
That's fair; I've been a little confusing when I say things like "CBOR is MessagePack". To be clear I mean that CBOR's format is fundamentally MessagePack's, and my issues are with the stuff added beyond that.
> But anyway, those features are not always required but useful from time to time and any complete serialization format ought to include them in some way.
Totally! I think MP's extension types (CBOR's "tags") are pretty perfect for this. I mean, bignums or datetimes or whatever are often useful, and supporting extension types/tags in an implementation is really straightforward. I just don't think stuff like this belongs in a data representation format. There's a reason JSON, gRPC, Cap'n Proto, Thrift, etc. don't even support datetimes.
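To make the extension-type mechanism concrete, here's a small Python sketch of MessagePack's ext 8 wire format (marker byte, payload length, application-defined type tag, opaque payload). The function name is mine, not from any library; the point is that the serializer just carries the type tag and bytes through and never interprets them, which is why this layer stays simple.

```python
import struct

def pack_ext8(ext_type: int, payload: bytes) -> bytes:
    """Encode a MessagePack ext 8 value: 0xC7 marker, 8-bit length,
    signed 8-bit application-defined type tag, then the raw payload."""
    assert 0 <= len(payload) <= 0xFF
    return struct.pack(">BBb", 0xC7, len(payload), ext_type) + payload

# The application decides what type 42 means (a datetime, a bignum,
# whatever); the format itself doesn't care.
blob = pack_ext8(42, b"\x01\x02\x03")
assert blob == b"\xc7\x03\x2a\x01\x02\x03"
```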
> Canonicalization for example is an absolute requirement for cryptographic applications; you know JWT got so cursed due to JSON's lack of this property.
This is the example I always have in my head too. But canonicalization puts some significant requirements on a serializer. Like, when do you enable canonicalization? CBOR limits the feature set when canonicalizing, so you can do it up front and error if someone tries to add an indefinite-length type, or you can do it at the end and error then, and this by itself is a big question. You also have to recursively descend through any potentially nested type and canonicalize it. What about duplicate keys? CBOR's description of how to handle them is pretty hands-off [0], and canonicalization is silent on it [1].
But alright, you can make a reasonable library even aside from all this stuff. But are you really just trusting that things are canonicalized receiver side? Definitely not, so you do a lot of validation on your own which pretty much obviates any advantage you might get. JWT is a great use-case of people assuming the JSON was well-formed: canonicalized or not, you gotta validate. You're a lot better off just defining the format for JWT and validating receiver side; canonicalization is basically just extra work.
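The receiver-side validation being described might look something like this sketch. It's a hypothetical helper, not from any real CBOR library, and the `str`-based key encoder is a stand-in for the real encoded-key bytes that RFC 8949's deterministic encoding sorts on; decoded maps are represented as lists of (key, value) pairs so duplicates and wire order are observable at all.

```python
def validate_canonical_map(pairs, encode_key=lambda k: str(k).encode()):
    """Receiver-side check: reject duplicate keys and keys that are
    not in bytewise-lexicographic order of their encodings (the
    RFC 8949 deterministic-encoding rule)."""
    encoded = [encode_key(k) for k, _ in pairs]
    if len(set(encoded)) != len(encoded):
        raise ValueError("duplicate map key")
    if encoded != sorted(encoded):
        raise ValueError("map keys not in canonical order")
    # Nested maps must also be canonical, so recurse into values that
    # look like maps (lists of 2-tuples in this representation).
    for _, v in pairs:
        if isinstance(v, list) and all(
            isinstance(p, tuple) and len(p) == 2 for p in v
        ):
            validate_canonical_map(v, encode_key)
    return True

validate_canonical_map([("a", 1), ("b", 2)])    # ok
# validate_canonical_map([("b", 1), ("a", 2)])  # raises ValueError
```

Even this toy version has to make the duplicate-key and recursion decisions the spec leaves open, which is the point: you end up doing this work yourself whether or not the sender claimed to canonicalize.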
> To me it looks more like Markdown vs. CommonMark disputes
There was some of this because of the bytes vs. strings debate. Basically people were like, "hey wait, when I deserialize in a dynamic language that assumes strings are UTF-8, I get raw byte strings? I don't like that", but on the other hand Treasure Data (MP creators) had lots of data already stored in the existing format, so they needed (well, wanted anyway) a solution that was backwards compatible, plus you want to consider languages that don't really know about things like UTF-8 or use something else internally (C/C++, Java, Python for a while). That's where MPv5 came from, and the solution is really elegant.

If CBOR was MPv4 + strings I'd 100% agree with you, but it's a kitchen sink of stuff that's pretty poorly thought out. You can see this in the diversity of support in CBOR implementations. I'm not an expert so LMK if you know differently, but for example the "best" Rust lib for this doesn't support canonicalization [2]. Go's is really comprehensive [3], but the lengths it has to go to (internal buffer pools, etc.) are pretty bananas and beyond what you'd expect for a data representation format; plus it has knobs like disabling indefinite-length encodings, limiting their sizes, limiting the stack depth for nesting, and so on. Again, it's really easy to get into trouble here.
[0]: https://datatracker.ietf.org/doc/html/rfc8949#name-specifyin...
[1]: https://datatracker.ietf.org/doc/html/rfc8949#name-serializa...
[2]: https://github.com/enarx/ciborium/issues/144
[3]: https://github.com/fxamacker/cbor
> What about duplicate keys? CBOR's description on how to handle them is pretty hands off [0], and canonicalization is silent on it [1].
I agree on this point, DAG-CBOR in my knowledge is defined to avoid such pitfall. Again, we can agree that Bormann is not a good spec writer nor a good communicator regardless of his design skill.
> You're a lot better off just defining the format for JWT and validating receiver side; canonicalization is basically just extra work.
> I'm not an expert so LMK if you know differently, but for example the "best" Rust lib for this doesn't support canonicalization [2].
However, this argument is... absurd, to be frank. Canonicalization is an add-on, and not every implementation is going to implement it. More specifically, I'm only leaning on the fact that there is a single defined canonicalization scheme that any interested user can leverage, not that it is mandatory (say, unlike bencode), because canonicalization and other features naturally require different API designs anyway.
Let's think about the concrete case of sorted keys in maps. Most implementations are expected to return a standard mapping type for them, because that's the natural thing to do. But many, if not most, mapping types are not sorted by keys. (Python is a rare counterexample AFAIK, but its decision to order keys by default was motivated by the exact point I'm about to make.) So you either shift the burden of verification to the implementation, or you need an ordered-key iterator API, which will remain niche. We seem to agree that the canonicalization itself has to be done somewhere, but we end up with an implementation burden wherever we put the verification step. So this is not a good argument against format-standardized canonicalization at all.
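The Python case mentioned above can be shown in a couple of lines: because dicts preserve insertion order (guaranteed since 3.7), a decoder that fills a plain dict in wire order incidentally keeps exactly the information needed to verify key ordering after the fact. In most other languages the standard map type would discard it.

```python
# Simulate a decoder inserting map keys in the order they arrived on
# the wire: "b" first, then "a".
decoded = dict([("b", 2), ("a", 1)])

wire_order = list(decoded)            # insertion order survives: ['b', 'a']
assert wire_order != sorted(wire_order)  # so non-canonical input is detectable
```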
> The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway. All this design by committee stuff is mostly wrong--though IETF has modified it by committee since.
The IETF does not have a committee process. The CBOR RFC has 2 authors, Carsten Bormann and Paul Hoffman. Authors bring documents into the IETF, and the process is basically that everyone bashes¹ on it (the doc, not the people, please) until either there's a reasonable amount of agreement or they give up.
And everyone here means everyone. You could've sent a mail to the mailing list to bash on CBOR. Other MessagePack people could've sent a mail to the mailing list. You could've had comments relayed on the microphone for IETF meetings. Did that happen?
One of very few things that won't fly there is that standardization in general is bad, because the IETF doesn't believe that. But that's only the general argument — "standardizing this particular thing is bad" can and has gone through before. From some sibling comments I see this may have been a major point of contention, but I don't know it was the only one. *If* it is, it's in poor faith to drag this personal dispute into the discussion (I don't know what other disagreements and bad faith there were.)
¹ bashing here means pointing out flaws. It's up to the authors to make text changes to address them.
Aren't there a bunch of emails and what-not about it? I think that's what people are referring to.
EDIT:
> Other MessagePack people could've sent a mail to the mailing list. You could've had comments relayed on the microphone for IETF meetings. Did that happen?
Yes
> Aren't there a bunch of emails and what-not about it? I think that's what people are referring to.
Sorry, what "it"/"that" is this? I'm failing to process due to unclear references.
> > Other MessagePack people could've sent a mail to the mailing list. You could've had comments relayed on the microphone for IETF meetings. Did that happen?
> Yes
Can you point to anything? Best I can find is https://mailarchive.ietf.org/arch/msg/apps-discuss/iZM_ZqA9i... but that's not particularly useful. Boils down to questioning the utility of standards...
> Yeah they fixed that, but there's other parts of the spec that are basically unworkable
Yeah it just made me chuckle cause it was such an obvious oversight and a fun way of pointing it out. That said, I totally get that writing specs is hard, so I'm not dissing the authors as such.
> There's no reason an MP implementation has to be slower than a CBOR implementation.
Yeah that also struck me. Like, ok, that CBOR library might be faster than that MP library, but it could be that either is just missing some optimizations. And it didn't look like there were orders-of-magnitude differences in either case.
Anyway, I've only looked at CBOR and MessagePack when I dabbled with some microcontroller projects. I found both to be too big, i.e. I couldn't find a suitably small library, in compiled size or memory requirements or both. So I ended up with JSON for those. Using a SAX-like parser I could avoid dynamic allocations entirely (or close enough).
> That said I totally get that writing specs are hard, so not dissing the authors as such.
Oh definitely. Yeah maybe I come off as anti-spec or something, but in this case I just think MP was really well thought out, and then Bormann hung a bunch of stuff on it that really wasn't, and I'm salty haha.
> Anyway I've only looked at CBOR and MessagePack when I dabbled with some microcontroller projects. I found both to be too big, ie couldn't find a library suitably small, either compiled size or memory requirements or both. So I ended up with JSON for those due to that. Using a SAX-like parser I could avoid dynamic allocations entirely (or close enough).
Whaa? I wrote an MP implementation specifically for this use case: https://github.com/camgunz/cmp. JSON parsing terrifies me; there was some table of tons of JSON (de)serializers with all their weirdo bugs that I never would've thought of. There are probably pretty good test suites now though? I've never looked.
> I wrote an MP implementation specifically for this use case
Perhaps I missed that, can't recall. Will definitely try (again) tho, looks very promising.
As for parsing JSON, the upside is that it's trivial to debug over serial, both viewing and sending, and in my case I could assume limited shenanigans and fail hard if there were issues.
> parts of the spec that are basically unworkable like indefinite length values,
Is this really a problem in practice? Say, an HTTP/1.1 message also may have the body of indefinite length, and it usually works just fine.
No, in practice people just ignore the spec, but that's not really what you're hoping for when writing one.
Is it? Take JSON: its spec says JSON numbers are theoretically infinite-precision rationals, but implementations are free to impose their own restrictions. And they do: Python, for instance, is perfectly happy both serializing and deserializing e.g. 2**7000, while many other implementations (e.g. Golang's) would balk at such values. Still, it works out mostly fine in practice. Is CBOR really worse than JSON?
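The Python half of that claim is easy to check with the stdlib json module, which parses JSON numbers into Python's arbitrary-precision ints and so round-trips huge values losslessly:

```python
import json

# Python's json module serializes and parses integers of any size,
# so a ~2100-digit number survives a round trip intact.
n = 2 ** 7000
assert json.loads(json.dumps(n)) == n
```

A decoder that maps every JSON number to a 64-bit float (as Go's encoding/json does when decoding into a generic interface) can't represent this value at all, which is exactly the interoperability gap in question.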