
Comment by camgunz

5 months ago

> I'm not sure, but your wording suggests that CBOR is just as unworkable as MP because they implement the same feature set...?

That's fair; I've been a little unclear when I say things like "CBOR is MessagePack". To be clear, I mean that CBOR's format is fundamentally MessagePack's, and my issues are with the stuff added beyond that.

> But anyway, those features are not always required but useful from time to time and any complete serialization format ought to include them in some way.

Totally! I think MP's extension types (CBOR's "tags") are pretty perfect for this. I mean, bignums or datetimes or whatever are often useful, and supporting extension types/tags in an implementation is really straightforward. I just don't think stuff like this belongs in a data representation format. There's a reason JSON, gRPC, Cap'n Proto, Thrift, etc. don't even support datetimes.
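
To illustrate how little machinery tags actually demand, here's a toy sketch of my own (not any library's API): a tag is just a small head prepended to an already-encoded item.

```go
package main

import "fmt"

// encodeTag prepends a CBOR tag head (major type 6) to an already-encoded
// data item. Tag numbers 0-23 fit in the initial byte; 24-255 take one
// extra byte. That's essentially all the core support tags/extension
// types demand of an implementation.
func encodeTag(number uint8, item []byte) []byte {
	if number < 24 {
		return append([]byte{0xc0 | number}, item...)
	}
	return append([]byte{0xd8, number}, item...)
}

func main() {
	// Tag 1 (epoch datetime) wrapping the unsigned int 1700000000,
	// whose plain encoding is 1a 65 53 f1 00.
	fmt.Printf("% x\n", encodeTag(1, []byte{0x1a, 0x65, 0x53, 0xf1, 0x00}))
	// Output: c1 1a 65 53 f1 00
}
```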

> Canonicalization for example is an absolute requirement for cryptographic applications; you know JWT got so cursed due to JSON's lack of this property.

This is the example I always have in my head too. But canonicalization puts some significant requirements on a serializer. Like, when do you enable canonicalization? CBOR limits the feature set when canonicalizing, so you can either do it up front and error if someone tries to add an indefinite-length type, or do it at the end and error then; that choice by itself is a big design question. You also have to recursively descend through any potentially nested type and canonicalize it. What about duplicate keys? CBOR's description of how to handle them is pretty hands-off [0], and the canonicalization section is silent on it [1].
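
To make the sorting half of that concrete, here's a minimal sketch of my own (not any particular library) of RFC 8949's deterministic map-key rule: sort entries by the bytewise order of their *encoded* keys, with duplicate rejection bolted on since the spec leaves that to the application.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// A map entry whose key and value have already been encoded to CBOR bytes.
type entry struct{ key, value []byte }

// canonicalizeEntries applies RFC 8949's "core deterministic" map rule:
// sort entries by bytewise lexicographic order of their encoded keys.
// Rejecting duplicates is our own choice here; the spec leaves it to
// the application.
func canonicalizeEntries(entries []entry) ([]entry, error) {
	sort.Slice(entries, func(i, j int) bool {
		return bytes.Compare(entries[i].key, entries[j].key) < 0
	})
	for i := 1; i < len(entries); i++ {
		if bytes.Equal(entries[i-1].key, entries[i].key) {
			return nil, fmt.Errorf("duplicate map key: % x", entries[i].key)
		}
	}
	return entries, nil
}

func main() {
	entries, err := canonicalizeEntries([]entry{
		{key: []byte{0x62, 'b', 'b'}, value: []byte{0x02}}, // "bb": 2
		{key: []byte{0x61, 'a'}, value: []byte{0x01}},      // "a": 1
	})
	fmt.Println(entries, err) // "a" sorts first: its encoding starts 0x61 < 0x62
}
```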

But alright, you can make a reasonable library even aside from all this stuff. But are you really just trusting that things are canonicalized receiver side? Definitely not, so you do a lot of validation on your own, which pretty much obviates any advantage you might get. JWT is a great case study in people assuming the JSON was well-formed: canonicalized or not, you gotta validate. You're a lot better off just defining the format for JWT and validating receiver side; canonicalization is basically just extra work.
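
The receiver-side check ends up being a decode-and-re-encode round trip, something like this sketch using github.com/fxamacker/cbor/v2 (the Go library in [3]); note it only works if the round trip is lossless for your data:

```go
package main

import (
	"bytes"
	"fmt"

	cbor "github.com/fxamacker/cbor/v2"
)

// isCanonical reports whether data is already in RFC 8949 "core
// deterministic" form, by decoding it and re-encoding deterministically.
// This is the check the format can't do for you on the receiver side.
func isCanonical(data []byte) (bool, error) {
	var v interface{}
	if err := cbor.Unmarshal(data, &v); err != nil {
		return false, err
	}
	em, err := cbor.CoreDetEncOptions().EncMode()
	if err != nil {
		return false, err
	}
	reencoded, err := em.Marshal(v)
	if err != nil {
		return false, err
	}
	return bytes.Equal(data, reencoded), nil
}

func main() {
	// {"a": 1} encoded canonically: a1 61 61 01
	ok, err := isCanonical([]byte{0xa1, 0x61, 0x61, 0x01})
	fmt.Println(ok, err) // true <nil>
}
```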

> To me it looks more like Markdown vs. CommonMark disputes

There was some of this because of the bytes-vs-strings debate. Basically people were like, "hey wait, when I deserialize in a dynamic language that assumes strings are UTF-8, I get raw byte strings? I don't like that." On the other hand, Treasure Data (MessagePack's creators) had lots of data already stored in the existing format, so they needed (well, wanted, anyway) a solution that was backwards compatible, plus you want to consider languages that don't really know about things like UTF-8 or use something else internally (C/C++, Java, Python for a while). That's where MPv5 came from, and the solution is really elegant.

If CBOR were MPv4 + strings, I'd 100% agree with you, but it's a kitchen sink of stuff that's pretty poorly thought out. You can see this in the diversity of support in CBOR implementations. I'm not an expert, so LMK if you know differently, but for example the "best" Rust lib for this doesn't support canonicalization [2]. Go's is really comprehensive [3], but the lengths it has to go to (internal buffer pools, etc.) are pretty bananas and beyond what you'd expect for a data representation format. Plus it has knobs like disabling indefinite-length encodings, limiting their sizes, limiting the stack depth for nesting, and so on; again, it's really easy to get into trouble here.
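
For a flavor of those knobs, hardening a decoder looks roughly like this (field names from fxamacker/cbor/v2's DecOptions; double-check against the version you use):

```go
package main

import (
	"fmt"

	cbor "github.com/fxamacker/cbor/v2"
)

func main() {
	// The kind of hardening knobs mentioned above, from
	// github.com/fxamacker/cbor/v2's DecOptions.
	dm, err := cbor.DecOptions{
		IndefLength:      cbor.IndefLengthForbidden, // reject indefinite-length items
		MaxNestedLevels:  16,                        // cap nesting/stack depth
		MaxArrayElements: 4096,                      // cap container sizes
		MaxMapPairs:      4096,
		DupMapKey:        cbor.DupMapKeyEnforcedAPF, // error on duplicate keys
	}.DecMode()
	if err != nil {
		panic(err)
	}
	var v interface{}
	err = dm.Unmarshal([]byte{0x9f, 0x01, 0xff}, &v) // indefinite-length array
	fmt.Println(err) // rejected because IndefLength is forbidden
}
```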

[0]: https://datatracker.ietf.org/doc/html/rfc8949#name-specifyin...

[1]: https://datatracker.ietf.org/doc/html/rfc8949#name-serializa...

[2]: https://github.com/enarx/ciborium/issues/144

[3]: https://github.com/fxamacker/cbor

> What about duplicate keys? CBOR's description of how to handle them is pretty hands-off [0], and the canonicalization section is silent on it [1].

I agree on this point; DAG-CBOR, to my knowledge, is defined to avoid this pitfall. Again, we can agree that Bormann is neither a good spec writer nor a good communicator, regardless of his design skill.

> You're a lot better off just defining the format for JWT and validating receiver side; canonicalization is basically just extra work.

> I'm not an expert, so LMK if you know differently, but for example the "best" Rust lib for this doesn't support canonicalization [2].

However, this argument is... absurd, to be frank. Canonicalization is an additional feature, and not every implementation is going to implement it. More specifically, I'm only leaning on the fact that there is a single defined canonicalization scheme that can be leveraged by any interested user, not that it is mandatory (unlike, say, bencode), because canonicalization and other features naturally require different API designs anyway.

Let's think about the concrete case of sorted keys in maps. Most implementations are expected to return a standard mapping type for them, because that's the natural thing to do. But many if not most mapping types are not sorted by keys. (Python is a rare counterexample AFAIK, but its decision to order keys by default was motivated by the exact point I'm about to make.) So you either shift the burden of verification to the implementation, or you need an ordered-key iterator API, which will remain niche. We seem to agree that the canonicalization itself has to be done somewhere, but we end up with an implementation burden wherever we put the verification step. So this is not a good argument against format-standardized canonicalization at all.
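
To make that verification burden concrete, here's a toy checker of my own, restricted to a tiny CBOR subset (definite-length maps of fewer than 24 pairs, short text keys, small uint values). The point is that it has to re-walk the wire bytes, because a decoded map has already thrown the key order away:

```go
package main

import (
	"bytes"
	"fmt"
)

// keysSorted is a toy order check for a tiny subset of CBOR: a
// definite-length map (< 24 pairs) whose keys are short text strings
// (< 24 bytes) and whose values are small unsigned ints (0..23).
func keysSorted(data []byte) (bool, error) {
	if len(data) == 0 || data[0]>>5 != 5 || data[0]&0x1f >= 24 {
		return false, fmt.Errorf("not a tiny definite-length map")
	}
	pairs := int(data[0] & 0x1f)
	pos := 1
	var prev []byte
	for i := 0; i < pairs; i++ {
		if pos >= len(data) || data[pos]>>5 != 3 || data[pos]&0x1f >= 24 {
			return false, fmt.Errorf("key %d: not a short text string", i)
		}
		end := pos + 1 + int(data[pos]&0x1f)
		if end+1 > len(data) {
			return false, fmt.Errorf("truncated input")
		}
		// Compare whole encodings, head byte included, per RFC 8949.
		key := data[pos:end]
		if prev != nil && bytes.Compare(prev, key) >= 0 {
			return false, nil // out of order (or duplicate)
		}
		prev = key
		if data[end]>>5 != 0 || data[end]&0x1f >= 24 {
			return false, fmt.Errorf("value %d: not a small uint", i)
		}
		pos = end + 1
	}
	return true, nil
}

func main() {
	sorted := []byte{0xa2, 0x61, 'a', 0x01, 0x61, 'b', 0x02}   // {"a":1,"b":2}
	unsorted := []byte{0xa2, 0x61, 'b', 0x02, 0x61, 'a', 0x01} // {"b":2,"a":1}
	fmt.Println(keysSorted(sorted))   // true <nil>
	fmt.Println(keysSorted(unsorted)) // false <nil>
}
```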

  • I don't think canonicalization is really important in the world of data serialization formats (e.g., Protocol Buffers doesn't do it and things seem fine). If you're defining something you're gonna HMAC, for example, canonicalization is overkill because a data serialization format is overkill. The problem with JWT wasn't that JSON lacked canonicalization at the time (I think this is true?); the problem was that it used JSON at all. There was no real reason to do this, especially when everyone uses a JWT library anyway: the underlying format could have been anything (and newer token formats have learned this lesson).
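
For example (a sketch with made-up field names, not any real token format): a fixed-layout token plus HMAC gives exactly one byte string to sign, so there's nothing to canonicalize:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// signToken illustrates the "just define the format" alternative: the
// sender emits fields in one fixed layout (fixed-width expiry, then the
// subject), so there is exactly one encoding to MAC. The field set is
// made up for this example. Verification re-parses the same fixed
// layout and recomputes the MAC over the same bytes.
func signToken(key []byte, subject string, expiry uint64) []byte {
	buf := make([]byte, 0, 8+len(subject))
	buf = binary.BigEndian.AppendUint64(buf, expiry)
	buf = append(buf, subject...)
	mac := hmac.New(sha256.New, key)
	mac.Write(buf)
	return append(buf, mac.Sum(nil)...)
}

func main() {
	tok := signToken([]byte("secret"), "alice", 1700000000)
	fmt.Printf("%x\n", tok)
}
```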