RFC 7049 - Concise Binary Object Representation (CBOR)

12 years ago (tools.ietf.org)

RFCs don't amount to much without adoption. The RFC database is full of protocols with grand designs and seemingly broad applicability. Look at the "Extensible Provisioning Protocol", EPP - http://tools.ietf.org/html/rfc5730 - a protocol "for the provisioning and management of objects stored in a shared central repository." It reads as a marvelously generic protocol for client-managed key-value data storage - maybe it's suited for caching systems, or cloud BLOB storage, or as an abstraction of Dropbox - but in reality it's just the protocol that internet domain registrars use to manage domain name registrations on a registry server: the nichest of niche applications, yet the subject of a dozen RFCs. It's not going to be picked up and supported by Hadoop or Dropbox or anybody else who needs client-managed object storage; they're going to stick with HTTP REST.

This CBOR format is being proposed by the VPN Consortium - presumably there's some specific VPN interoperability application they have in mind for this. In the meantime, everybody else will continue to use compressed JSON, or protocol buffers, or whatever other standards have good library support and interoperability and - crucially - adoption in their domain.

  • I agree with all the points you've made, and I haven't read this RFC beyond a quick skim, but consider:

    - a lot of the time, a dearth of implementations of a new Thing is not because the new Thing is bad, but simply because people are change-averse and lazy, even in the face of an objectively better Thing, and

    - I still consider this a quality submission; even if CBOR doesn't get adopted it's still neat to read. It's like watching one's government draft new legislation, except more relevant.

There is one significant problem I see:

the length field for compound types (arrays and maps) specifies the length in "the number of items", not in bytes. This means that while processing, if I need to skip a compound type, I actually need to process it in its entirety. Not very "small device" friendly.

In practice, I have found far more utility in knowing the byte-length of a compound field in advance than the number of items it contains. If I am interested in the field, I am going to find out the number of items anyway, because I am going to process it. If I am not interested in the field, the number of items is useless to me, but the byte-length would have come in handy.
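For what it's worth, here is a minimal sketch (Python, hand-rolled, the names are mine; indefinite lengths and well-formedness checks left out) of what skipping costs when only item counts are available:

    def read_head(buf, pos):
        """Decode the initial byte and argument of the item starting at pos."""
        ib = buf[pos]
        major, info = ib >> 5, ib & 0x1f
        if info < 24:
            return major, info, pos + 1
        n = {24: 1, 25: 2, 26: 4, 27: 8}[info]      # indefinite lengths not handled
        return major, int.from_bytes(buf[pos + 1:pos + 1 + n], "big"), pos + 1 + n

    def skip_item(buf, pos):
        """Return the position just past the data item starting at pos."""
        major, arg, pos = read_head(buf, pos)
        if major in (0, 1, 7):                      # ints, simple values, floats
            return pos
        if major in (2, 3):                         # strings carry a byte length: one hop
            return pos + arg
        if major == 6:                              # tag wraps exactly one item
            return skip_item(buf, pos)
        count = arg * 2 if major == 5 else arg      # maps hold key/value pairs
        for _ in range(count):                      # arrays/maps: must walk every member
            pos = skip_item(buf, pos)
        return pos

    skip_item(bytes.fromhex("83010203"), 0)         # the array [1, 2, 3] -> 4 bytes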

  • I think the thinking here is that the sender may not be able to compute the byte size of the object a priori. Think HTTP chunked encoding.

    • I understand that is a concern in many situations. The problem here, though, is that you don't get the "streaming" benefits either way: you still have to include the length-in-number-of-items of the compound type, plus the lengths of each individual member item.

  • I agree. I think CBOR trades off a bit too much efficiency of in-place data access for compactness of representation.

    • Isn't the whole point of binary serialization formats efficiency and ease of parsing? Otherwise you might as well use .json.gz and probably end up with smaller files anyway.

  • Well, just write the file type to deduce the size, then.

    I'm not sure I understand the problem you describe, really.

    Even if there are strings, just encode their lengths; and if you store a compound type, write the size whenever it can vary.

Avoiding the need for protocol version negotiation might be a useful feature in some systems, but it seems to me that the things you lose make it really not worth it. In particular, a protocol without atoms invariably ends up like most JSON APIs -- very 'stringly typed', somewhat poorly defined, and verbose on the wire.

Which is strange for a thing calling itself 'concise'.

  • It does seem an odd trade-off. Having key-value pairs is great for prototyping, and the keys make it easier for people to interpret the messages and to write code that uses them. On the other hand, repeatedly sending readable key names seems a huge waste. I guess when streaming you could send a header with a map in it, but that makes things complicated....

Anyone care to comment on where we might use such a thing? Is it already in use? And does it compare favourably with BSON?

  • In Appendix E of the spec, "Comparison of Other Binary Formats to CBOR's Design Objectives", there are several comparisons, including one of BSON:

       [BSON] is a data format that was developed for the storage of JSON-
       like maps (JSON objects) in the MongoDB database.  Its major
       distinguishing feature is the capability for in-place update,
       foregoing a compact representation.  BSON uses a counted
       representation except for map keys, which are null-byte terminated.
       While BSON can be used for the representation of JSON-like objects on
       the wire, its specification is dominated by the requirements of the
       database application and has become somewhat baroque.  The status of
       how BSON extensions will be implemented remains unclear.

  • It looks closer to msgpack if anything, but with actual strings and bytes.

    • It is inspired by msgpack. If you read the RFC then you will see that the authors like msgpack but have some different requirements.

This gets a surprising number of things right. I've worked on a couple of these. In particular I'm delighted to see both the definite and indefinite streams of things.

I'm a little bit tired (well, more than a little tired) of standards that aren't couched in terms that are directly executable. English descriptions and pseudo-code are fine, but in the end I want to have some working code that implements an API for the stuff. It doesn't have to be an official API, but something usable shows me that (a) it is indeed usable, and (b) it will go a long way towards heading off other people's mistakes.

We don't do crypto without test vectors. I don't know why we think we can do other complex standards without test vectors, either. (I worked on some verification suites at NBS / NIST in the 70s. Have we lost that practice?)

I think that much of what is busted on the modern web can be traced back to loose English and a lack of reference code (even stuff with placeholders). CSS, HTML, etc., I'm looking at you... :-/

  • > In particular I'm delighted to see both the definite and indefinite streams of things.

    Why? I can see the advantages of either one, but I don't see what having both gets you.

    In my experience the implementation advantages of having length-prefixed lists disappear if you have to support indefinite lengths anyway.

    • I want to use the same data structures for

      - Passing small messages around

      - Doing streaming of large content (occasionally)

      I'm probably doing these over different pipes, but the data shares a lot of the same characteristics and I don't want to use two totally different APIs to get the job done.

      "Large" can be "I need to transfer something on the order of megabytes using a 4K intermediate buffer."

They should do the world a favor and include a datetime type.

  • Indeed, this is a known terrible mistake, easily avoided.

    The lack of the string "UUID" in the RFC is also cause for concern.

  • They have a tag for date strings, or you can use seconds from epoch, as an integer _or floating-point_. So if you actually want to represent time with proper fractional seconds, you're stuck representing them as strings. Hardly concise.

  • What's terribly wrong with http://tools.ietf.org/html/rfc7049#section-2.4.1?

    • The minor issues are missing timezone and precision information.

      But, most importantly, use of integers for datetime values hides type-level semantics. It's just integers, and you, the end user, not the deserializer, are responsible for handling the types.

      I think it's quite inconvenient to do tons of `data["since"] = parse_datetime(data["since"])` all the time, for every model out there.
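
      To be fair, the tag is there (section 2.4.1); the pain is that a decoder which only surfaces (tag, value) pairs still leaves the conversion to you. A rough sketch in Python, with made-up helper names, of what that fix-up looks like:

        from datetime import datetime, timezone

        def from_epoch_tag(tag, value):
            # CBOR tag 1: numeric epoch time, integer or float (RFC 7049, 2.4.1).
            if tag == 1:
                return datetime.fromtimestamp(value, tz=timezone.utc)
            return value

        # Without library support, every model repeats this kind of fix-up:
        row = {"since": from_epoch_tag(1, 1363896240.5)}  # float keeps fractional seconds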

Looks a good spec, great as a way of sending data to 'Internet Of Things' style devices where processing power and possibly bandwidth are issues.

Can anyone enlighten me on why number equivalency is a good idea? The spec says that even if you're expecting an integer like 0, encoders can decide to use floating point, and things should just work. One of the first statements is that "7" should be able to be represented in multiple ways. That doesn't seem concise.

  • Hmm, that's not the impression I got. I don't think they're arguing you should use multiple encodings willy-nilly. Rather, they're avoiding mandating exactly one encoding for every input (maximum flexibility in the spec).

    From the spec:

       Of course, in real-world implementations, the encoder and the decoder
       will have a shared view of what should be in a CBOR data item.  For
       example, an agreed-to format might be "the item is an array whose
       first value is a UTF-8 string, second value is an integer, and
       subsequent values are zero or more floating-point numbers" or "the
       item is a map that has byte strings for keys and contains at least
       one pair whose key is 0xab01".

    7 is 7 whether it's uint_8 or uint_32, right?
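
    Right - to make the point concrete, all of the following are valid CBOR items a receiver expecting "7" might see (hand-assembled bytes; the first is the preferred shortest form):

      encodings_of_7 = [
          bytes.fromhex("07"),                  # major type 0, value in the initial byte
          bytes.fromhex("1807"),                # uint8 argument
          bytes.fromhex("190007"),              # uint16 argument
          bytes.fromhex("1a00000007"),          # uint32 argument
          bytes.fromhex("fb401c000000000000"),  # 7.0 as a 64-bit float
      ]
      # The first four decode to the integer 7; the last decodes to the float 7.0,
      # which is the case the parent comment is worried about.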

    • Also, there's actually a much more relevant section later on in the spec (just got to p18):

         For constrained
         applications, where there is a choice between representing a specific
         number as an integer and as a decimal fraction or bigfloat (such as
         when the exponent is small and non-negative), there is a quality-of-
         implementation expectation that the integer representation is used
         directly.

I would really love to see a convergence of such binary formats; I hate that choosing between Google's Protocol Buffers, Apache (Facebook) Thrift, etc. forces you down a very specific path of non-interoperable libraries.

I would like to see how this compares to other formats with respect to serialised size...

Any JSON object encoding format would greatly benefit from compression. It does not have to be complicated: even something as simple as a dictionary array of "symbols", whose indices are used instead of repeating string values, would help.
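
Something like this (a toy sketch in Python, nothing CBOR-specific) is all it takes to stop repeating key strings in every record:

    def pack(records):
        """Replace repeated string keys with indices into a shared symbol table."""
        symbols, index = [], {}
        def sym(key):
            if key not in index:
                index[key] = len(symbols)
                symbols.append(key)
            return index[key]
        packed = [{sym(k): v for k, v in rec.items()} for rec in records]
        return {"symbols": symbols, "records": packed}

    pack([{"name": "a", "size": 1}, {"name": "b", "size": 2}])
    # -> {'symbols': ['name', 'size'], 'records': [{0: 'a', 1: 1}, {0: 'b', 1: 2}]}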

This looks like a fairly well-designed format. My main concern is that this seems to have suddenly appeared out of nowhere and gone directly to RFC. (Presumably there was an Internet-Draft, but I have never seen anything about this before.)

In the examples, 0x3bffffffffffffffff decodes to -18446744073709551616, which doesn't fit into int64_t. Why didn't they switch to bignums after INT64_MIN (-9223372036854775808) instead? Seems a bit asymmetric.
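
The asymmetry falls straight out of the encoding rule - a quick check in Python:

    arg = int.from_bytes(bytes.fromhex("ffffffffffffffff"), "big")  # 2**64 - 1
    value = -1 - arg                        # major type 1 decodes as -(1 + argument)
    assert value == -18446744073709551616   # -2**64, well outside int64_t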

At least it doesn't copy the JSON's braindead idea to rule out NaNs and Infs...

I like it. Fighting the urge to write a parser for it in my language of choice.

  • I couldn't fight it off: https://github.com/michaelmior/pycobr

    Just got the encoder so far (without major type 6, i.e. tagging) and the code is pretty messy and possibly not 100% correct, but it's true that the amount of code required is pretty minimal.

    • Update: Fixed a bunch of bugs in the encoder and have a working decoder as well. Still no tagging, but you can encode/decode pretty much anything you could with a naive JSON implementation.
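
      For anyone curious what "pretty minimal" means in practice, here is a rough sketch of an encoder core for the JSON-ish subset (not the code from the repo above; no tags, no indefinite lengths):

        import struct

        def _head(major, arg):
            """Initial byte plus the shortest argument encoding."""
            if arg < 24:
                return bytes([(major << 5) | arg])
            for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
                if arg < (1 << (8 * struct.calcsize(fmt))):
                    return bytes([(major << 5) | info]) + struct.pack(fmt, arg)
            raise ValueError("argument too large")

        def encode(obj):
            if obj is False: return b"\xf4"
            if obj is True:  return b"\xf5"
            if obj is None:  return b"\xf6"
            if isinstance(obj, int):
                return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
            if isinstance(obj, float):
                return b"\xfb" + struct.pack(">d", obj)
            if isinstance(obj, bytes):
                return _head(2, len(obj)) + obj
            if isinstance(obj, str):
                data = obj.encode("utf-8")
                return _head(3, len(data)) + data
            if isinstance(obj, list):
                return _head(4, len(obj)) + b"".join(encode(x) for x in obj)
            if isinstance(obj, dict):
                return _head(5, len(obj)) + b"".join(encode(k) + encode(v) for k, v in obj.items())
            raise TypeError("cannot encode %r" % type(obj))

        encode({"a": [1, -2, "three"]}).hex()   # 'a16161830121657468726565'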