RFC 7049 - Concise Binary Object Representation (CBOR)

12 years ago (tools.ietf.org)

RFCs don't amount to much without adoption. The RFC database is full of protocols with grand designs and seemingly broad applicability. Look at the "Extensible Provisioning Protocol", EPP - http://tools.ietf.org/html/rfc5730 - a protocol "for the provisioning and management of objects stored in a shared central repository." It reads as a marvelously generic protocol for client-managed key-value data storage - maybe it's suited for caching systems, or cloud BLOB storage, or as an abstraction of Dropbox - but in reality it's just the protocol that internet domain registrars use to manage domain name registrations on a registry server: the nichest of niche applications, yet the subject of a dozen RFCs. It's not going to be picked up and supported by Hadoop or Dropbox or anybody else who needs client-managed object storage; they're going to stick with HTTP REST.

This CBOR format is being proposed by the VPN Consortium - presumably there's some specific VPN interoperability application they have in mind for this. In the meantime, everybody else will continue to use compressed JSON, or protocol buffers, or whatever other standards have good library support and interoperability and - crucially - adoption in their domain.

  • I agree with all the points you've made, and I haven't read this RFC beyond a quick skim, but consider:

    - a lot of the time, a dearth of implementations of a new Thing is not because the new Thing is bad, but simply because people are change-averse and lazy, even in the face of an objectively better Thing, and

    - I still consider this a quality submission; even if CBOR doesn't get adopted it's still neat to read. It's like watching one's government draft new legislation, except more relevant.

There is one significant problem I see:

the length field for compound types (arrays and maps) specifies the length in "the number of items", not in bytes. This means that while processing, if I need to skip a compound type, I actually need to process it in its entirety. Not very "small device" friendly.

In practice, I have found far more utility in knowing the byte-length of a compound field in advance than the number of items it contains. If I am interested in the field, I am going to find out the number of items anyway, because I am going to process it. If I am not interested in the field, the number of items is useless to me, but the byte-length would have come in handy.
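For what it's worth, here is a minimal sketch (Python, hand-rolled, the names are mine; indefinite lengths and well-formedness checks left out) of what skipping costs when only item counts are available:

    def read_head(buf, pos):
        """Decode the initial byte and argument of the item starting at pos."""
        ib = buf[pos]
        major, info = ib >> 5, ib & 0x1f
        if info < 24:
            return major, info, pos + 1
        n = {24: 1, 25: 2, 26: 4, 27: 8}[info]      # indefinite lengths not handled
        return major, int.from_bytes(buf[pos + 1:pos + 1 + n], "big"), pos + 1 + n

    def skip_item(buf, pos):
        """Return the position just past the data item starting at pos."""
        major, arg, pos = read_head(buf, pos)
        if major in (0, 1, 7):                      # ints, simple values, floats
            return pos
        if major in (2, 3):                         # strings carry a byte length: one hop
            return pos + arg
        if major == 6:                              # tag wraps exactly one item
            return skip_item(buf, pos)
        count = arg * 2 if major == 5 else arg      # maps hold key/value pairs
        for _ in range(count):                      # arrays/maps: must walk every member
            pos = skip_item(buf, pos)
        return pos

    skip_item(bytes.fromhex("83010203"), 0)         # the array [1, 2, 3] -> 4 bytes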

  • I think the thinking here is that the sender may not be able to compute the byte size of the object a priori. Think HTTP chunked encoding.

    • I understand that is a concern in many situations. The problem here, though, is that you don't get the "streaming" benefits either way: you still have to include the length-in-number-of-items of the compound type, plus the lengths of each individual member item.

  • I agree. I think CBOR trades off a bit too much efficiency of in-place data access for compactness of representation.

    • Isn't the whole point of binary serialization formats efficiency and ease of parsing? Otherwise you might as well use .json.gz and probably end up with smaller files anyway.

  • Well, just write the file type to deduce the size, then.

    I'm not sure I understand the problem you describe, really.

    Even if there are strings, just encode their lengths; and if you store a compound type, write the size whenever it can vary.

Avoiding the need for protocol version negotiation might be a useful feature in some systems, but it seems to me that the things you lose make it really not worth it. In particular, a protocol without atoms invariably ends up like most JSON APIs -- very 'stringly typed', somewhat poorly defined, and verbose on the wire.

Which is strange for a thing calling itself 'concise'.

  • It does seem an odd trade-off. Having key-value pairs is great for prototyping, and the keys make it easier for people to interpret the messages and to write code that uses them. On the other hand, repeatedly sending readable key names seems a huge waste. I guess when streaming you could send a header with a map in it, but that makes things complicated....

Anyone care to comment on where we might use such a thing? Is it already in use? And does it compare favourably with BSON?

  • In Appendix E of the spec, "Comparison of Other Binary Formats to CBOR's Design Objectives", there are several comparisons, including one of BSON:

       [BSON] is a data format that was developed for the storage of JSON-
       like maps (JSON objects) in the MongoDB database.  Its major
       distinguishing feature is the capability for in-place update,
       foregoing a compact representation.  BSON uses a counted
       representation except for map keys, which are null-byte terminated.
       While BSON can be used for the representation of JSON-like objects on
       the wire, its specification is dominated by the requirements of the
       database application and has become somewhat baroque.  The status of
       how BSON extensions will be implemented remains unclear.

  • It looks closer to msgpack if anything, but with actual strings and bytes.

    • It is inspired by msgpack. If you read the RFC then you will see that the authors like msgpack but have some different requirements.

This gets a surprising number of things right. I've worked on a couple of these. In particular I'm delighted to see both the definite and indefinite streams of things.

I'm a little bit tired (well, more than a little tired) of standards that aren't couched in terms that are directly executable. English descriptions and pseudo-code are fine, but in the end I want to have some working code that implements an API for the stuff. It doesn't have to be an official API, but something usable shows me that (a) it is indeed usable, and (b) it will go a long way towards heading off other people's mistakes.

We don't do crypto without test vectors. I don't know why we think we can do other complex standards without test vectors, either. (I worked on some verification suites at NBS / NIST in the 70s. Have we lost that practice?)

I think that much of what is busted on the modern web can be traced back to loose English and a lack of reference code (even stuff with placeholders). CSS, HTML, etc., I'm looking at you... :-/

  • > In particular I'm delighted to see both the definite and indefinite streams of things.

    Why? I can see the advantages of either one, but I don't see what having both gets you.

    In my experience the implementation advantages of having length-prefixed lists disappear if you have to support indefinite lengths anyway.

    • I want to use the same data structures for

      - Passing small messages around

      - Doing streaming of large content (occasionally)

      I'm probably doing these over different pipes, but the data shares a lot of the same characteristics and I don't want to use two totally different APIs to get the job done.

      "Large" can be "I need to transfer something on the order of megabytes using a 4K intermediate buffer."

They should do the world a favor and include a datetime type.

  • Indeed, this is a known terrible mistake, easily avoided.

    The lack of the string "UUID" in the RFC is also cause for concern.

  • They have a tag for date strings, or you can use seconds from epoch, as an integer _or floating-point_. So if you actually want to represent time with proper fractional seconds, you're stuck representing them as strings. Hardly concise.

  • What's terribly wrong with http://tools.ietf.org/html/rfc7049#section-2.4.1?

    • The minor issues are missing timezone and precision information.

      But, most importantly, use of integers for datetime values hides type-level semantics. It's just integers, and you, the end user, not the deserializer, are responsible for handling the types.

      I think it's quite inconvenient to do tons of `data["since"] = parse_datetime(data["since"])` all the time, for every model out there.
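
      To be fair, the tag is there (section 2.4.1); the pain is that a decoder which only surfaces (tag, value) pairs still leaves the conversion to you. A rough sketch in Python, with made-up helper names, of what that fix-up looks like:

        from datetime import datetime, timezone

        def from_epoch_tag(tag, value):
            # CBOR tag 1: numeric epoch time, integer or float (RFC 7049, 2.4.1).
            if tag == 1:
                return datetime.fromtimestamp(value, tz=timezone.utc)
            return value

        # Without library support, every model repeats this kind of fix-up:
        row = {"since": from_epoch_tag(1, 1363896240.5)}  # float keeps fractional seconds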

Looks a good spec, great as a way of sending data to 'Internet Of Things' style devices where processing power and possibly bandwidth are issues.

Can anyone enlighten me on why number equivalency is a good idea? The spec says that even if you're expecting an integer like 0, encoders can decide to use floating point, and things should just work. One of the first statements is that "7" should be able to be represented in multiple ways. That doesn't seem concise.

  • Hmm, that's not the impression I got. I don't think they're arguing you should use multiple encodings willy-nilly. Rather, they're avoiding mandating exactly one encoding for every input (maximum flexibility in the spec).

    From the spec:

       Of course, in real-world implementations, the encoder and the decoder
       will have a shared view of what should be in a CBOR data item.  For
       example, an agreed-to format might be "the item is an array whose
       first value is a UTF-8 string, second value is an integer, and
       subsequent values are zero or more floating-point numbers" or "the
       item is a map that has byte strings for keys and contains at least
       one pair whose key is 0xab01".

    7 is 7 whether it's uint_8 or uint_32, right?
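
    Right - to make the point concrete, all of the following are valid CBOR items a receiver expecting "7" might see (hand-assembled bytes; the first is the preferred shortest form):

      encodings_of_7 = [
          bytes.fromhex("07"),                  # major type 0, value in the initial byte
          bytes.fromhex("1807"),                # uint8 argument
          bytes.fromhex("190007"),              # uint16 argument
          bytes.fromhex("1a00000007"),          # uint32 argument
          bytes.fromhex("fb401c000000000000"),  # 7.0 as a 64-bit float
      ]
      # The first four decode to the integer 7; the last decodes to the float 7.0,
      # which is the case the parent comment is worried about.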

    • Also, there's actually a much more relevant section later on in the spec (just got to p18):

         For constrained
         applications, where there is a choice between representing a specific
         number as an integer and as a decimal fraction or bigfloat (such as
         when the exponent is small and non-negative), there is a quality-of-
         implementation expectation that the integer representation is used
         directly.

I would really love to see a convergence of such binary formats; I hate that choosing between Google's Protocol Buffers, Apache (Facebook) Thrift, etc. forces you down a very specific path of non-interoperable libraries.

I would like to see how this compares to other formats with respect to serialised size...

Any JSON object encoding format would greatly benefit from compression. It does not have to be complicated: even something as simple as a dictionary array of "symbols", whose indices are used instead of repeating string values, would help.
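
Something like this (a toy sketch in Python, nothing CBOR-specific) is all it takes to stop repeating key strings in every record:

    def pack(records):
        """Replace repeated string keys with indices into a shared symbol table."""
        symbols, index = [], {}
        def sym(key):
            if key not in index:
                index[key] = len(symbols)
                symbols.append(key)
            return index[key]
        packed = [{sym(k): v for k, v in rec.items()} for rec in records]
        return {"symbols": symbols, "records": packed}

    pack([{"name": "a", "size": 1}, {"name": "b", "size": 2}])
    # -> {'symbols': ['name', 'size'], 'records': [{0: 'a', 1: 1}, {0: 'b', 1: 2}]}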

This looks like a fairly well-designed format. My main concern is that this seems to have suddenly appeared out of nowhere and gone directly to RFC. (Presumably there was an Internet-Draft, but I have never seen anything about this before.)

In the examples, 0x3bffffffffffffffff decodes to -18446744073709551616, which doesn't fit into int64_t. Why didn't they switch to bignums after INT64_MIN (-9223372036854775808) instead? Seems a bit asymmetric.
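
The asymmetry falls straight out of the encoding rule - a quick check in Python:

    arg = int.from_bytes(bytes.fromhex("ffffffffffffffff"), "big")  # 2**64 - 1
    value = -1 - arg                        # major type 1 decodes as -(1 + argument)
    assert value == -18446744073709551616   # -2**64, well outside int64_t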

At least it doesn't copy the JSON's braindead idea to rule out NaNs and Infs...

I like it. Fighting the urge to write a parser for it in my language of choice.

  • I couldn't fight it off: https://github.com/michaelmior/pycobr

    Just got the encoder so far (without major type 6, i.e. tagging) and the code is pretty messy and possibly not 100% correct, but it's true that the amount of code required is pretty minimal.

    • Update: Fixed a bunch of bugs in the encoder and have a working decoder as well. Still no tagging, but you can encode/decode pretty much anything you could with a naive JSON implementation.
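
      For anyone curious what "pretty minimal" means in practice, here is a rough sketch of an encoder core for the JSON-ish subset (not the code from the repo above; no tags, no indefinite lengths):

        import struct

        def _head(major, arg):
            """Initial byte plus the shortest argument encoding."""
            if arg < 24:
                return bytes([(major << 5) | arg])
            for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
                if arg < (1 << (8 * struct.calcsize(fmt))):
                    return bytes([(major << 5) | info]) + struct.pack(fmt, arg)
            raise ValueError("argument too large")

        def encode(obj):
            if obj is False: return b"\xf4"
            if obj is True:  return b"\xf5"
            if obj is None:  return b"\xf6"
            if isinstance(obj, int):
                return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
            if isinstance(obj, float):
                return b"\xfb" + struct.pack(">d", obj)
            if isinstance(obj, bytes):
                return _head(2, len(obj)) + obj
            if isinstance(obj, str):
                data = obj.encode("utf-8")
                return _head(3, len(data)) + data
            if isinstance(obj, list):
                return _head(4, len(obj)) + b"".join(encode(x) for x in obj)
            if isinstance(obj, dict):
                return _head(5, len(obj)) + b"".join(encode(k) + encode(v) for k, v in obj.items())
            raise TypeError("cannot encode %r" % type(obj))

        encode({"a": [1, -2, "three"]}).hex()   # 'a16161830121657468726565'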