Comment by sevensor
4 days ago
That’s my point, though! I’ve run into popular JSON libraries that will emit all of those! 9007199254740993 is problematic because it’s not representable as a 64-bit float. Python’s JSON library is happy to write it, even though you need an int to represent it, and JSON doesn’t have ints.
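For instance, a quick sketch against CPython’s stdlib json module (the stdlib is the only thing assumed here):

    import json

    big = 9007199254740993  # 2**53 + 1: exact as a Python int, not as a binary64 float

    print(json.dumps(big))  # 9007199254740993 -- emitted verbatim
    print(float(big))       # 9007199254740992.0 -- rounds to the nearest binary64
    # Python round-trips it losslessly because it decodes back to int, but a
    # decoder that parses every number as binary64 (e.g. JavaScript's
    # JSON.parse) reads back 9007199254740992 instead.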
Edit: I didn’t see my thought all the way through here. Syntax typing invites this kind of nonconformity, because different programming languages mean different things by “number,” “string,” “date,” or even “null.” They will bend the format to match their own semantics, resulting in incompatibility.
> 9007199254740993 is problematic because it’s not representable as a 64-bit float. Python’s JSON library is happy to write it, even though you need an int to represent it
JSON numbers have unlimited range as far as the format standard is concerned, but implementations are explicitly permitted to set limits on the range and precision they generate and handle, and users are warned that:

> Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.
Also, you don't need an int to represent it: a wide enough int will, but so will unlimited-precision decimals, wide enough binary floats (among the standard formats, IEEE 754 binary128 works), and so on.
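For example, a minimal sketch using the stdlib json module's parse hooks to decode into unlimited-precision decimals:

    import json
    from decimal import Decimal

    doc = '{"n": 9007199254740993}'
    # parse_int/parse_float receive the raw number text, so nothing is rounded
    exact = json.loads(doc, parse_int=Decimal, parse_float=Decimal)
    print(exact["n"])      # 9007199254740993 -- a Decimal, not a float
    print(exact["n"] + 1)  # 9007199254740994 -- exact arithmetic, no rounding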
RFC 8259 is a good read, and I wish more people would make the effort. I really don’t mean to bash JSON here. It was a great idea and it continues to be a great idea, especially if you are using JavaScript. However, the passage you quote illustrates the same shortcoming I’m complaining about: RFC 8259 basically says “the valid primitive types in JSON are the valid primitive types in your programming language,” which in practice results in implementations like Python’s json library emitting invalid tokens like a bare NaN that can cause other decoders to choke.
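To make the NaN point concrete (stdlib json again; a minimal sketch):

    import json

    print(json.dumps(float("nan")))  # NaN -- a bare token RFC 8259 does not permit
    # Strict behavior is opt-in, not the default:
    try:
        json.dumps(float("nan"), allow_nan=False)
    except ValueError as err:
        print(err)  # Out of range float values are not JSON compliant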
I think what JSON gets right is that it gives us a universal way of expressing structure: arrays and objects map onto basic notions of sequence and association that are useful in many contexts and can be represented in a variety of ways by programming languages. My ideal data interchange format would stop there and let the user decide what to do with the value text after the structure has been decoded.
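The stdlib only approximates that idea for numbers (strings, booleans, and null are still decoded for you), but as a sketch of “decode the structure, hand the value text back to the user”:

    import json

    # Keep numeric value text verbatim; interpreting it becomes the caller's job.
    raw = json.loads(
        '{"id": 9007199254740993, "ratio": 0.1}',
        parse_int=str,    # identity on the integer text
        parse_float=str,  # identity on the float text
    )
    print(raw)  # {'id': '9007199254740993', 'ratio': '0.1'}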
Before your edit, I was going to object to your premise, because it seemed to say that a format could get worse just by having more implementations.
After your edit, I see that it's rather that syntax-typed formats are prone to this form of implementation divergence.
I don't think this is limited to syntax-typed formats, though. For example, TNetstrings[1] have type tags (sketched below), and "#" tags an integer. The specification requires that integers fit into 63 bits (since the reference encoder will refuse to encode a Python long), but implementations in C tend to allow 64 bits, and implementations in other languages allow bignums. It does explicitly allow "nan", "inf", and "-inf", FWIW.
1: https://tnetstrings.info/
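To illustrate, a minimal integer encoder (dump_int is a hypothetical name, not the reference implementation); the range check is exactly the spot where implementations diverge:

    # TNetstrings frame each element as LENGTH:DATA plus a one-byte type tag;
    # "#" tags an integer, so 12345 encodes as b"5:12345#".
    def dump_int(n: int) -> bytes:
        # Hypothetical bound: mirrors the reference encoder's refusal of
        # Python 2 longs; C implementations tend to accept the full 64 bits.
        if not (-(2**63) <= n < 2**63):
            raise ValueError("integer out of range for this encoder")
        data = str(n).encode("ascii")
        return str(len(data)).encode("ascii") + b":" + data + b"#"

    print(dump_int(12345))  # b'5:12345#'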
Agreed; I think there’s a problem with self-describing data as a concept. It just begs for implementation-defined weirdness.