Comment by kstenerud

1 month ago

"a\u0000b" ("a" followed by a vertical tabulation control code) is also a perfectly valid and in-bounds BONJSON string. What BONJSON rejects is any invalid UTF-8 sequences, which shouldn't even be present in the data to begin with.

8 comments

wizzwizz4 1 month ago

You're thinking of "a\u000b". "a\u0000b" is the three-character string also written "a\x00b".
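The difference between the two escapes can be checked with any JSON parser. A minimal sketch using Python's standard `json` module (the variable names are only for illustration):

```python
import json

# \u0000 followed by a literal "b": three characters, NUL in the middle.
with_nul = json.loads(r'"a\u0000b"')

# \u000b is one escape for U+000B (vertical tab): two characters.
with_vt = json.loads(r'"a\u000b"')

print(len(with_nul), [hex(ord(c)) for c in with_nul])
print(len(with_vt), [hex(ord(c)) for c in with_vt])
```

The first string has length 3 and the second has length 2, so a single missing or extra digit changes which string is meant.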

kstenerud 1 month ago

Bleh... This is why my text formats use \[10c0de] to escape unicode codepoints. Much easier for humans to parse.

esrauch 1 month ago

My example was a three-character string whose second character is \u0000, that is, a NUL character in the middle of the string.

The spec on GitHub says that including NUL is banned on security grounds: after parsing, someone might call strlen on the result in C and accidentally truncate it to a shorter string.

I think that concern has some basis, but NUL is valid string content in JSON (and in UTF-8), so the spec is deliberately breaking 1:1 parity with JSON in the name of a security hypothetical.
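To make both halves of that concrete, here is a rough sketch in Python. It uses `ctypes` to imitate how C code would see the bytes as a NUL-terminated string; the variable names are illustrative and nothing here is taken from a BONJSON implementation.

```python
import ctypes
import json

text = json.loads(r'"a\u0000b"')   # legal JSON, three characters
encoded = text.encode("utf-8")     # U+0000 encodes as the single byte 0x00

# A ctypes buffer's .value reads up to the first NUL byte, which is
# the same view a C strlen()-based consumer would have.
buf = ctypes.create_string_buffer(encoded)
c_view = buf.value

# The JSON side round-trips the string without loss.
round_tripped = json.dumps(text)

print(len(encoded), len(c_view), round_tripped)
```

The UTF-8 encoding is 3 bytes, but the C-string view stops after the first byte. That is the truncation the spec worries about, and the round trip through `json` shows why the string is still in bounds for JSON itself.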

kstenerud 1 month ago
The spec says that implementations must disallow NUL by default (that is, the default configuration must reject it). https://github.com/kstenerud/bonjson/blob/main/bonjson.md#nu...
Users can of course enable NUL in the rare cases where they need it, but I want safe defaults.
Actually, I'll make that section clearer.
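A "reject by default, opt in to allow" policy could look roughly like the sketch below. This is only an illustration of the idea: `check_string`, `allow_nul`, and `NulNotAllowedError` are hypothetical names, not part of the BONJSON spec or any real library.

```python
class NulNotAllowedError(ValueError):
    """Raised when a string contains U+0000 and NUL is not enabled."""


def check_string(value: str, allow_nul: bool = False) -> str:
    # Hypothetical policy check: the safe default rejects NUL, and the
    # caller has to opt in explicitly to accept it.
    if not allow_nul and "\x00" in value:
        raise NulNotAllowedError("string contains NUL (U+0000)")
    return value


print(check_string("a\x00b", allow_nul=True) == "a\x00b")
try:
    check_string("a\x00b")
except NulNotAllowedError as err:
    print("rejected:", err)
```

With a check like this, both the encoding side and the decoding side would need the option enabled before a NUL-containing string makes it through.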
- esrauch 1 month ago
  
  So I think it's a very neat format, but my feedback as a random person on the Internet is that I don't think it ends up upholding the claimed vision of being 1:1 with JSON (because of the security parts, but also because you do end up adding extra types), and that's a bit of a shame compared to the top-line deliverable.
  Just focusing narrowly on the \0 part to explain why I say so: the proposed spec is that implementations have to either hard-ban embedded \0 or disallow it by default with an opt-in. So if someone comes with a dataset that has it, they can get support only if they configure both the serializer and the parser to allow it. But if you're willing to exert that level of special-case control, I think all of the other preexisting binary-JSON implementations meet the top-line definition you are setting as well. For a binary-JSON implementation that has additional types, someone in full end-to-end control could just choose not to use those types; the mere existence of extra types in the binary format is no more of a "problem" for 1:1 than this choice.
  IMO the deliverable that a 1:1 mapping would give us is "there is no BONJSON data that won't losslessly round trip to JSON, and vice versa". The benefit applies to all the future data that you haven't seen yet. The downside of using something that is not bijective is that you run for a long time and then suddenly have data-dependent failures in your system, because you can't 1:1 map legal data.
  And especially with this guarantee, what will inevitably happen is that some downstream handling code will take it as a given that it can call strlen(), since its authors "knew" the BONJSON spec banned NUL. So when NUL suddenly shows up as in-bounds data, you also won't be able to trivially flip the switch. Instead you are stuck with legal JSON that you can't ingest into your system without an expensive audit, because the reduction from 1:1 has become entrenched as an invariant in the handling code.
  Note that my vantage point might be a bit skewed here: I work on Protobuf, and this kind of ecosystem interoperability topic is top of mind for me in ways that it doesn't necessarily need to be for small projects. I also recognize that "what even is legal JSON" is itself not completely clear, so take it all with a grain of salt (and again, I do think it looks like a very nice encoding in general).
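The data-dependent failure described above can be checked for ahead of time. As a rough sketch, the hypothetical helper below (`find_nul_strings` is an illustrative name, not an existing API) walks an already-parsed JSON document and reports where strings contain U+0000, which are the values a NUL-rejecting default would refuse.

```python
import json


def find_nul_strings(node, path="$"):
    """Yield a path for every string key or value that contains U+0000."""
    if isinstance(node, str):
        if "\x00" in node:
            yield path
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_nul_strings(item, f"{path}[{i}]")
    elif isinstance(node, dict):
        for key, item in node.items():
            if "\x00" in key:
                yield f"{path}.<key with NUL>"
            yield from find_nul_strings(item, f"{path}.{key}")


doc = json.loads(r'{"name": "ok", "tags": ["x", "a\u0000b"]}')
offenders = list(find_nul_strings(doc))
print(offenders)
```

For this document the only path reported is `$.tags[1]`. A scan like this only covers data you already have, which is the limitation the comment points at: it says nothing about data that arrives later.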
  

gritzko 1 month ago

Did you read "Parsing JSON is a minefield"?