Comment by maxdamantus

1 year ago

I came up with a scheme a number of years ago that takes advantage of the illegality of overlong encodings [0].

A UTF-8 code unit is one byte, so there are 256 possible values (<00> to <FF>). 128 of them are always valid within a UTF-8 string (ASCII, <00> to <7F>), leaving 128 values (<80> to <FF>) that can appear as invalid bytes in a UTF-8 string.

There also happen to be exactly 128 2-byte overlong representations (overlong representations of ASCII characters).
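The count is easy to check by brute force. Here is a small sketch (the function name is just something I made up for illustration) that walks every 2-byte lead/continuation pair and counts the ones whose decoded value would have fit in a single byte:

```rust
/// Counts the 2-byte sequences (lead 0xC0..=0xDF, continuation 0x80..=0xBF)
/// whose payload is below 0x80, i.e. the overlong forms of ASCII.
/// Hypothetical helper name, for illustration only.
fn count_two_byte_overlongs() -> usize {
    let mut count = 0;
    for lead in 0xC0u32..=0xDF {
        for cont in 0x80u32..=0xBF {
            // 2-byte layout: 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy
            let cp = ((lead & 0x1F) << 6) | (cont & 0x3F);
            if cp < 0x80 {
                count += 1;
            }
        }
    }
    count
}
```

Only the lead bytes <C0> and <C1> qualify, each with 64 possible continuation bytes, which gives 2 × 64 = 128.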

Basically, any byte in some input that can't be interpreted as valid UTF-8 can be replaced with a 2-byte overlong representation. This can be used as an extension of WTF-8 so that UTF-16 and UTF-8 errors can both be stored in the same stream. I called the encoding WTF-8b [2], though I'd be interested to know if someone else has come up with the same scheme.
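A rough sketch of the idea in Rust, handling only the UTF-8 error half (not the WTF-8 surrogate half). The exact byte-to-overlong mapping here is an assumption made for illustration (byte <80>+v is stored as the overlong form of v) and is not necessarily the mapping used in the jq PR; all function names are hypothetical.

```rust
/// Assumed mapping: an invalid byte 0x80 + v becomes the 2-byte
/// overlong encoding of v (0xC0/0xC1 lead, then one continuation byte).
fn escape_byte(b: u8) -> [u8; 2] {
    debug_assert!(b >= 0x80);
    let v = b & 0x7F;
    [0xC0 | (v >> 6), 0x80 | (v & 0x3F)]
}

/// Inverse of `escape_byte`; returns None if the pair is not an escape.
fn unescape_pair(lead: u8, cont: u8) -> Option<u8> {
    if (lead == 0xC0 || lead == 0xC1) && (cont & 0xC0) == 0x80 {
        Some(0x80 | ((lead & 0x01) << 6) | (cont & 0x3F))
    } else {
        None
    }
}

/// Copies valid UTF-8 through unchanged and escapes every invalid byte.
fn escape_invalid(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut rest = input;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.extend_from_slice(s.as_bytes());
                break;
            }
            Err(e) => {
                let (valid, bad) = rest.split_at(e.valid_up_to());
                out.extend_from_slice(valid);
                // error_len() is None when the input ends mid-sequence;
                // in that case every remaining byte is part of the error.
                let n = e.error_len().unwrap_or(bad.len());
                for &b in &bad[..n] {
                    out.extend_from_slice(&escape_byte(b));
                }
                rest = &bad[n..];
            }
        }
    }
    out
}

/// Restores the original bytes from the output of `escape_invalid`.
fn unescape(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if i + 1 < input.len() {
            if let Some(b) = unescape_pair(input[i], input[i + 1]) {
                out.push(b);
                i += 2;
                continue;
            }
        }
        out.push(input[i]);
        i += 1;
    }
    out
}
```

Since <C0> and <C1> never occur in well-formed UTF-8, the escapes can't collide with valid text, and an input that already contains overlong sequences still round-trips because each of its bytes is escaped individually.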

This should technically be "fine" WRT Unicode text processing, since it involves transforming invalid Unicode into other invalid Unicode. This principle is already used by WTF-8.

I used it to improve preservation of invalid Unicode (ie, random 8-bit data in UTF-8 text or random 16-bit data in JSON strings) in jq, though I suspect the PR [1] won't be accepted. I still find the changes very useful personally, so maybe I'll come up with a different approach some time.

[0] https://github.com/Maxdamantus/jq/blob/911d01aaa5bd33137fadf...

[1] https://github.com/jqlang/jq/pull/2314

[2] I think I used the name "WTF-8b" as an allusion to UTF-8b/surrogateescape/PEP-383 which also encodes ill-formed UTF-8, though UTF-8b is less efficient storage-wise and is not compatible with WTF-8.

2 comments

mmastrac 1 year ago

That's brilliant, tbh. I guess the challenge is how you represent those in the decoded character space. Maybe they should allocate 128 characters somewhere and define them as "invalid byte values".

maxdamantus 1 year ago
In my jq PR I used negative numbers to represent them (the original byte, negated), since they're already just using `int` to represent a decoded code point, and it's somewhat normal to return distinguishable errors as negative numbers in C. I think it would also make sense to represent the UTF-16 errors ("unpaired surrogates") as negative numbers, though I didn't make that change internally (maybe because they're already used elsewhere). I did make it so that they are represented as negatives in `explode` however, so `"\uD800" | explode` emits `[-0xD800]`.
In something other than C, I'd expect them to be distinguished as members of an enumeration or something, e.g.:

  enum DecodeResult {
      Ok(char),
      ErrUtf8(u8),   // 0x80..0xFF
      ErrUtf16(u16), // 0xD800..0xDFFF
  }
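To relate the two representations, here is a sketch that maps the negative-number convention onto such an enumeration. The `classify` function is hypothetical and not taken from the jq PR; it assumes both kinds of error arrive as the negated original value.

```rust
#[derive(Debug, PartialEq)]
enum DecodeResult {
    Ok(char),
    ErrUtf8(u8),   // original byte, 0x80..=0xFF
    ErrUtf16(u16), // unpaired surrogate, 0xD800..=0xDFFF
}

/// Hypothetical adapter from a C-style signed result to the enum.
/// Returns None for values that fit neither convention.
fn classify(n: i32) -> Option<DecodeResult> {
    match n {
        // char::from_u32 rejects surrogates, so a non-negative
        // 0xD800..=0xDFFF yields None rather than Ok.
        0..=0x10FFFF => char::from_u32(n as u32).map(DecodeResult::Ok),
        -0xFF..=-0x80 => Some(DecodeResult::ErrUtf8((-n) as u8)),
        -0xDFFF..=-0xD800 => Some(DecodeResult::ErrUtf16((-n) as u16)),
        _ => None,
    }
}
```

For example, `classify(-0xD800)` gives `Some(DecodeResult::ErrUtf16(0xD800))`, which corresponds to the `-0xD800` that `explode` emits above.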