Comment by mmastrac

1 year ago

That's brilliant, tbh. I guess the challenge is how you represent those in the decoded character space. Maybe they should allocate 128 characters somewhere and define them as "invalid byte values".

maxdamantus 1 year ago

In my jq PR I used negative numbers to represent them (the original byte, negated), since they're already just using `int` to represent a decoded code point, and it's somewhat normal to return distinguishable errors as negative numbers in C. I think it would also make sense to represent the UTF-16 errors ("unpaired surrogates") as negative numbers, though I didn't make that change internally (maybe because they're already used elsewhere). I did make it so that they are represented as negatives in `explode` however, so `"\uD800" | explode` emits `[-0xD800]`.
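To make the negative-number convention concrete, here is a rough sketch of the idea, written in Rust rather than the PR's C. The names `decode_one` and `explode_lossless` are hypothetical and are not taken from jq; the only assumption carried over is that a non-negative value is a code point and a negative value is the offending byte, negated.

```rust
/// Decode the first item of `bytes` (hypothetical helper, not jq's code).
/// Returns (value, bytes consumed): value >= 0 is a code point,
/// value < 0 is the negated invalid byte (always in -0xFF..=-0x80).
fn decode_one(bytes: &[u8]) -> Option<(i32, usize)> {
    let first = *bytes.first()?;
    // The shortest prefix that is valid UTF-8 is exactly one code point.
    for len in 1..=bytes.len().min(4) {
        if let Ok(s) = std::str::from_utf8(&bytes[..len]) {
            let c = s.chars().next()?;
            return Some((c as i32, len));
        }
    }
    // No valid prefix: report the first byte as an error and skip it.
    Some((-i32::from(first), 1))
}

/// Decode a whole byte string into code points and negated error bytes.
fn explode_lossless(bytes: &[u8]) -> Vec<i32> {
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some((value, used)) = decode_one(&bytes[pos..]) {
        out.push(value);
        pos += used;
    }
    out
}
```

With this sketch, `explode_lossless(b"a\xFFb")` gives `[97, -255, 98]`, so the original bytes can be rebuilt exactly by re-encoding the non-negative values and emitting the negated ones as raw bytes.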

In something other than C, I'd expect them to be distinguished as members of an enumeration or something similar, e.g.:

  enum DecodeResult {
    Ok(char),
    ErrUtf8(u8), // 0x80..0xFF
    ErrUtf16(u16), // 0xD800..0xDFFF
  }
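
For what it's worth, a decoder producing that kind of enum could look roughly like the following in Rust. This is only a sketch: the function names `decode_utf8_lossless` and `decode_utf16_lossless` are made up for illustration, and the error granularity (one error per invalid byte or unpaired surrogate) is an assumption. It leans on the standard library's `std::str::from_utf8` and `char::decode_utf16`.

```rust
#[derive(Debug, PartialEq)]
enum DecodeResult {
    Ok(char),
    ErrUtf8(u8),   // invalid byte, always >= 0x80
    ErrUtf16(u16), // unpaired surrogate, 0xD800 through 0xDFFF
}

/// Decode bytes, keeping each invalid byte as an `ErrUtf8` item.
fn decode_utf8_lossless(bytes: &[u8]) -> Vec<DecodeResult> {
    let mut out = Vec::new();
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.extend(s.chars().map(DecodeResult::Ok));
                break;
            }
            Err(e) => {
                // Everything before `valid_up_to` is well-formed UTF-8.
                let valid = e.valid_up_to();
                let s = std::str::from_utf8(&rest[..valid]).unwrap();
                out.extend(s.chars().map(DecodeResult::Ok));
                // Record the single offending byte and resume after it.
                out.push(DecodeResult::ErrUtf8(rest[valid]));
                rest = &rest[valid + 1..];
            }
        }
    }
    out
}

/// Decode UTF-16 code units, keeping unpaired surrogates as `ErrUtf16`.
fn decode_utf16_lossless(units: &[u16]) -> Vec<DecodeResult> {
    char::decode_utf16(units.iter().copied())
        .map(|r| match r {
            Ok(c) => DecodeResult::Ok(c),
            Err(e) => DecodeResult::ErrUtf16(e.unpaired_surrogate()),
        })
        .collect()
}
```

The upside compared with negative numbers is that the compiler forces callers to handle each error kind explicitly, and the two error ranges cannot be confused with each other or with a real code point.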