Comment by mmastrac

1 year ago

That's brilliant, tbh. I guess the challenge is how you represent those in the decoded character space. Maybe they should allocate 128 characters somewhere and define them as "invalid byte values".

In my jq PR I used negative numbers to represent them (the original byte, negated), since they're already just using `int` to represent a decoded code point, and it's somewhat normal to return distinguishable errors as negative numbers in C. I think it would also make sense to represent the UTF-16 errors ("unpaired surrogates") as negative numbers, though I didn't make that change internally (maybe because they're already used elsewhere). I did make it so that they are represented as negatives in `explode` however, so `"\uD800" | explode` emits `[-0xD800]`.

In something other than C, I'd expect them to be distinguished as members of an enumeration or something, e.g.:

  enum DecodeResult {
    Ok(char),
    ErrUtf8(u8),    // invalid byte, 0x80..=0xFF
    ErrUtf16(u16),  // unpaired surrogate, 0xD800..=0xDFFF
  }
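
To make that concrete, here is a rough Rust sketch of how such an enum could be filled in by lenient decoders, and how it maps back onto the negative-number convention described above. The function names (`decode_utf8_lossless`, `decode_utf16_lossless`, `to_signed`) are made up for illustration and are not taken from jq or the PR. The byte-at-a-time error recovery is just one possible policy, assumed here for simplicity.

```rust
/// One decoded unit: a real character, or the raw value that failed to decode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DecodeResult {
    Ok(char),
    ErrUtf8(u8),   // invalid byte, 0x80..=0xFF
    ErrUtf16(u16), // unpaired surrogate, 0xD800..=0xDFFF
}

/// Hypothetical helper: decode UTF-8, reporting every byte that is not part
/// of a well-formed sequence as its own `ErrUtf8` entry.
fn decode_utf8_lossless(mut bytes: &[u8]) -> Vec<DecodeResult> {
    let mut out = Vec::new();
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                out.extend(s.chars().map(DecodeResult::Ok));
                break;
            }
            Err(e) => {
                // Everything before `valid_up_to()` is well-formed UTF-8.
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                let s = std::str::from_utf8(valid).expect("prefix is valid UTF-8");
                out.extend(s.chars().map(DecodeResult::Ok));
                // The byte at the error position is always >= 0x80, because
                // ASCII bytes are valid on their own. Record it and resume
                // decoding right after it.
                out.push(DecodeResult::ErrUtf8(rest[0]));
                bytes = &rest[1..];
            }
        }
    }
    out
}

/// Hypothetical helper: decode UTF-16, keeping unpaired surrogates as errors.
fn decode_utf16_lossless(units: &[u16]) -> Vec<DecodeResult> {
    char::decode_utf16(units.iter().copied())
        .map(|r| match r {
            Ok(c) => DecodeResult::Ok(c),
            Err(e) => DecodeResult::ErrUtf16(e.unpaired_surrogate()),
        })
        .collect()
}

/// Map back to the C-style convention: errors become the negated raw value.
fn to_signed(r: DecodeResult) -> i32 {
    match r {
        DecodeResult::Ok(c) => c as u32 as i32,
        DecodeResult::ErrUtf8(b) => -(b as i32),
        DecodeResult::ErrUtf16(u) => -(u as i32),
    }
}
```

With this, `decode_utf8_lossless(b"a\xFFb")` gives `Ok('a')`, `ErrUtf8(0xFF)`, `Ok('b')`, and running `to_signed` over a lone `0xD800` unit gives `-0xD800`, matching the `explode` behaviour mentioned earlier. The two error ranges don't overlap once negated, so the signed form stays unambiguous.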