Comment by timbray

1 year ago

Relevant: https://www.ietf.org/archive/id/draft-bray-unichars-15.html - IETF approved and will have an RFC number in a few weeks.

Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are the "characters" you shouldn't use. Includes a bunch of stuff the OP mentioned.

The most important bit of that is the “Unicode Assignables” subset <https://www.ietf.org/archive/id/draft-bray-unichars-15.html#...>:

  unicode-assignable =
     %x9 / %xA / %xD /               ; useful controls
     %x20-7E /                       ; exclude C1 controls and DEL
     %xA0-D7FF /                     ; exclude surrogates
     %xE000-FDCF /                   ; exclude FDD0 nonchars
     %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
     %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
     %x30000-3FFFD / %x40000-4FFFD /
     %x50000-5FFFD / %x60000-6FFFD /
     %x70000-7FFFD / %x80000-8FFFD /
     %x90000-9FFFD / %xA0000-AFFFD /
     %xB0000-BFFFD / %xC0000-CFFFD /
     %xD0000-DFFFD / %xE0000-EFFFD /
     %xF0000-FFFFD / %x100000-10FFFD

This is really helpful - thanks. I write a CRDT library for text editing. I should probably restrict the characters that I transport to the "Unicode Assignables" subset. I can't think of any sensible reason to let people insert characters like U+0000 into a collaborative text document.
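
For what it's worth, here is a minimal sketch of what such a filter could look like, assuming the ranges in the ABNF above are transcribed correctly. The names `is_unicode_assignable` and `first_disallowed` are made up for illustration and are not from the draft or from any library.

```python
from typing import Optional


def is_unicode_assignable(cp: int) -> bool:
    """Return True if code point `cp` falls in the unicode-assignable ranges."""
    if cp in (0x9, 0xA, 0xD):        # useful controls: tab, LF, CR
        return True
    if 0x20 <= cp <= 0x7E:           # printable ASCII (no C0 controls, no DEL)
        return True
    if 0xA0 <= cp <= 0xD7FF:         # skips C1 controls, stops before surrogates
        return True
    if 0xE000 <= cp <= 0xFDCF:       # stops before the FDD0 noncharacter block
        return True
    if 0xFDF0 <= cp <= 0xFFFD:       # stops before FFFE and FFFF
        return True
    if 0x10000 <= cp <= 0x10FFFF:    # planes 1-16: all but the last two code points
        return (cp & 0xFFFF) <= 0xFFFD
    return False


def first_disallowed(text: str) -> Optional[int]:
    """Return the index of the first code point outside the subset, or None."""
    for i, ch in enumerate(text):
        if not is_unicode_assignable(ord(ch)):
            return i
    return None
```

A CRDT insert path could call `first_disallowed` on incoming text and reject or strip the operation when it returns an index. Whether to reject or strip is a design choice the draft does not make for you.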