Comment by kccqzy

15 hours ago

Indexing into a Unicode string is a highly unusual operation that is rarely needed. A string is Unicode because it is provided by the user or is a localized user-facing string. You don't generally need indices.

Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.

They can happen to _be_ Unicode. Composition operations (for fully terminated Unicode strings) should work, but require eventual normalization.
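
To make the normalization point concrete, here is a minimal Python sketch. It assumes NFC is the form you want, and `concat_normalized` is a made-up helper name, not anything from a library. Concatenating two well-formed strings always yields a well-formed string, but a combining mark at the seam can leave the result un-normalized:

```python
import unicodedata


def concat_normalized(a: str, b: str) -> str:
    """Concatenate two strings, then normalize the result to NFC.

    Hypothetical helper for illustration; NFC is an assumed choice.
    """
    return unicodedata.normalize("NFC", a + b)


left = "cafe"
right = "\u0301 au lait"  # starts with U+0301 COMBINING ACUTE ACCENT

joined = left + right                       # valid, but not normalized
normalized = concat_normalized(left, right)  # "e" + U+0301 becomes U+00E9

print(len(joined), len(normalized))
```

The plain concatenation is perfectly usable; it just won't compare equal to the precomposed spelling until it is normalized.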

That's the other benefit of being able to resume UTF-8 strings midway: even concatenating broken strings still leaves all the good characters present.
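
A rough sketch of that resynchronization in Python, assuming you are fine with lossy decoding (`errors="replace"`); `lossy_join` is a hypothetical name used only here. A chunk cut off mid-character costs you that one character, and the decoder picks up again at the next lead byte:

```python
def lossy_join(*chunks: bytes) -> str:
    """Join byte chunks and decode as UTF-8, replacing invalid sequences.

    Hypothetical helper; lossy decoding is an assumption of this sketch.
    """
    return b"".join(chunks).decode("utf-8", errors="replace")


broken = "naïve".encode("utf-8")[:3]  # b'na\xc3': the 'ï' is cut in half
intact = "日本語".encode("utf-8")

text = lossy_join(broken, intact)
print(text)
```

Only the truncated character turns into U+FFFD; everything before and after it decodes normally.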

Substring operations are dicier; those should operate on known strings. In pathological cases they might operate on fragments of a code point... but that's as silly as using raw pointers and directly mangling the bytes without any protection or design plans.
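
For the pathological case, a small Python sketch, assuming the input is valid UTF-8 and that code point boundaries are good enough (it does not respect grapheme clusters); `truncate_utf8` is a hypothetical helper. A blind byte slice can land inside a character, while backing up past continuation bytes (`0b10xxxxxx`) avoids that:

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut valid UTF-8 bytes to at most `limit` bytes on a code point boundary.

    Hypothetical helper; assumes `data` is well-formed UTF-8.
    """
    if limit >= len(data):
        return data
    end = max(limit, 0)
    # If the first excluded byte is a continuation byte, the cut would fall
    # inside a multi-byte character, so back up to that character's lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'

naive = data[:2]               # b'h\xc3': ends mid-character
safe = truncate_utf8(data, 2)  # backs up to b'h'

try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(naive_ok, safe.decode("utf-8"))
```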