Comment by jridgewell

5 months ago

This isn't quite right. In invalid UTF-8, a continuation byte can also emit a replacement char if it appears where a byte sequence should start. E.g., `0b01100001 0b10000000 0b01100001` outputs 3 chars: a�a. Whether you're at the beginning of an output char depends on the last 1-3 bytes.
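
A quick way to check the example, as a sketch in Python (the language choice and the names `data`, `decoded`, and `code_points` are just for illustration; other decoders may differ in how many replacement chars they emit for longer invalid runs, but a single stray continuation byte like this one should give one):

```python
# Bytes from the example above: 'a', a lone continuation byte, 'a'.
data = bytes([0b01100001, 0b10000000, 0b01100001])

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
decoded = data.decode("utf-8", errors="replace")

# List the resulting code points.
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(code_points)  # ['U+0061', 'U+FFFD', 'U+0061']
```

So the lone `0b10000000` starts (and ends) its own output char, even though it has the continuation-byte bit pattern.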

2 comments

rockwotj 5 months ago

> outputs 3 chars

You mean codepoints or maybe grapheme clusters?

Anyways, yeah, it's a little more complicated, but the principle of being able to truncate a string without splitting a codepoint in O(1) is still useful.
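
For what it's worth, a rough sketch of that truncation idea in Python. `truncate_utf8` is a made-up helper name, and it assumes the input is valid UTF-8, which is exactly the case where the "back up over continuation bytes" trick is safe:

```python
def truncate_utf8(data: bytes, max_len: int) -> bytes:
    """Cut `data` to at most `max_len` bytes without splitting a code point.

    Assumes `data` is valid UTF-8. With invalid input, a stray continuation
    byte can itself begin an output char, so this check is not enough.
    """
    if len(data) <= max_len:
        return data
    end = max_len
    steps = 0
    # If the byte just past the cut is a continuation byte (0b10xxxxxx),
    # the cut lands inside a code point. A code point has at most 3
    # continuation bytes, so at most 3 steps back are needed.
    while end > 0 and steps < 3 and (data[end] & 0xC0) == 0x80:
        end -= 1
        steps += 1
    return data[:end]


encoded = "a€".encode("utf-8")        # b'a\xe2\x82\xac'
print(truncate_utf8(encoded, 3))      # b'a'
```

Finding the boundary is a bounded number of byte checks no matter how long the string is. The slice at the end copies bytes in Python, so the O(1) part is locating the cut, not producing the new buffer.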

jridgewell 5 months ago

Yah, I was using char interchangeably with code point. I also used byte instead of code unit.

> truncate a string without splitting a codepoint in O(1) is still useful

Agreed!