Comment by Mikhail_Edoshin

13 hours ago

Here is what a UTF-8 decoder needs to handle:

1. Invalid bytes. Some bytes cannot appear in a UTF-8 string at all. There are two ranges of these: 0xC0-0xC1 and 0xF5-0xFF.

2. Conditionally invalid continuation bytes. Usually you read a continuation byte (0x80-0xBF) and extract its data bits, but after certain lead bytes (0xE0, 0xED, 0xF0, 0xF4) the valid range of the first continuation byte is further restricted.

3. Surrogates. They cannot appear in a valid UTF-8 string, so if they do, it is an error and you need to flag it. Or maybe process them as in CESU-8, but that means making sure they are correctly paired. Or maybe process them as in WTF-8: read them and let them through.

4. Form issues: an incomplete sequence or a continuation byte without a starting byte.
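To make the four cases concrete, here is a minimal sketch of a strict decoder in Python. The function name `decode_utf8`, the `RESTRICTED` table, and the error labels are all made up for illustration. The recovery policy (consume the lead byte, leave the offending byte to be re-examined) is one possible choice, not the only one. Surrogates are simply rejected here; CESU-8 or WTF-8 handling would replace that branch.

```python
# Lead bytes whose FIRST continuation byte has a narrower range than 0x80-0xBF.
# Values are (low, high, reason reported when the byte falls outside the range).
RESTRICTED = {
    0xE0: (0xA0, 0xBF, "overlong encoding"),
    0xED: (0x80, 0x9F, "surrogate"),
    0xF0: (0x90, 0xBF, "overlong encoding"),
    0xF4: (0x80, 0x8F, "above U+10FFFF"),
}

def decode_utf8(data: bytes):
    """Return a list of ("ok", code_point) or ("error", reason) items."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        i += 1
        if b < 0x80:                      # ASCII
            out.append(("ok", b))
            continue
        if b <= 0xBF:                     # case 4: continuation without a lead
            out.append(("error", "stray continuation byte"))
            continue
        if b in (0xC0, 0xC1) or b >= 0xF5:  # case 1: never valid
            out.append(("error", "invalid byte"))
            continue
        # b is a valid lead byte: work out how many continuation bytes follow.
        if b <= 0xDF:
            need, cp = 1, b & 0x1F
        elif b <= 0xEF:
            need, cp = 2, b & 0x0F
        else:
            need, cp = 3, b & 0x07
        low, high, reason = RESTRICTED.get(b, (0x80, 0xBF, None))
        error = None
        for k in range(need):
            if i >= n or not 0x80 <= data[i] <= 0xBF:
                error = "incomplete sequence"   # case 4: sequence cut short
                break
            if k == 0 and not low <= data[i] <= high:
                error = reason                  # cases 2 and 3
                break
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
        out.append(("error", error) if error else ("ok", cp))
    return out
```

For example, `decode_utf8(b"\xe2\x82")` reports an incomplete sequence, while `decode_utf8(b"\xed\xa0\x80")` reports a surrogate followed by two stray continuation bytes, because the rejected bytes are re-examined on their own.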

This is much more complicated than UTF-16, where the only thing to handle is surrogates, and those are pretty straightforward.
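For comparison, a sketch of a UTF-16 decoder over a list of 16-bit code units, assuming the byte order has already been resolved. The name `decode_utf16` and the error label are hypothetical. The only error class is an unpaired surrogate.

```python
def decode_utf16(units):
    """Return a list of ("ok", code_point) or ("error", reason) items."""
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i < n and 0xDC00 <= units[i] <= 0xDFFF:
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
                out.append(("ok", cp))
                i += 1
            else:
                out.append(("error", "unpaired surrogate"))
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            out.append(("error", "unpaired surrogate"))
        else:
            out.append(("ok", u))
    return out
```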
