Comment by dbaupp
17 hours ago
UTF-8 encodes each character into a whole number of bytes (8, 16, 24, or 32 bits), and the 10 continuation marker is only at the start of the extra continuation bytes, it is just data when that pattern occurs within a byte.
You are correct that it never occurs at the start of a byte that isn’t a continuation bytes: the first byte in each encoded code point starts with either 0 (ASCII code points) or 11 (non-ASCII).
No comments yet
Contribute on Hacker News ↗