Comment by otabdeveloper4

1 hour ago

Unicode could have just been encoded statefully with a "current code page" mark byte.

With UTF and emojis we can't have random access to characters anyways, so why not go the whole way?

3 comments

otabdeveloper4

Yikes. With that scheme, a single critical byte dropped or mangled in transmission would leave you unable to interpret the bytes that follow, or make you misinterpret them badly. At least UTF-8 is self-syncing: if you start reading in the middle of a non-rewindable stream whose beginning has already passed, you can unambiguously identify the start of the next valid code point sequence and sync up with the stream from there. You're guaranteed not to have to read more than 4 bytes (6 bytes when UTF-8 was originally designed) to find a sync point.
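To make the resync rule concrete, here is a rough Python sketch (my own illustration, not from any library; `next_sync_point` is a made-up helper name). It relies only on the fact that UTF-8 continuation bytes always look like `10xxxxxx`, so anything else is a candidate start of a sequence. It does not validate the sequence that follows; a real decoder still has to do that.

```python
def next_sync_point(buf: bytes, start: int) -> int:
    """Return the index of the first byte at or after `start` that is not a
    UTF-8 continuation byte (0b10xxxxxx), i.e. a candidate sequence start.
    Hypothetical helper for illustration only."""
    i = start
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i


data = "héllo 😀 wörld".encode("utf-8")

# Pretend we joined the stream one byte into the 4-byte emoji sequence.
emoji_at = data.index("😀".encode("utf-8"))
joined_at = emoji_at + 1

sync = next_sync_point(data, joined_at)
skipped = sync - joined_at          # continuation bytes thrown away
tail = data[sync:].decode("utf-8")  # everything after the sync point decodes cleanly

print(skipped, repr(tail))
```

Worst case here is skipping the three trailing bytes of a 4-byte sequence, so the fourth byte you look at is already the next sync point.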

But if you have to rely on a byte that may have already gone past? No way to pick up in the middle of a stream and know what went before.
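For contrast, a toy version of the stateful idea, assuming a completely hypothetical format: a `0xFF` mark byte followed by a page number, then each data byte is an offset into that 256-code-point page. None of this is a real encoding; it only covers code points below U+10000 whose low byte isn't `0xFF`, which is enough to show the failure mode.

```python
MARK = 0xFF  # hypothetical "switch code page" marker, not a real standard


def encode_stateful(text: str) -> bytes:
    """Toy encoder: emit MARK + page whenever the page changes."""
    out = bytearray()
    page = 0
    for ch in text:
        p, low = divmod(ord(ch), 0x100)
        assert p < 0x100 and low != MARK, "outside what this toy format can represent"
        if p != page:
            out += bytes([MARK, p])
            page = p
        out.append(low)
    return bytes(out)


def decode_stateful(buf: bytes) -> str:
    """Toy decoder: interprets every data byte relative to the last page seen."""
    out = []
    page = 0
    it = iter(buf)
    for b in it:
        if b == MARK:
            page = next(it, 0)
        else:
            out.append(chr(page * 0x100 + b))
    return "".join(out)


text = "αβγ"
stateful = encode_stateful(text)  # MARK, page 0x03, then 0xB1 0xB2 0xB3
utf8 = text.encode("utf-8")

# Join the stateful stream after the mark has gone past: silently wrong text.
late_stateful = decode_stateful(stateful[2:])

# Join the UTF-8 stream one byte late: one replacement character, then correct text.
late_utf8 = utf8[1:].decode("utf-8", errors="replace")

print(repr(late_stateful), repr(late_utf8))
```

The stateful decoder has no way to tell that anything went wrong, while the UTF-8 decoder flags the broken sequence and recovers at the next lead byte.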

yencabulator 1 hour ago

A huge, central part of UTF-8's design is that you can start decoding it from any arbitrary offset: it is self-aligning.