Comment by syncsynchalt
6 hours ago
I've written some Unicode transcoders; UTF-8 decoding devolves to a quartet of switch statements, and each of the issues you've mentioned ends up being a case statement whose solution is to replace the offending sequence with U+FFFD.
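Roughly what I mean, as a minimal sketch in C (not my actual transcoder code — condensed to a single switch on the lead byte's high nibble rather than the full quartet, with truncation handling simplified to illustrate the U+FFFD pattern):

    #include <stdint.h>
    #include <stdio.h>

    #define REPLACEMENT 0xFFFDu

    /* Decode one code point from s (len > 0 bytes available); returns
     * bytes consumed and writes the code point (or U+FFFD) to *out. */
    static size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *out)
    {
        static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
        uint32_t cp;
        size_t need, i;

        switch (s[0] >> 4) {       /* classify by lead byte's high nibble */
        case 0x0: case 0x1: case 0x2: case 0x3:
        case 0x4: case 0x5: case 0x6: case 0x7:
            *out = s[0]; return 1;                          /* ASCII */
        case 0xC: case 0xD: cp = s[0] & 0x1F; need = 1; break;
        case 0xE:           cp = s[0] & 0x0F; need = 2; break;
        case 0xF:
            if (s[0] <= 0xF4) { cp = s[0] & 0x07; need = 3; break; }
            /* 0xF5..0xFF can't start a sequence: fall through */
        default:              /* stray continuation byte, etc. */
            *out = REPLACEMENT; return 1;
        }

        if (need >= len) { *out = REPLACEMENT; return 1; }  /* truncated */
        for (i = 1; i <= need; i++) {
            if ((s[i] & 0xC0) != 0x80) {     /* bad continuation byte */
                *out = REPLACEMENT; return i;
            }
            cp = (cp << 6) | (s[i] & 0x3Fu);
        }

        /* reject overlongs, surrogates, and anything past U+10FFFF */
        if (cp < min_cp[need] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            cp = REPLACEMENT;
        *out = cp;
        return need + 1;
    }

    int main(void)
    {
        const uint8_t in[] = { 0x41, 0xC3, 0xA9, 0xFF, 0xF0, 0x9F, 0x98, 0x80 };
        size_t i = 0;
        uint32_t cp;
        while (i < sizeof in) {
            i += utf8_decode(in + i, sizeof in - i, &cp);
            printf("U+%04X\n", (unsigned)cp);  /* 0041, 00E9, FFFD, 1F600 */
        }
        return 0;
    }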
UTF-16 is simple as well but you still need code to absorb BOMs, perform endian detection heuristically if there's no BOM, and check surrogate ordering (and emit a U+FFFD when an illegal pair is found).
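A comparable sketch for the UTF-16 side — the BOM-less endianness heuristic here (sniffing for a 0x00 byte in the first unit, since most text leads with ASCII-range characters) is just one illustrative choice among several:

    #include <stdint.h>
    #include <stdio.h>

    #define REPLACEMENT 0xFFFDu

    static uint16_t read_u16(const uint8_t *p, int big_endian)
    {
        return big_endian ? (uint16_t)(p[0] << 8 | p[1])
                          : (uint16_t)(p[1] << 8 | p[0]);
    }

    static void utf16_decode(const uint8_t *s, size_t len)
    {
        int be = 1;     /* the spec says BOM-less UTF-16 is big-endian */
        size_t i = 0;

        /* absorb the BOM, or fall back to a heuristic without one */
        if (len >= 2 && s[0] == 0xFE && s[1] == 0xFF)      { be = 1; i = 2; }
        else if (len >= 2 && s[0] == 0xFF && s[1] == 0xFE) { be = 0; i = 2; }
        else if (len >= 2 && s[0] == 0x00)                 { be = 1; }
        else if (len >= 2 && s[1] == 0x00)                 { be = 0; }

        /* a trailing odd byte would also rate a U+FFFD in real code */
        while (i + 1 < len) {
            uint16_t u = read_u16(s + i, be);
            i += 2;
            if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
                if (i + 1 < len) {
                    uint16_t lo = read_u16(s + i, be);
                    if (lo >= 0xDC00 && lo <= 0xDFFF) {  /* well-ordered pair */
                        i += 2;
                        printf("U+%04X\n", (unsigned)
                               (0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
                        continue;
                    }
                }
                printf("U+%04X\n", REPLACEMENT);  /* lone high surrogate */
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                printf("U+%04X\n", REPLACEMENT);  /* lone low surrogate */
            } else {
                printf("U+%04X\n", u);
            }
        }
    }

    int main(void)
    {
        /* BE BOM, 'A', a pair for U+1F600, then a lone high surrogate */
        const uint8_t in[] = { 0xFE,0xFF, 0x00,0x41,
                               0xD8,0x3D, 0xDE,0x00, 0xD8,0x00 };
        utf16_decode(in, sizeof in);    /* U+0041, U+1F600, U+FFFD */
        return 0;
    }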
I don't think there's an argument for either being complex; the UTFs are meant to be as simple and algorithmic as possible. -8 has to deal with invalid sequences, -16 has to deal with byte ordering, and other than that it's bit shifting akin to base64. Normalization is much worse by comparison.
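To show what I mean by the bit shifting, the whole UTF-8 encode direction fits in one small function — masking off 6-bit groups, much like base64's 6-bit slicing (utf8_encode is a made-up name; input assumed to be a valid scalar value):

    #include <stdint.h>
    #include <stdio.h>

    /* Encode cp (assumed valid: <= U+10FFFF, not a surrogate) into buf;
     * returns the number of bytes written. */
    static int utf8_encode(uint32_t cp, uint8_t buf[4])
    {
        if (cp < 0x80) {                            /* 7 bits, as-is */
            buf[0] = (uint8_t)cp;
            return 1;
        } else if (cp < 0x800) {                    /* 5 + 6 bits */
            buf[0] = (uint8_t)(0xC0 | (cp >> 6));
            buf[1] = (uint8_t)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {                  /* 4 + 6 + 6 bits */
            buf[0] = (uint8_t)(0xE0 | (cp >> 12));
            buf[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (uint8_t)(0x80 | (cp & 0x3F));
            return 3;
        } else {                                    /* 3 + 6 + 6 + 6 bits */
            buf[0] = (uint8_t)(0xF0 | (cp >> 18));
            buf[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (uint8_t)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        uint8_t buf[4];
        int n = utf8_encode(0x1F600, buf), i;
        for (i = 0; i < n; i++) printf("%02X ", buf[i]);
        putchar('\n');                  /* prints: F0 9F 98 80 */
        return 0;
    }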
My preference for UTF-8 isn't one of code complexity; I just like that all my '70s-era text processing tools continue working without too many surprises. Features like self-synchronization are nice too, compared to what we _could_ have gotten as UTF-8.