Comment by gritzko

5 months ago

I specialize in protocol design, unfortunately. A while ago I had to code some Unicode conversion routines from scratch, and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, probably for objective reasons. Dealing with multiple Unicode encodings is a minefield. I even wrote an angry write-up back then: https://web.archive.org/web/20231001011301/http://replicated...

UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.

5 comments

sedatk 5 months ago

> For example, how do you handle UTF-8 encoded surrogate pairs?

Surrogate pairs aren’t applicable to UTF-8. That range of Unicode code points (U+D800 to U+DFFF) is simply invalid in UTF-8 and should be treated as such (as a parsing error, as invalid characters, etc.).
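
A minimal sketch of that strict behavior, using Python's built-in `utf-8` codec as one example of a conforming decoder. The helper name `is_strict_utf8` is made up for illustration; the bytes `ED A0 80` are what a naive three-byte encoder would emit for the surrogate U+D800.

```python
# Bytes a naive 3-byte encoder would produce for the lone surrogate U+D800.
ENCODED_SURROGATE = b"\xed\xa0\x80"


def is_strict_utf8(data: bytes) -> bool:
    """Return True only if `data` decodes as well-formed UTF-8.

    Hypothetical helper: it relies on Python's default errors="strict"
    policy, which rejects encoded surrogates.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_strict_utf8(ENCODED_SURROGATE))      # encoded surrogate
print(is_strict_utf8("é".encode("utf-8")))    # ordinary two-byte sequence
```

The same codec also refuses to go the other way: `"\ud800".encode("utf-8")` raises `UnicodeEncodeError` unless a non-default error handler is requested.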

gritzko 5 months ago
In theory, yes. In practice, there are throngs of parsers and converters that might handle such cases differently. https://seriot.ch/projects/parsing_json.html
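
To make the divergence concrete, here is a small sketch (an illustration only, not taken from the linked page) that feeds the same bytes to Python's `utf-8` codec under three of its standard error policies. The input is the surrogate pair D83D DE00 encoded as two three-byte sequences, CESU-8 style; the function name `decode_with` and the `"<rejected>"` marker are assumptions chosen for the example.

```python
# Surrogate pair D83D DE00, each half encoded as its own 3-byte sequence.
PAIR_AS_UTF8 = b"\xed\xa0\xbd\xed\xb8\x80"


def decode_with(data: bytes, policy: str) -> str:
    """Decode `data` under a given error policy (hypothetical helper).

    Returns the marker "<rejected>" when the codec refuses the input.
    """
    try:
        return data.decode("utf-8", errors=policy)
    except UnicodeDecodeError:
        return "<rejected>"


results = {
    policy: decode_with(PAIR_AS_UTF8, policy)
    for policy in ("strict", "replace", "surrogatepass")
}

for policy, text in results.items():
    # ascii() keeps lone surrogates printable on any terminal encoding.
    print(policy, ascii(text))
```

One input, three outcomes: a hard error, a run of U+FFFD replacement characters, or surrogates passed through untouched. Two systems that pick different policies will disagree about what the string is, which is the kind of mismatch a filter has to watch for.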
- sedatk 5 months ago
  
  I mean hopefully not, but the linked example is about JSON parsing, not UTF-8.

cryptonector 5 months ago

> Unicode per se is a dumpster fire

Maybe as to emojis, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.