Comment by xg15

6 months ago

The implication of this is that there are also "roundtrip-unsafe" characters, i.e. flip_case(flip_case(x)) != x, right?

Not sure I wanted to know...

A standard example here is the Turkish dotless I, which yields "ı" → "I" → "i" with most case-conversion libraries.
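
To make it concrete, here is a tiny sketch using Python's locale-insensitive str methods (a Turkish-aware library would map I ↔ ı and İ ↔ i instead):

    # Why "ı" doesn't round-trip under locale-insensitive case mapping.
    dotless = "\u0131"       # "ı" LATIN SMALL LETTER DOTLESS I
    upper = dotless.upper()  # -> "I" (U+0049), the plain Latin capital
    back = upper.lower()     # -> "i" (U+0069), not the original "ı"
    print(dotless, upper, back, back == dotless)  # ı I i False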

  • It feels like unifying it with the ASCII i is the mistake here. There should have just been 4 Turkish characters in 2 pairs, rather than trying to reuse I/i.

    It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.

    • But even with separate characters, you aren't safe, because the ASCII "unification" isn't just Unicode's fault to begin with; in some cases it's historic/cultural in its own right: German ß has distinct upper and lower case forms, but it also has a complicated history in which, depending on locale, the upper case form is sometimes "SS" rather than the upper-case form of ß. In many of those same locales the lower-case form of "SS" is "ss", not ß. It doesn't even try to round-trip, and that's sort of intentional/cultural.

    • This stems from the earlier Turkish 8-bit character sets like IBM code page 857, which Unicode was designed to be roundtrip-compatible with.

      Aside from that, it's unlikely that authors writing both Turkish and non-Turkish words would properly switch their input method or language setting between both, so they would get mixed up in practice anyway.

      There is no escape from knowing (or best-guessing) which language you are performing transformations on, or else just leave the text as-is.

  • So, uh, is this actually desirable per the Turkish language? Or is it more-or-less a bug?

    I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.

I know Halloween was yesterday but let's discover this horror together with some terrifying Python[1]! Turns out, yep.

For upper → lower → upper we have:

Ω ω Ω

İ i̇ İ

K k K

Å å Å

ẞ ß SS

ϴ θ Θ

For lower → upper → lower there are a lot more.
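
The check itself is short; here's a minimal sketch of the idea using Python's built-in str methods (not necessarily the exact code in the gist):

    import sys
    import unicodedata

    # List uppercase characters whose upper -> lower -> upper trip
    # doesn't give back the original character.
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if not ch.isupper():
            continue
        lowered = ch.lower()
        back = lowered.upper()
        if back != ch:
            print(ch, lowered, back, unicodedata.name(ch, "<unnamed>"))

Several of the hits (Ω as the ohm sign, K as the kelvin sign, Å as the angstrom sign) only look like they round-trip; the character you get back is a different code point than the one you started with.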

1. https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...

  • This is really cool! Thanks a lot for the effort!

    Yeah, I got the idea from GP's "ff" example, but I'm kinda shocked there are so many.

Indeed, the parent already gives one: flip_case(flip_case("ﬀ")) = "ff". (Since it's hard to tell with what I guess is default ligature formation, at least in my browser, the first is an 'ff' ligature and the second is two 'f's.)
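
If you want to see it without fighting the ligature rendering, here's a quick check with Python's str.swapcase standing in for flip_case:

    # str.swapcase as a stand-in for flip_case.
    s = "\ufb00"             # "ﬀ" LATIN SMALL LIGATURE FF
    once = s.swapcase()      # -> "FF": two characters, there is no uppercase ligature
    twice = once.swapcase()  # -> "ff": still two characters, not the original "ﬀ"
    print(s, once, twice, twice == s)  # ﬀ FF ff False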

> Not sure I wanted to know...

Oh that's Unicode for you. It's not that they're "roundtrip unsafe", it's just that Unicode is a total and complete clusterfuck.

Bruce Schneier in 2000 on Unicode security risks:

https://www.schneier.com/crypto-gram/archives/2000/0715.html...

Of course the attacks he envisioned materialized, like homoglyph attacks using internationalized domain names.
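
The classic toy example (not from the essay): a Cyrillic "а" is visually identical to the Latin one, so two different strings can render as the same domain name.

    import unicodedata

    # Two visually identical domain names that are different strings.
    real  = "apple.com"
    spoof = "\u0430pple.com"   # first letter is U+0430 CYRILLIC SMALL LETTER A
    print(real == spoof)               # False
    print(unicodedata.name(spoof[0]))  # CYRILLIC SMALL LETTER A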

My favorite line from Schneier: "Unicode is just too complex to ever be secure".

And whether you love Unicode or not, there's a lot of wisdom in there.

When design-by-committee gives birth to something way too complex, insecurity is never far behind.

  • If you tried to come up with a “lightweight” Unicode alternative, it would almost certainly evolve right back into the clusterfuck that Unicode is. In fact, the odds are it would end up even worse.

    Unicode is complex because capturing all the world's writing systems in a single system is categorically complex. Because human meatspace language is complex.

    And even then, if you decided to “rewrite the world's language systems themselves” to conform to a simpler system, it too would eventually evolve right back into the clusterfuck that is the world's languages.

    It’s inescapable. You cannot possibly corral however many billion people live on this planet into something less complex. Humans are too complex and the ideas and emotions they need to express are too complex.

    The fact that Unicode does as good of a job as it does and has stuck around for so long is a pretty big testament to how well designed and versatile it is! What came before it was at least an order of magnitude worse and whatever replaces it will have to be several orders of magnitude better.

    Whatever drives a Unicode replacement would have to demonstrate a huge upset to how we do things… like having to communicate with intelligent life on other planets or something, and even then they probably have just as big of a clusterfuck as Unicode to represent whatever their writing system is. And even then Unicode might be able to support it!

  • > it's just that Unicode is a total and complete clusterfuck

    [...]

    > When design-by-committee gives birth to something way too complex, insecurity is never far behind.

    Human writing is (and has historically been) a "clusterfuck". Any system that's designed to encode every single known human writing system is bound to be way too complex.

    I almost always side with blaming systems that are too complex or insecure by design as opposed to blaming the users (the canonical example being C++), but in the case of Unicode there's no way to make a simpler system; we'll keep having problems until people stop treating Unicode text as something that works more or less like English or Western European text.

    In other words: if your code is doing input validation over an untrusted Unicode string in the year of our Lord 2024, no one is to blame but yourself.

    (That's not to say the Unicode committee didn't make some blunders along the way -- for instance the Han unification was heavily criticized -- but those have nothing to do with the problems described by Schneier).

  • How could you ever make it simple, given that the problem domain itself is complex as fuck? Should we all just have stuck with code pages and proprietary character encodings? Or just left people unable to use their own languages, or even to spell their own names? It’s easy for a culturally blind English speaker to complain that text should be simple and that it must be design-by-committee's fault that it isn’t!

  • Unicode is worse than design-by-committee. It's a design-by-committee attempt to represent several hundred design-by-culture systems in one unified whole. Design-by-culture is even messier than design-by-committee, since everyone in the culture contributes to the design and there's never a formal specification; you just have to observe how something is used!

  • Could you try an argument that Unicode is insecure compared to roll-your-own support for the necessary scripts? You may consider "necessary" to mean "the ones used in countries where at least two of Microsoft, Apple and Sun sold localised OSes".