
Comment by Rendello

6 months ago

I generated this list from a Python script I wrote a few months back for use in property tests in a Rust codebase. It's meant to break parsers that make bad assumptions about UTF-8, like assuming that upper- or lowercasing a character will always result in a character encoding of the same size in bytes, or even that it will result in a single character.

The "ff" ligature, for example, is uppercased by Python as "FF", meaning it both becomes two separate characters, and is one byte smaller overall. I hope it's interesting.

Thanks for the insight. I had never considered this, even though I've researched quite a few oddities in UTF-8 parsing myself over the years. It's the gift that keeps on giving when it comes to ways of breaking things in software, I find. Time to go over my code again.

The implication of this is that there are also "roundtrip-unsafe" characters, i.e. flip_case(flip_case(x)) != x, right?

Not sure I wanted to know...

  • A standard example here is the Turkish dotless I, which yields "ı" → "I" → "i" with most case-conversion libraries.
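    The same is easy to reproduce with Python's built-in (locale-unaware) case mappings:

      >>> "\u0131".upper()           # ı LATIN SMALL LETTER DOTLESS I
      'I'
      >>> "\u0131".upper().lower()   # the round trip lands on ASCII i
      'i'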

    • It feels like unifying it with the ASCII i is the mistake here. There should have just been four Turkish characters in two pairs, rather than trying to reuse I/i.

      It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.

      9 replies →

    • So, uh, is this actually desirable per the Turkish language? Or is it more or less a bug?

      I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.

      5 replies →

  • I know Halloween was yesterday but let's discover this horror together with some terrifying Python[1]! Turns out, yep.

    For upper → lower → upper we have:

    Ω ω Ω

    İ i̇ İ

    K k K

    Å å Å

    ẞ ß SS

    ϴ θ Θ

    For lower → upper → lower there are a lot more.

    1. https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
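    (For the curious, a minimal sketch of the kind of scan the gist does, not the gist itself, using only Python's built-in case mappings:)

      import sys

      # Print every character whose upper → lower → upper round trip fails.
      for cp in range(sys.maxunicode + 1):
          u = chr(cp)
          if u != u.upper():
              continue  # start only from characters that upper() leaves alone
          if u.lower().upper() != u:
              print(u, u.lower(), u.lower().upper())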

    • Lowercasing the symbols for Ohm, Kelvin or Ångström makes no sense.

      For the Greek alphabet cases, aren't there canonical forms for this kind of stuff?
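      (For the letterlike signs at least, yes: NFC normalization maps them back to the ordinary letters, while the theta symbol needs the stronger NFKC. A quick check in Python:)

        >>> import unicodedata
        >>> s = "\u2126\u212a\u212b"    # OHM SIGN, KELVIN SIGN, ANGSTROM SIGN
        >>> [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]
        ['U+03A9', 'U+004B', 'U+00C5']
        >>> unicodedata.normalize("NFKC", "\u03f4")   # ϴ GREEK CAPITAL THETA SYMBOL
        'Θ'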

      3 replies →

    • This is really cool! Thanks a lot for the effort!

      Yeah, I got the idea from GP's "ff" example, but I'm kinda shocked there are so many.

  • Indeed, the parent already gives one: flip_case(flip_case("ff")) = "ff". (Since it's hard to tell with what I guess is default ligature formation, at least in my browser, the first is an 'ff' ligature and the second is two 'f's.)
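    (Concretely, with Python's swapcase standing in for flip_case, and the ligature written as an escape to sidestep the rendering ambiguity:)

      >>> "\ufb00".swapcase()
      'FF'
      >>> "\ufb00".swapcase().swapcase()
      'ff'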

  • > Not sure I wanted to know...

    Oh that's Unicode for you. It's not that they're "roundtrip unsafe", it's just that Unicode is a total and complete clusterfuck.

    Bruce Schneier in 2000 on Unicode security risks:

    https://www.schneier.com/crypto-gram/archives/2000/0715.html...

    Of course the attacks he envisioned materialized, like homoglyph attacks using internationalized domain names.
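    (The classic shape of that attack is two identical-looking strings that compare unequal; for instance, in Python:)

      >>> fake = "p\u0430ypal.com"    # U+0430 CYRILLIC SMALL LETTER A
      >>> fake == "paypal.com"
      False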

    My favorite line from Schneier: "Unicode is just too complex to ever be secure".

    Whether you love Unicode or not, there's a lot of wisdom in there.

    When design-by-committee gives birth to something way too complex, insecurity is never far behind.

    • If you tried to come up with a “lightweight” Unicode alternative, it would almost certainly evolve right back into the clusterfuck that Unicode is. In fact, odds are it would end up even worse.

      Unicode is complex because capturing all the world's writing systems in a single system is categorically complex. Because human meatspace language is complex.

      And even if you decided to “rewrite the world's language systems themselves” to conform to a simpler scheme, it too would eventually evolve right back into the clusterfuck that is the world's languages.

      It’s inescapable. You cannot possibly corral however many billion people live on this planet into something less complex. Humans are too complex and the ideas and emotions they need to express are too complex.

      The fact that Unicode does as good a job as it does, and has stuck around for so long, is a pretty big testament to how well designed and versatile it is! What came before it was at least an order of magnitude worse, and whatever replaces it will have to be several orders of magnitude better.

      Whatever drives a Unicode replacement would have to demonstrate a huge upheaval in how we do things… like having to communicate with intelligent life on other planets. And even then, they probably have just as big a clusterfuck as Unicode to represent whatever their writing system is. And even then, Unicode might be able to support it!

    • > it's just that Unicode is a total and complete clusterfuck

      [...]

      > When design-by-committee gives birth to something way too complex, insecurity is never far behind.

      Human writing is (and has historically been) a "clusterfuck". Any system that's designed to encode every single known human writing system is bound to be way too complex.

      I almost always side with blaming systems that are too complex or insecure by design as opposed to blaming the users (the canonical example being C++), but in the case of Unicode there's no way to make a simpler system; we'll keep having problems until people stop treating Unicode text as something that works more or less like English or Western European text.

      In other words: if your code is doing input validation over an untrusted Unicode string in the year of our Lord 2024, no one is to blame but yourself.

      (That's not to say the Unicode committee didn't make some blunders along the way -- for instance the Han unification was heavily criticized -- but those have nothing to do with the problems described by Schneier).

    • How could you ever make it simple, given that the problem domain itself is complex as fuck? Should we all just have stuck with code pages and proprietary character encodings? Or just have people unable to use their own languages, or even to spell their own names? It’s easy for a culturally blind English speaker to complain that text should be simple, and that it must be design-by-committee’s fault that it isn’t!

    • Unicode is worse than design-by-committee. It's a design-by-committee attempt to represent several hundred design-by-culture systems in one unified whole. Design-by-culture is even messier than design-by-committee, since everyone in the culture contributes to the design and there's never a formal specification; you just have to observe how something is used!

    • Could you try an argument that Unicode is insecure compared to roll-your-own support for the necessary scripts? You may consider "necessary" to mean "the ones used in countries where at least two of Microsoft, Apple and Sun sold localised OSes".