Comment by xg15

6 months ago

The implication of this is that there are also "roundtrip-unsafe" characters, i.e. flip_case(flip_case(x)) != x, right?

Not sure I wanted to know...

A standard example here is the Turkish dotless I, which yields "ı" → "I" → "i" with most case-conversion libraries.
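
To make it concrete, here is a tiny sketch using Python's locale-insensitive str methods (a Turkish-aware library would map I ↔ ı and İ ↔ i instead):

    # Why "ı" doesn't round-trip under locale-insensitive case mapping.
    dotless = "\u0131"       # "ı" LATIN SMALL LETTER DOTLESS I
    upper = dotless.upper()  # -> "I" (U+0049), the plain Latin capital
    back = upper.lower()     # -> "i" (U+0069), not the original "ı"
    print(dotless, upper, back, back == dotless)  # ı I i False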

  • It feels like unifying it with the ASCII i is the mistake here. There should have just been 4 Turkish characters in 2 pairs, rather than trying to reuse I/i.

    It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.

    • But even with separate characters, you aren't safe, because the ASCII "unification" isn't just Unicode's fault to begin with; in some cases it's historic/cultural in its own right: German ß has distinct upper and lower case forms, but it also has a complicated history in which, depending on locale, the upper case form is sometimes "SS" rather than the upper-case form of ß. In many of those same locales the lower-case form of "SS" is "ss", not ß. It doesn't even try to round-trip, and that's sort of intentional/cultural.

    • This stems from the earlier Turkish 8-bit character sets like IBM code page 857, which Unicode was designed to be roundtrip-compatible with.

      Aside from that, it's unlikely that authors writing both Turkish and non-Turkish words would properly switch their input method or language setting between both, so they would get mixed up in practice anyway.

      There is no escape from knowing (or best-guessing) which language you are performing transformations on, or else just leave the text as-is.

  • So, uh, is this actually desirable per the Turkish language? Or is it more-or-less a bug?

    I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.

I know Halloween was yesterday but let's discover this horror together with some terrifying Python[1]! Turns out, yep.

For upper → lower → upper we have:

Ω ω Ω

İ i̇ İ

K k K

Å å Å

ẞ ß SS

ϴ θ Θ

For lower → upper → lower there are a lot more.
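
The check itself is short; here's a minimal sketch of the idea using Python's built-in str methods (not necessarily the exact code in the gist):

    import sys
    import unicodedata

    # List uppercase characters whose upper -> lower -> upper trip
    # doesn't give back the original character.
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if not ch.isupper():
            continue
        lowered = ch.lower()
        back = lowered.upper()
        if back != ch:
            print(ch, lowered, back, unicodedata.name(ch, "<unnamed>"))

Several of the hits (Ω as the ohm sign, K as the kelvin sign, Å as the angstrom sign) only look like they round-trip; the character you get back is a different code point than the one you started with.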

1. https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...

  • This is really cool! Thanks a lot for the effort!

    Yeah, I got the idea from GP's "ff" example, but I'm kinda shocked there are so many.

Indeed, the parent already gives one: flip_case(flip_case("ﬀ")) = "ff". (Since it's hard to tell with what I guess is default ligature formation, at least in my browser, the first is an 'ff' ligature and the second is two 'f's.)
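
If you want to see it without fighting the ligature rendering, here's a quick check with Python's str.swapcase standing in for flip_case:

    # str.swapcase as a stand-in for flip_case.
    s = "\ufb00"             # "ﬀ" LATIN SMALL LIGATURE FF
    once = s.swapcase()      # -> "FF": two characters, there is no uppercase ligature
    twice = once.swapcase()  # -> "ff": still two characters, not the original "ﬀ"
    print(s, once, twice, twice == s)  # ﬀ FF ff False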

> Not sure I wanted to know...

Oh that's Unicode for you. It's not that they're "roundtrip unsafe", it's just that Unicode is a total and complete clusterfuck.

Bruce Schneier in 2000 on Unicode security risks:

https://www.schneier.com/crypto-gram/archives/2000/0715.html...

Of course the attacks he envisioned materialized, like homoglyph attacks using internationalized domain names.
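
The classic toy example (not from the essay): a Cyrillic "а" is visually identical to the Latin one, so two different strings can render as the same domain name.

    import unicodedata

    # Two visually identical domain names that are different strings.
    real  = "apple.com"
    spoof = "\u0430pple.com"   # first letter is U+0430 CYRILLIC SMALL LETTER A
    print(real == spoof)               # False
    print(unicodedata.name(spoof[0]))  # CYRILLIC SMALL LETTER A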

My favorite line from Schneier: "Unicode is just too complex to ever be secure".

And whether you love Unicode or not, there's a lot of wisdom in there.

When design-by-committee gives birth to something way too complex, insecurity is never far behind.

  • If you tried to come up with a “lightweight” Unicode alternative, it would almost certainly evolve right back into the clusterfuck that Unicode is. In fact, the odds are it would end up even worse.

    Unicode is complex because capturing all the world's writing systems in a single system is categorically complex. Because human meatspace language is complex.

    And even then, if you decided to “rewrite the world's language systems themselves” to conform to a simpler system, it too would eventually evolve right back into the clusterfuck that is the world's languages.

    It’s inescapable. You cannot possibly corral however many billion people live on this planet into something less complex. Humans are too complex and the ideas and emotions they need to express are too complex.

    The fact that Unicode does as good of a job as it does and has stuck around for so long is a pretty big testament to how well designed and versatile it is! What came before it was at least an order of magnitude worse and whatever replaces it will have to be several orders of magnitude better.

    Whatever drives a Unicode replacement would have to demonstrate a huge upset to how we do things… like having to communicate with intelligent life on other planets or something, and even then they probably have just as big of a clusterfuck as Unicode to represent whatever their writing system is. And even then Unicode might be able to support it!

  • > it's just that Unicode is a total and complete clusterfuck

    [...]

    > When design-by-committee gives birth to something way too complex, insecurity is never far behind.

    Human writing is (and has historically been) a "clusterfuck". Any system that's designed to encode every single known human writing system is bound to be way too complex.

    I almost always side with blaming systems that are too complex or insecure by design as opposed to blaming the users (the canonical example being C++), but in the case of Unicode there's no way to make a simpler system; we'll keep having problems until people stop treating Unicode text as something that works more or less like English or Western European text.

    In other words: if your code is doing input validation over an untrusted Unicode string in the year of our Lord 2024, no one is to blame but yourself.

    (That's not to say the Unicode committee didn't make some blunders along the way -- for instance the Han unification was heavily criticized -- but those have nothing to do with the problems described by Schneier).

  • How could you ever make it simple, given that the problem domain itself is complex as fuck? Should we all just have stuck with code pages and proprietary character encodings? Or just left people unable to use their own languages, or even to spell their own names? It’s easy for a culturally blind English speaker to complain that text should be simple and that it must be design-by-committee's fault that it isn’t!

  • Unicode is worse than design-by-committee. It's a design-by-committee attempt to represent several hundred design-by-culture systems in one unified whole. Design-by-culture is even messier than design-by-committee, since everyone in the culture contributes to the design and there's never a formal specification; you just have to observe how something is used!

  • Could you try an argument that Unicode is insecure compared to roll-your-own support for the necessary scripts? You may consider "necessary" to mean "the ones used in countries where at least two of Microsoft, Apple and Sun sold localised OSes".