Comment by xdennis
4 years ago
Looks like I'm in the minority. I always use spaces and non-ASCII characters in filenames.
In many languages it's a requirement. For example, in Romanian, there are 8 words that collide with „fata“ if you remove the diacritics (fata, fată, fața, față, făta, făță, fâța, fâță).
Given that we have to use diacritics, spaces don't seem like a big deal.
> Given that we have to use diacritics, spaces don't seem like a big deal.
There is one big difference: CLI utilities don't usually care about diacritics (though encoding issues can throw a wrench in that), but they care a lot about spaces. So putting spaces in filenames requires properly quoting or escaping parameters, whereas diacritics do not. That makes one-off shell snippets and scripts a lot more annoying (though TBH I tend to shy away from those anyway, these days).
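To make the quoting point concrete, here is a minimal shell sketch (the filename is a made-up example):

```shell
# Create a filename containing a space (made-up example name).
touch "my notes.txt"

# Unquoted, the shell splits the value on the space and passes
# two separate arguments ("my" and "notes.txt"), so this fails:
#   f=my\ notes.txt; ls -l $f
f="my notes.txt"

# Quoted, the whole name is passed as a single argument:
ls -l "$f"

rm -- "my notes.txt"
```

A filename containing only diacritics (say `față.txt`) needs none of this; word splitting only happens on whitespace (and other `IFS` characters), so the quoting burden is specific to spaces.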
So do I. I have a language, and I'm not afraid to use it. My computer should speak it just as well as I do.
There's a server at work whose name contains a non-ASCII character. I've run into compatibility issues lots of times where I can't connect. I prefer to just use English with ASCII and be happy.
Server names are different. They are by and large machine-facing identifiers, whereas filenames are split between being machine-facing, human-facing, or both. That makes their support of Unicode a much more critical (and appealing) proposition.
We have a few words in Czech that depend on diacritics to be unique as well - though not as bad as this example - but people just manage without. Hell, I don't even bother installing the Czech keyboard; if I REALLY need it (like in names), I just google for words that have the character and copy it.
So how did you deal with it in the 80s/90s?
Not sure about Romanian, but for many other languages people essentially came up with transliteration schemes (multiple, incompatible, ambiguous) to squeeze the language into ASCII.
The resulting text was understandable to the "computer people" but not to the general population, who did not use the networks back then. It was perhaps somewhat comparable to US parents some time ago encountering the "SMS slang" used by their teenagers.
As you would assume: use ASCII and deduce from context. Many people still do that.
That has led to phantom diacritics: reading letters in unfamiliar words/names based on what you assume they are. For example, some pronounce Chirica as Chirică because they assume someone forgot to type the breve in ă.
I call it the habanero trap. There is no ñ in "habanero", yet a lot of people say "habanyero", probably by analogy with "jalapeño".
Back in the day there were dozens of character sets that served as alternatives to US-ASCII. Having once worked on an email client, I needed to bake in a bunch of translation tables to convert text sent in those encodings into UTF-8.
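For what it's worth, that kind of conversion is a one-liner nowadays with `iconv`; a sketch, using ISO-8859-2 (Latin-2, which covers Romanian and Czech) as an example legacy charset:

```shell
# Byte 0xE3 is "ă" (a with breve) in ISO-8859-2.
printf '\xE3' > legacy.txt

# Re-encode the legacy file as UTF-8 ("ă" becomes the two bytes C4 83).
iconv -f ISO-8859-2 -t UTF-8 legacy.txt

rm legacy.txt
```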
Hmmm, I thought I was fluent in Romanian (born there and lived there for 26 years), but I only know 5 of those 8 words...
According to Google Translate the first two are "girl" and the rest are "face". =)
Google Translate is a horrible tool for "translating" single words or lists of unrelated words.
Use a proper dictionary for that. The very nature of statistical models makes proper translation without context impossible for these systems, especially when uncommon words and diacritics are involved.
* fata - the girl
* fată - girl
* fața - the face
* față - face
* făta - was giving birth
* făță - a small fish, or a child who won't sit still
* fâța - was fussing
* fâță - variant of făță
As you might infer from the first 4, Romanian uses a postfix definite article, and for singular feminine words you can't tell the definite from the indefinite form if you use only ASCII.
That doesn't seem unusual. Only the first 5 are very common.
>In many languages it's a requirement. For example, in Romanian, there are 8 words that collide with „fata“ if you remove the diacritics
That is what context is for.