Comment by pimterry
4 years ago
I work on a complex desktop application, and it's been astounding the number of bugs that have appeared over the years triggered by spaces and other unusual characters in file names. If you do anything with subprocesses or path processing, it's absurdly easy to hit in a thousand different ways, over and over again.
Pro tip: rename your development directory (or even better: the workspace path in CI) to put a space and/or special characters in it.
Forces you to deal with this properly, and immediately ensures that every automated test checks this case without you having to remember every time. Hasn't been particularly inconvenient, since I'm autocompleting it 99% of the time anyway, and I haven't shipped a single path parsing bug since.
Seems like MS had the same idea according to an answer in the link:
> Microsoft intentionally made programs install to C:\Program Files on Windows 95+ to force programmers to deal with spaces in filenames.
I wish they did "User Files" instead of "Users" too, because so much software breaks on the home area having a space in it.
Not least, it makes writing scripts for various shells and getting the quoting rules right an absolute pain as well...
They used to. The folder was called `Documents and Settings` until Win7.
Huh, spaces. There's way too much software, especially on Windows, that breaks when there are Cyrillic characters in a path. I'll let you guess how I found out.
If you have a username with your full name (plus point if you have special characters in your name), you will get the whole deal with shitty programs. I’m not sure if it’s me, but there were cases I simply could not use a program installed in such a location, to the point where at my previous (admittedly shitty) workplace, we often installed software in a root location…
Laughs in C:\PROGRA~1\ (try it, still works in Windows 10)
Apart from what others mentioned, that can only work if the file system automatically creates 8.3 names. NTFS does not necessarily do that (https://docs.microsoft.com/en-us/windows-server/administrati...)
There is no guarantee that the short name has that. In fact on a lot of German Windows installations it was PROGRA~2.
Truly lifesaving for when shell quoting gets in the way.
And yet they introduced C:\ProgramData in later versions.
Imagine if they made programmers put 64 bit DLLs in a "System32" directory and 32 bit DLLs in a "SysWoW64" directory. That would really keep 'em on their toes!
why "yet"?
one occurrence is enough to make devs care about it
It not only keeps people on their toes due to the whitespace. The folder name is even localized. E.g. with German settings there is C:\Programme and C:\Programme (x86).
You can still use the English names, though.
I wonder how much global work could have been saved if Microsoft also provided a covered interface for all paths in the system. Not sure if there is any, but one good implementation might save thousands of poor implementations required to handle it.
You mean like the Environment.SpecialFolder enum?
https://docs.microsoft.com/en-us/dotnet/api/system.environme...
There are several other classes that take care of getting folders, least of which checking system variables.
You have %Appdata% and friends.
On the other hand their case sensitivity behaviour means that “cross-platform” Java applications can break if they are run on a non-windows platform where opening files is case sensitive (unlike on windows)
It's actually a feature.
Easier to add a flag to ignore case rather than fix bugs where files only differ by case and are therefore overwritten on a case-insensitive filesystem.
Then they made poor APIs so that you have to do this to get it correct:
https://docs.microsoft.com/en-gb/archive/blogs/twistylittlep...
In *nix at least you can call execve or other APIs that take a char *argv[] and the whole problem is largely solved and you don't need to quote things.
I just wish they had a decent way to execute programs with arguments that might include spaces. But no, every program can do argument delineation differently.
And Microsoft even provides three different slightly incompatible ways to parse arguments: CommandLineToArgvW, the CRT and cmd.exe.
I know that at least like, idk like 3-5 years ago, when I had gotten a new windows laptop (windows 7 or 8 I think), setting the main account to have the name "" (without the quotes), caused some problems with the basic functioning, including, I think, with some pre-installed programs,
So, some things were still being handled not quite right (whether that's because it shouldn't be allowed to be the username, or because programs should handle it being in the path, I'm not sure, but probably one of those.)
And then to really mess you up and ensure you handle parens properly, threw “(x86)” into the mix. (A real pain on some REPLs as well as dealing with environment variables).
Except for programs that were too old / obscure to fix I guess. I think at least the Symbian Development Kit was such that builds would fail with strange errors unless you installed it in any other path than the default immediate subdirectory of C:\, let alone under "Program Files".
Plenty of new stuff does this. As long as you're not .net or javascript, nobody scrutinizes the trash work developers charge money for.
Funny, in the Italian Win9x it is C:\Programmi, which I always thought was more convenient because of the lack of spaces :)
Sure. Microsoft only ever ships features
At one time there was no number 0. Half of binary was missing.
Shame it wasn't
> C:\P̷̧̽r̸̬͘ŏ̵̮g̷̜͘r̸̦̋a̴͎̒m̶̲̈́ ̷̠̉F̵͇̈ĩ̴̫l̶̨͗ë̵̦s̸͚͆\
There was a short path name IIRC like prog~1
Could you please link the reference?
C:\PROGRA~1
Easy fix!
> Pro tip: rename your development directory (or even better: the workspace path in CI) to put a space and/or special characters in it.
A former co-worker changed his name in our auth system to include an apostrophe, so that whenever we handled names wrong he'd find it.
I set my nickname to U+FFFD at one point in one work system, resulting in a variety of bug reports and concerned emails. I think I dropped it since it was generating false reports from people who didn't check what character the page contained before reporting it.
To have such thoughtful coworkers. On an old team I had two coworkers named Chris, and once in a blue moon when they reviewed each other's code, master would start crashing because one of them accidentally left in an absolute path starting with "/home/chris/".
A related tip for CI: change the system time to be a time zone that is during your work hours in a different day already than UTC. Really helped getting failures earlier than 4pm PST.
Could you consider rephrasing this? It sounds like an interesting observation that I'd love to understand, but I'm genuinely not able to parse it.
My best guess is "change the system time to be a timezone for which, during your work hours, the other-timezone is in a different day than UTC is" - but I'm still not sure what effect that would have on CI failures.
At my last job we had a wild time-zone bug that only happened with your system location set to Mumbai. I left mine set to that for the rest of my time there.
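For anyone trying to picture the time zone trick, here is a minimal Python sketch of the underlying effect. The instant is made up for illustration, and it assumes India Standard Time can be modelled as a fixed UTC+05:30 offset (it has no daylight saving). The point is only that the same instant can fall on two different calendar dates, which is the kind of thing that trips date-handling code.

```python
from datetime import datetime, timedelta, timezone

# India Standard Time is UTC+05:30 with no daylight saving,
# so a fixed offset is enough for this illustration.
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Hypothetical instant: 20:00 UTC on 1 January 2024.
instant = datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)

utc_date = instant.date()
local_date = instant.astimezone(IST).date()

print(utc_date)    # 2024-01-01
print(local_date)  # 2024-01-02
```

A test that compares "today" computed in local time against a date computed in UTC would fail during that window, and running CI in such a zone moves the window into working hours.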
One of the systems I built is being used by a group of younger people. I included an emoji in the superuser account name, just to make sure it would work. And to remind me to think more broadly about user input.
I used to have a space in my user name and even contemplated adding a bit of non-1252 Unicode. You find a lot of issues, but unfortunately often in tools you have little control over and end up not being able to work effectively at times. It ended up being more frustrating than helpful.
Áčçëñts hęlp tóø
For anyone curious, this is called Pseudo-localization (https://en.wikipedia.org/wiki/Pseudolocalization). I first stumbled across this in Raymond Chen's blog.
I add a Japanese character into any .py, .js and .html file to ensure that Unicode is working properly through the entire chain. Mostly in form of a variable which gets passed along, even in URL parameters.
my test accounts always have emojis + accents + other weird characters.
it keeps everybody on their toes lol.
the proper name of the glorious sultan of slack, j. r. "bob" dobbs, has the quotation marks and therefore is a great subject for this
Oh, I like this!
Obligatory xkcd https://xkcd.com/327/
> it's been astounding the number of bugs that have appeared over the years triggered by spaces and other unusual characters in file names
If you consider spaces “unusual” I would say you haven’t encountered a single average user in your lifetime. Spaces in file-names is the single most common thing people have, outside programming environments.
As a x-plat developer, the only platforms where I (still) regularly encounter these kinds of bugs are those where solving problems through scripting is common, like Linux, where the primary means of operation is through stringly-typed statements getting parsed and processed in an untyped fashion. It's not very reliable.
On Windows people more often use “real APIs” (because scripting doesn't really work as well), but then these problems just go away.
Pros and cons, I guess.
It's especially funny that it affects Linux so much. Most file systems allow everything except `/` and NULL in file names. Early AT&T UNIX even allowed NULLs! POSIX shells use the IFS variable to perform field splitting, and it defaults to <space>, <tab>, and <newline>. The choice to perform field splitting by default (particularly with spaces in the default IFS set) has caused no end of headaches for developers and users.
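To make the field-splitting point concrete, here is a rough Python sketch. It only imitates what a shell does and is not a shell implementation; the file name is hypothetical. `str.split()` stands in for an unquoted expansion being split on the default IFS characters, and `shlex.quote` shows one way to produce a form a POSIX shell reads back as a single word.

```python
import shlex

filename = "this is my config.txt"  # hypothetical file name

# Roughly what an unquoted $filename expansion does with the default IFS:
# the single name falls apart into separate fields.
fields = filename.split()

# shlex.quote returns a form that POSIX shell rules parse as one word.
quoted = shlex.quote(filename)
round_trip = shlex.split("cat " + quoted)

print(fields)      # ['this', 'is', 'my', 'config.txt']
print(quoted)      # 'this is my config.txt'
print(round_trip)  # ['cat', 'this is my config.txt']
```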
It doesn't even have to be complex, often basic automation tasks fail with spaces and special characters. Honestly, treating a file system like a natural language processor is a bad idea. Besides at this point with how digital we have all become who can't understand...
thisismyconfig.txt vs this is my config.txt or this_is_my_config.txt
...I've forced myself to stop using spaces, special characters, and even caps. They are all constructs that provide minimal value for the extra complexity.
> thisismyconfig.txt vs this is my config.txt or this_is_my_config.txt
Just wondering, what is the readability of this for people who are dyslexic?
I'm not sure, but my gut instinct is that it wouldn't help. Dyslexia rates are much lower in China, so I suppose we could start naming files with Chinese characters (on systems that support Unicode). It would take a bit to get used to, but eventually we'd develop a pidgin language for when we talk about software, much like how if you overhear Chinese or Vietnamese developers they will mix in English words like "linked list" into their sentences, because there's not a more natural sounding alternative.
Switching to Chinese would also help eliminate the spaces issue.
tbh I'm not dyslexic and realized the spaces make it really difficult to know what the filename actually is. If you just take the second example, how would you know if the file was "this is my config.txt" versus "config.txt"?
Aside from parsing errors it just seems to lend itself to ambiguity.
Or in my case, people for whom English is a second language, or have low education levels.
Saying, "who can't understand..." is arrogant, selfish, and an example of why normal people hate people in the SV echo chamber.
I'm similar, but I would like to support labels intended for humans, along with various translations, as metadata on top of e.g. filesystem path components.
You nailed it - getting rid of spaces and dashes and underscores is extremely human-hostile. People added spaces to the English language for a reason, and that's because they make it way easier to read.
Your system is only intended for other programs to interact with? Go nuts, make hex UUIDs. Actual people are supposed to use it? You need separator characters.
I also don't see how those characters add "extra complexity" unless you're doing dumb things like text processing on paths and filenames (as opposed to using OS/library functions that handle paths correctly) - in which case, there's your problem.
Why stop there. A computer works more efficiently with numbers rather than strings, so let’s just give each file a number instead of a string. Besides, at this point with how digital we have all become who can’t understand… But wait, that already exists and is called an inode.
A file system has a human interface and a computer interface. Don’t mix them. Let users give file names in whichever way they please.
> treating a file system like a natural language processor is a bad idea
could you please explain what you mean by that?
My favorite filename special character bug was when I implemented CD ripping in 2005, and one of our beta testers ripped a CD with a song called "Have You Ever?". My code wasn't prepared to filter out the question mark on Windows.
I just hit the one where an album folder ends in a period. Rsync copies every time because the period is dropped by the filesystem silently. :-/
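Both of these come down to Windows naming rules: a handful of characters are not allowed in file names, and trailing dots and spaces get dropped. A minimal Python sketch of a sanitizer, assuming an underscore is an acceptable replacement; the function name and sample titles are hypothetical, and it deliberately does not handle reserved device names such as CON or NUL.

```python
# Characters Windows does not allow in file names.
WINDOWS_RESERVED = '<>:"/\\|?*'


def sanitize_for_windows(name: str, replacement: str = "_") -> str:
    """Replace reserved or control characters and strip trailing dots/spaces."""
    cleaned = "".join(
        replacement if ch in WINDOWS_RESERVED or ord(ch) < 32 else ch
        for ch in name
    )
    # Windows silently drops trailing dots and spaces, so remove them up front.
    cleaned = cleaned.rstrip(". ")
    return cleaned or replacement


print(sanitize_for_windows("Have You Ever?"))      # Have You Ever_
print(sanitize_for_windows("Greatest Hits Vol."))  # Greatest Hits Vol
```

Doing the stripping explicitly means the name you store is the name the filesystem keeps, which avoids the endless re-copy described above.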
> Pro tip: rename your development directory
I changed my username to not contain a space because it was too annoying to deal with all the random dev tools breaking. The worst offender was probably npx on Windows [1] (resolved after four years by deprecating npx), but it was far from the only one (though the JS ecosystem was somehow the worst in this regard of all languages I worked with).
1: https://github.com/zkat/npx/issues/100
Same, even I had to rename my user folder to not have a space because so many tools were breaking.
> other unusual characters in file names
Saw a few hacks where malware authors used the RTL feature (which is baked into Windows) to obfuscate file extensions. It looked like .exe.innocuous-document.docx, but was actually .docx.innocuous-document.exe
This exact vulnerability in most modern code editors just made the rounds, allowing smuggling malicious code right through review.
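A small Python sketch of how one might flag such names. The character set below is my own selection of bidirectional control characters, not an exhaustive or official list, and the file name is made up.

```python
import unicodedata

# Bidirectional control characters that can change how a name is displayed.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings and overrides
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
}


def find_bidi_controls(name: str) -> list[str]:
    """Return the Unicode names of any bidi control characters in `name`."""
    return [unicodedata.name(ch) for ch in name if ch in BIDI_CONTROLS]


# Hypothetical name: the real extension is .exe, but after the override
# character the tail may be rendered reversed, so it appears to end in "docx".
suspicious = "innocuous-document\u202excod.exe"

print(find_bidi_controls(suspicious))    # ['RIGHT-TO-LEFT OVERRIDE']
print(find_bidi_controls("plain.docx"))  # []
```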
My Mac is formatted case sensitive when the default is case insensitive. This will also catch a ton of import related bugs.
League of legends doesn’t run until I sed files for instance.
I once returned a printer because the Mac driver and support software expected and enforced case insensitive access and basically couldn't install properly on my case-sensitive HFS+ volume. It half installed and blatantly just didn't work in any way when installed.
Adobe software used to refuse to install on case sensitive file systems back in the not too distant past.
I have coworkers on Mac that write node/JS code. Every once in awhile I'd pull down the latest code and it wouldn't run. I'm on Linux.
Sure enough, they had SomeFile and were importing Somefile and it works fine on Mac but not on Linux (which, of course, is what our production servers use). It amazes me that "works fine on my machine" is still a thing when I definitely worked at companies that solved this back in the 2000s. It was solved. It was done. Then devs became enamored with running everything locally. Even dozens of microservices or databases. Even though JS is fairly isolated, you still have NPM packages that need built against the local OS and C/C++ library and compilers, etc. Which also has caused issues in the past.
Good news, we have solutions. You could use continuous integration and software containers like Docker.
my favorite is often being the only developer on linux and giving two files with different casing and watching their systems crash and burn.
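If you'd rather catch this before it reaches anyone's machine, a check along these lines could run in CI. This is a sketch only: the function name and file list are hypothetical, and `str.casefold()` is an approximation, since real case-insensitive filesystems apply their own folding rules.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for p in paths:
        groups[p.casefold()].append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical repository listing.
files = ["src/SomeFile.js", "src/Somefile.js", "src/index.js"]

print(case_collisions(files))  # [['src/SomeFile.js', 'src/Somefile.js']]
```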
I also enjoyed doing that, but had to make a DMG just for Steam because it straight-up refuses to run on a case sensitive FS (that's true on Windows, also, which I suspect is how we all got here). I think the most recent Steam versions either caught wind of my trickery or -- more likely -- run something from $HOME/Library/SomethingOrOther and thus the work-around no longer works
When I got a new Mac, I just gave up and acquiesced to the case-retentive world :-(
Circa Y2k, I learned that the OSX Palm Pilot software didn't work with case sensitive. I've since given up and stuck with the default. (I'm anti-case folding in general, because of the ambiguity.)
I maintain a similar system, where a variety of companies submit files that get processed through multiple services - it is astounding how ridiculous people’s naming of files can be; spaces are the least concerning!
> anything with subprocesses
I'm begging software developers to stop using subprocess APIs that take a string argument (system(), child_process.exec(), Process.Start(string)) and start using subprocess APIs that take an array of arguments (execvp(), child_process.execFile(), Process.Start(string, IEnumerable<string>).)
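The same idea in Python, as a minimal sketch: `subprocess.run` with a list passes each element to the child as exactly one argument, with no shell in between. The argument value is made up, and the child is just the current Python interpreter echoing what it received.

```python
import subprocess
import sys

# Hypothetical argument containing spaces and quotes, for illustration only.
tricky = 'my "odd" file.txt'

# Passing a list: each element arrives in the child as one argument,
# so no manual quoting or escaping is needed.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1:])", tricky],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()

print(received)  # ['my "odd" file.txt']
```

The string-based form would need the quotes and spaces escaped correctly for whichever shell ends up parsing the command, which is exactly where these bugs tend to come from.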
While I agree that we should do this in the ideal world, doing so will inevitably break other necessary tools so it is unworkable for me :(
And add an emoji, a character in a right to left language ( א) and perhaps 太. Maybe italicize one of those too...
Spaces are a pain in the ass when you're using CLI so I'd rather enforce a no space policy
Most shells will behave just fine if you put a quote (single or double) before anything that has a space.
A small extra step but something you get used to if you spend a lot of time in the cli.
Escaping spaces is a pain. I have to do it every day.
I set up symlinks which help navigating around but then the relative paths are wrong for git.
No thanks.
Friends don't let friends put spaces in paths
I don't know if it's still a problem, but it used to break Python virtualenv badly. If your working directory had a space anywhere in the path, it would throw a huge fit and not work. Which is problematic when the expected name for a Mac's boot drive is "Macintosh HD" (if you ever had a reason to run a virtualenv outside of your home directory).
Pro tip2: Use std lib path processing utilities
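As one illustration of that tip, here is a short sketch using Python's `pathlib`. The paths are hypothetical, and the "pure" path classes are used so the example behaves the same on any OS without touching the disk.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical paths, for illustration only.
win = PureWindowsPath(r"C:\Program Files (x86)\My App") / "data" / "config file.txt"
posix = PurePosixPath("/home/chris/house plans") / "Téléchargements" / "plan.txt"

print(win.name)         # config file.txt
print(win.parent.name)  # data
print(posix.suffix)     # .txt

# Windows paths written with / or \ compare equal once parsed.
same = PureWindowsPath("C:/Program Files/App") == PureWindowsPath(r"C:\Program Files\App")
print(same)             # True
```

Joining and splitting through the library means spaces, parentheses and accents are never parsed by hand.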
Sometimes / works as a path separator in Windows, sometimes it doesn't. It's not predictable.
I never use / on Windows as a result.
The only common place where it doesn't work is in CMD for executing programs and as arguments for built-in commands. Everything else goes directly to the relevant APIs which don't care about / or \.
These days using CMD instead of PowerShell should be rare enough and PowerShell certainly doesn't mind the slashes.
It's easy to tell users to make a folder with no spaces if you're setting up a global path, however if you have an application that runs in user directories things can become painful fast. Changing your user name is a pain and can leave things inconsistent, but having to handle all the variations in people's names with spaces, punctuation, international characters, can just be mind boggling.
I did something similar on accident. I used to keep all my development work synced with Dropbox and I had a work and a personal account. So any of my own projects would have /Dropbox (Personal)/ in the path which did catch some bugs. Dropbox renamed my folder to "Dropbox (Personal)" automatically when connecting a work account.
More importantly than your source files, put your testing data on such a path as well. Nobody uses absolute paths in testing so it doesn't matter how many spaces your absolute path has if your input is "./tests/file1". Put those files in a folder with spaces too and throw in a unicode character for good measure.
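A rough Python sketch of that setup, assuming the filesystem accepts Unicode names. The directory name and helper are made up; the idea is just that every test reading from this directory exercises spaces, parentheses and non-ASCII characters for free.

```python
import tempfile
from pathlib import Path


def make_awkward_dir(base: Path) -> Path:
    """Create a directory whose name has spaces, parentheses and non-ASCII."""
    awkward = base / "test data (x86) Téléchargements 太"
    awkward.mkdir(parents=True, exist_ok=True)
    return awkward


with tempfile.TemporaryDirectory() as tmp:
    data_dir = make_awkward_dir(Path(tmp))
    sample = data_dir / "file 1.txt"
    sample.write_text("hello", encoding="utf-8")
    round_trip = sample.read_text(encoding="utf-8")

print(round_trip)  # hello
```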
> Pro tip: rename your development directory (or even better: the workspace path in CI) to put a space and/or special characters in it.
The problem with that is that YOUR code may handle it, but your tooling may not. If my code formatter breaks on spaces, I'm not going to change the formatter.
You could submit a PR to their repo.
I could submit a PR to 5 tools a week on average. I actually have the time and resources to do it once a year.
Last week I opened a ticket for a Firefox bug. Following up on the bug took me 2 hours in total.
FOSS is not free, you pay it with your time. And as with everything you pay for, we all have a budget.
Somewhat related to injecting unusual characters, in my experience in localization efforts:
Inject a Turkish 'I'. I don't know how to type or paste it here, but picture an English lower case 'i' that is upper case. It is a splendid way among many to shake out some loc bugs.
İ
From https://en.wikipedia.org/wiki/%C4%B0
That would only shake out anything if you'd also test in a Turkish locale, wouldn't it? Since Unicode casing rules are locale-dependent and en-US doesn't care much about dotless i or dotted i.
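Partly, though it can still surface bugs without a Turkish locale. A small Python sketch: Python's `str.lower()` and `str.upper()` use locale-independent Unicode mappings, so this shows the default behaviour only, not what a Turkish-locale-aware library would do.

```python
dotted_capital_i = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I

# Lowercasing İ yields two code points: "i" plus COMBINING DOT ABOVE.
lowered = dotted_capital_i.lower()
print(len(lowered), [hex(ord(c)) for c in lowered])  # 2 ['0x69', '0x307']

# Uppercasing ı yields a plain ASCII "I", so the round trip loses information.
print(dotless_small_i.upper())                       # I

# Plain "i" never uppercases to İ under the default mappings.
print("i".upper() == dotted_capital_i)               # False
```

Code that assumes lowercasing preserves string length, or that case conversion round-trips, can break on this input regardless of locale.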
Late '90s I worked on Java software that got installed on several Unix platforms, including Linux for IBM mainframes. When you deal with the default en/de-coding of Unicode to EBCDIC you never have trouble with Java byte encodings ever again.
Someone should provide the OneDrive/SharePoint people some of this religion.
Mysterious character requirements that do not conform with Microsoft’s OS limits, limits on the fully qualified pathname length, etc.
Let's not forget carriage returns in filenames within apps...
Even capitalization is a pain in the ass thanks to how OSes treat file names. I pretty much stick with either `file-name.ext` or `file_name.ext` exclusively now.
Today I learned that you cannot install Tailscale on Windows if the installer is inside a path with non-Latin chars.
In that case, be thorough and insert a Chinese and an Arabic character to enforce a Unicode check.
See the recent article about unicode invisible glyphs in JavaScript or bash.
Naming freedom needs a stdlib module
For those purposes I've found hyphen to be a nice substitute.
Better solution: only allow ASCII, maybe dashes, and up to twelve characters. Problem solved.
Enforce this in LDAP.
Strict convention is better than flexibility and predicting obscure edge cases that can fail.
In my case, and for many people writing desktop software, and for absolutely everybody writing open-source tools or libraries, unfortunately you can't control the environment.
Non-ASCII paths are extremely common (e.g. the user's home directory on Windows, for the large majority of users outside the English-speaking world) and spaces, punctuation and weirder characters will definitely happen when you least expect it.
Yes if you can avoid it then absolutely that's great, but I don't think most people can.
It's also not usually very difficult to deal with, as long as you actually spot the issue in the first place.
> only allow ASCII, maybe dashes, and up to twelve characters. Problem solved
...and only hire people from the exact same background as you, who will never have unusual characters or accents in their name. And also make sure not to have any users who aren't exactly like you, and conform to this very narrow requirement. Surely, excluding 90% of the world won't hurt revenue in any way.
Snarky, but I'll take it.
Use strict schema for the hardware interface, networking, physical stuff the user never sees. Microservice names don't need to be non-Latin. Database replicas, infrastructures, etc. And you're not going to piss off employees by giving them ASCII ldap/email addresses.
Use utf8mb4 or similar for storing names. Don't state "first" or "last". I've been through this rodeo too many times. You're not surprising anyone.
This is not excluding? I just use an ascii canonicalized version of my name and works fine.
UTF-8 strings aren’t reproducible anyways. User ID should be strictly for identification, be alphanumeric random string if necessary.
You can use an "ASCII-fied" version of the name, only ~27% of mine can be typed in ASCII letters that look similar but the rest is just phonetically or visually close-enough letters. This is something people did for decades and nowadays even government IDs have an ASCII-fied (well, Latin-fied) version of the name.
Ugh, we have the 15 character Active Directory limit now with hostnames, and a previous IT administration has imposed a convention that every name had to follow [prod|dev]-[ph|vm]-[service]-[nn]. So basically every production service is prod-vm-owtf-01— you get exactly four characters to actually describe what the machine does. Works great when the service is "jira" or "wiki", but there are a lot that are pretty mystical-sounding, like jkns, jwrk, cntr, hrbr, etc, where you kind of just have to know.
Do they at least allow you to set up CNAMEs?
I kind of like that honestly. No doubt you need some documentation so everyone knows what the service abbreviations are, but after you've been working there for a month you get it. Makes everything clean, consistent, and informational. You can quickly ascertain what a specific host is doing just from the name.
Ah, that's the enterprise edition.
But then your program will crash hard and unexpectedly when a user decides to save under "~/house plans" or ~/Téléchargements.
I think it's better to exercise this in CI, that's what CI is for.
there are things you can't do in .net that you need the old Registry commands for and those don't accept spaces
And yet OneDrive won't allow for spaces before or after a file name.
I spent hours trying to figure out why an entire folder suddenly stopped syncing. Turns out I accidentally added a hidden space to the end of a folder name.
Yup, their UI sucks when it comes to sync errors.
Or not, which when bugs crop up will teach the businessy types to stop putting spaces in their filenames.
The beatings will continue until morale improves?
Spaces are very useful for readability.
depends entirely what you're using to browse files
> Pro tip: rename your development directory (or even better: the workspace path in CI) to put a space and/or special characters in it.
This will also break any code in external tools that are called during the builds of your application and do not handle spaces correctly for whatever reason, thus making it so that you won't be able to successfully finish the build.
Then again, you probably shouldn't be relying on technologies like that, but when you're struggling to keep an old enterprise system alive, causing yourself more problems is not necessarily what you should do.
Still a good idea in most cases, though.