Comment by branko_d

4 years ago

I have an uneasy feeling whenever I see a path parameter declared as string. Path is not a string - it's a sequence of path components and should be treated as such by our APIs. A path should be parsed once - on user input - and then used in its "sequence form" throughout the software stack.

And "path component" is not an arbitrary string either - e.g. appending a path component to the path should first require converting/parsing the string into the path component, and only if that's successful appending it to the path.
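A minimal sketch of what "parse once, then use the sequence form" could look like. The names `PathComponent`, `ParsedPath` and `InvalidComponent` are made up for illustration, and the rules are simplified POSIX-style ones (repeated slashes are collapsed, `.` and `..` are kept as ordinary components):

```python
from dataclasses import dataclass


class InvalidComponent(ValueError):
    """Raised when a string cannot be a single path component."""


@dataclass(frozen=True)
class PathComponent:
    name: str

    @classmethod
    def parse(cls, raw: str) -> "PathComponent":
        # A component is one directory entry name: non-empty, no "/" and no NUL.
        if raw == "" or "/" in raw or "\0" in raw:
            raise InvalidComponent(f"not a valid path component: {raw!r}")
        return cls(raw)


@dataclass(frozen=True)
class ParsedPath:
    absolute: bool
    components: tuple

    @classmethod
    def parse(cls, raw: str) -> "ParsedPath":
        # Parse user input exactly once, at the boundary.
        if raw == "":
            raise ValueError("empty path")
        parts = [p for p in raw.split("/") if p != ""]
        return cls(raw.startswith("/"), tuple(PathComponent.parse(p) for p in parts))

    def append(self, raw_component: str) -> "ParsedPath":
        # Appending goes through component parsing, so "a/b" or "" is rejected.
        component = PathComponent.parse(raw_component)
        return ParsedPath(self.absolute, self.components + (component,))

    def __str__(self) -> str:
        joined = "/".join(c.name for c in self.components)
        return "/" + joined if self.absolute else (joined or ".")
```

With this, `ParsedPath.parse("/etc/ssh").append("sshd_config")` succeeds, while `.append("../x")` raises `InvalidComponent` because the string contains a separator.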


jerf 4 years ago

"Path is not a string - it's a sequence of path components and should be treated as such by our APIs."

For maximum correctness, you want to turn it into a file handle as soon as possible, and do all operations through the variations of the file functions that end in "at", like: https://linux.die.net/man/2/openat

The downside of this approach is that you still technically have to carry the path around with you if you ever want to present it back to the user, because once you have a directory handle, you can get back to the root directory easily enough by following parent links and seeing what directories you end up in, but that may not be what the user "thinks" the path is, and they want to see their path, not a canonicalized one. And they're mostly right. And it's not easy to correctly track changes to their intended path from this basis either.

Basically, I don't know of a really solid, 100% correct way to handle this with any reasonable degree of effort.
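For illustration, here is roughly what the `openat` style looks like from Python, where `os.open` accepts a `dir_fd` argument on platforms that support it. This is a sketch assuming a POSIX system; `open_dir`, `read_relative` and `demo_openat` are hypothetical helper names:

```python
import os
import tempfile


def open_dir(path: str) -> int:
    """Resolve the directory path once and return a descriptor for it."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def read_relative(dir_fd: int, name: str) -> bytes:
    """Open `name` relative to an already-open directory, like openat(2)."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, "rb") as f:  # the file object now owns fd and closes it
        return f.read()


def demo_openat() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "note.txt"), "wb") as f:
            f.write(b"hello")
        dir_fd = open_dir(tmp)
        try:
            # From here on the directory's path string is no longer consulted.
            return read_relative(dir_fd, "note.txt")
        finally:
            os.close(dir_fd)
```

Note that nothing here remembers what the user originally typed, which is exactly the downside described above.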

jmull 4 years ago

> For maximum correctness, you want to turn it into a file handle as soon as possible
That's not right. You want to resolve a file/folder path to a file/folder at the exact point it makes sense.
It's a problem if you're using a path when you wanted the file. The file can be switched/modified out from underneath you.
It's also a problem if you've got the file when you only wanted a reference. Now you can't simply switch/modify the file independent of the reference. E.g., maybe you want config file changes to take effect immediately and transparently.
You can also have the hybrid case, e.g., where you want the folder directly, but have a relative path to a file that is resolved late.
If you're unsure, I'd err on the side of late resolution.
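A small sketch of the difference between early and late resolution, assuming POSIX semantics (replacing a file by rename leaves already-open handles pointing at the old file). The function name and file names are invented for the demo:

```python
import os
import tempfile


def demo_early_vs_late_resolution() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        config_path = os.path.join(tmp, "app.conf")
        with open(config_path, "w") as f:
            f.write("version=1")

        # Early resolution: the handle is bound to the file that exists right now.
        early = open(config_path)

        # Someone replaces the config atomically (write a new file, rename over the old).
        new_path = os.path.join(tmp, "app.conf.new")
        with open(new_path, "w") as f:
            f.write("version=2")
        os.replace(new_path, config_path)

        seen_by_handle = early.read()
        early.close()

        # Late resolution: the path is resolved at the moment of use.
        with open(config_path) as f:
            seen_by_path = f.read()

        return seen_by_handle, seen_by_path
```

On a POSIX system the handle still sees the old contents while the path sees the new ones. Which of the two is "correct" depends on whether you wanted the file or the reference.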
Pxtl 4 years ago

"you want to turn it into a file handle as soon as possible"
But no sooner.
For example, I've run into problems where I'm configuring server program A to serve file location B, but I don't have access to file location B myself. The client-side library for talking to the server tries to convert location B into a file handle and then freaks out because I can't access it, when I don't want to access it at all. I want that program to serve it.
If it was using simple "path" objects that didn't confirm that I have access to the path, everything would be hunky dory. But because it tried to convert it into a file handle unnecessarily, I get blocked.
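Python's `pathlib.PurePosixPath` is one example of such a "simple path object": it parses and manipulates paths without ever touching the filesystem. A sketch, with a made-up helper name and a made-up path:

```python
from pathlib import PurePosixPath


def describe_remote_location(raw: str) -> dict:
    # Pure paths do no I/O, so this works even if the location is
    # unreachable (or does not exist) from the machine running this code.
    p = PurePosixPath(raw)
    return {"parent": str(p.parent), "name": p.name, "suffix": p.suffix}
```

For instance, `describe_remote_location("/srv/exports/B/data.csv")` returns the parent, name and suffix without checking whether the caller can access `/srv/exports/B`.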
tmerr 4 years ago

Another inconvenience with this approach is that you can keep thousands of paths in memory no problem. But thousands of FDs may cause you to exceed per-process limits.
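To see how tight that limit is on a given machine, something like the following works on Unix-like systems (the `resource` module is not available on Windows). `fd_budget` and the `reserve` default of 64 are arbitrary choices for this sketch:

```python
import resource


def fd_budget(reserve: int = 64):
    """Return how many descriptors we could hold open, or None if unlimited.

    `reserve` leaves headroom for stdio, sockets, libraries and so on.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return max(0, soft - reserve)
```

A common default soft limit is in the low thousands, so a cache of "one open handle per path" can run out quickly where a list of path strings would not.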
globular-toast 4 years ago
This goes for most instances of user input. Timestamps are the other common one people get wrong. I've even seen programs that pass around timestamps as strings in multiple formats and as integers (Unix time).
- aqfamnzc 4 years ago
  
  As a programming noob, I'm wondering what would be the better way to pass or return a unix time value as opposed to an integer?
  
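One possible answer to the question above, sketched in Python: convert the integer to a timezone-aware `datetime` at the boundary and pass that around, converting back only when an external interface needs Unix time. The function names are illustrative:

```python
from datetime import datetime, timezone


def parse_unix_time(raw: int) -> datetime:
    """Turn Unix time (seconds since the epoch) into an aware UTC datetime."""
    return datetime.fromtimestamp(raw, tz=timezone.utc)


def to_unix_time(moment: datetime) -> int:
    """Serialize back to an integer only at the edge of the program."""
    if moment.tzinfo is None:
        raise ValueError("refusing a naive datetime: its timezone is ambiguous")
    return int(moment.timestamp())
```

The point is the same as with paths: a bare integer does not say whether it is seconds or milliseconds, or whether it is a timestamp at all, while the typed value does.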
BoorishBears 4 years ago

> For maximum correctness, you want to turn it into a file handle as soon as possible
This is why I get stressed out when I see paths turned into special objects encoding separators and such.
It tells me the path is living for way too long compared to the file handle.
I only want to see path-specific objects if we're modifying the path, and even then I want that to happen as late as possible.
aspaceman 4 years ago

Why not just hold onto both? The user's representation and the file handle. Only ever "display" the representation, while you do all operations on the handle. (Not trying to be sarcastic, just curious).
cerved 4 years ago

doesn't this lock the file?
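A sketch of the "hold onto both" idea, assuming a POSIX system, where merely keeping a descriptor open does not by itself lock the file (others can still write, rename or unlink it). `OpenedFile` and its members are hypothetical names:

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class OpenedFile:
    display_path: str  # exactly what the user typed; shown, never re-resolved
    fd: int            # what all operations actually go through

    @classmethod
    def from_user_path(cls, user_path: str) -> "OpenedFile":
        return cls(user_path, os.open(user_path, os.O_RDONLY))

    def read_all(self) -> bytes:
        size = os.fstat(self.fd).st_size
        return os.pread(self.fd, size, 0)

    def close(self) -> None:
        os.close(self.fd)


def demo_hold_both() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "data.txt"), "wb") as f:
            f.write(b"payload")
        user_typed = os.path.join(tmp, ".", "data.txt")  # non-canonical spelling
        handle = OpenedFile.from_user_path(user_typed)
        try:
            return handle.display_path == user_typed, handle.read_all()
        finally:
            handle.close()
```

This keeps the user's spelling for display, but it does not solve the harder problem mentioned earlier of tracking later changes to the path the user had in mind.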

dahfizz 4 years ago

> I have an uneasy feeling whenever I see a path parameter declared as string. Path is not a string

I guess that depends on what you mean by "string". `open` and `fopen` need a char* path to open a file. Whatever fancy Path abstraction you use eventually becomes a char* string, because that's what the kernel needs.
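Python makes this conversion visible: any path-like object can be lowered to the bytes that end up in the system call with `os.fsencode`. A tiny sketch (the wrapper name is made up):

```python
import os
from pathlib import PurePosixPath


def to_kernel_bytes(path) -> bytes:
    # Accepts str, bytes, or any os.PathLike and returns the byte string
    # that would be handed to the underlying open()/openat() call.
    return os.fsencode(path)
```

So `to_kernel_bytes(PurePosixPath("/etc/hosts"))` gives `b"/etc/hosts"`: the abstraction exists above the boundary, the byte string below it.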

VWWHFSfQ 4 years ago
yeah. it's a string.
- dwheeler 4 years ago
  
  On POSIX systems file names are not strings, they are sequences of bytes. They might not be UTF-8 or have any meaning. Python 3 had to hack around this: they thought they could force everything to Unicode and discovered that doesn't work.
  
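The workaround Python 3 ended up with is the `surrogateescape` error handler (PEP 383), which `os.fsdecode` and `os.fsencode` use on POSIX. The sketch below applies it explicitly with UTF-8 so the behaviour does not depend on the local filesystem encoding; the helper names are illustrative:

```python
def decode_name(raw: bytes) -> str:
    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of errors.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name: str) -> bytes:
    # The lone surrogates are turned back into the original bytes.
    return name.encode("utf-8", errors="surrogateescape")
```

A Latin-1 file name such as `b"caf\xe9.txt"` is not valid UTF-8, yet it survives the round trip unchanged. The resulting `str` is not valid Unicode text, though, and will fail if you try to encode it strictly, for example when printing.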

anyfoo 4 years ago

Strings following certain rules are entirely valid representations of paths, just like sequences of path components in the chosen language/framework are. Similarly, the sequences of bits that make up the sequences of your language/framework in memory are an entirely valid representation of said sequences of components.

Yes, paths have structure, but saying "a path is not a string" is equivalent to saying "C source code is not a string". Both are strings, and both are something else, represented by strings according to rules. Different internal representations have different advantages and disadvantages. I fully agree that for things such as "adding components" an internal sequence/list representation is better, but strings can pass arbitrary IPC or even ABI boundaries much more easily, for example. (And you wouldn't bat an eye when you see FQDNs like "www.google.com" passed as a string instead of as ["www","google","com"], because the string representation works pretty well.)

fouric 4 years ago
C source code and paths are both representable by strings, true, but the fact that they're not actually strings is still important, because most people don't know that, and in the case of paths that leads to a lot of edge cases (in the case of source code it leads to a bunch of inefficient and weak tooling, which isn't quite as bad).
Because neither is a string, their native representation shouldn't be one either - it should be something structured, serialized into a string representation only when necessary (IPC, FFI, serdes). This would save people a lot of time and effort.
- anyfoo 4 years ago
  
  It really depends. Do you usually keep hostnames as strings? URLs? JPEGs? Why or why not?
  Sure, a browser will hopefully quickly parse that URL and break it up, an image viewer will do the same with a JPEG. Will anything that's only interested in opening or displaying that URL or JPEG, through a library or external program?
  POSIX paths are actually remarkably simple in structure[1]. The only caveat is equality and normalization: Without normalization, a path a might be equal to a path b while their representations differ, e.g. "/etc/foo" and "/etc/bar/../foo". But this is the same whether you have a string or a list of strings, you need to normalize in whatever representation you choose to check for equality.
  [1] Almost shocking myself, even Haskell defines its primary FilePath type literally as "String".
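To illustrate the normalization point: the same lexical normalization is needed in either representation. This sketch uses `posixpath.normpath` for the string form and a hand-written helper (hypothetical name, absolute paths only) for the list form. Both are purely lexical, so they can give the wrong answer if a component such as `bar` is a symlink:

```python
import posixpath


def same_lexically(a: str, b: str) -> bool:
    """Compare two POSIX path strings after lexical normalization."""
    return posixpath.normpath(a) == posixpath.normpath(b)


def normalize_components(parts: list) -> list:
    """Normalize the components of an absolute path given as a list of names."""
    out = []
    for part in parts:
        if part in ("", "."):
            continue  # empty and "." components change nothing
        if part == "..":
            if out:
                out.pop()  # ".." at the root stays at the root
            continue
        out.append(part)
    return out
```

Here `same_lexically("/etc/foo", "/etc/bar/../foo")` is true, and `normalize_components(["etc", "bar", "..", "foo"])` yields `["etc", "foo"]`.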

SAI_Peregrinus 4 years ago

POSIX filenames allow all bytes except 0x2F (/) and 0x00 (NUL); the "fully portable" filename character set is far narrower (letters, digits, period, underscore, hyphen). That means file names can include line feeds, backspaces, EOT (Ctrl-D), etc.

"This is `a

perfectly vali'd.\010! file name\377, despite the weirdness"
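A quick way to check the claim in code, treating a file name as bytes. The validator name is made up, and the 255-byte limit is an assumption (a typical `NAME_MAX` on Linux), not something POSIX fixes:

```python
def is_valid_posix_filename(name: bytes, name_max: int = 255) -> bool:
    """True if `name` could be a single directory entry name."""
    return 0 < len(name) <= name_max and b"/" not in name and b"\0" not in name


# The example above as a bytes literal: \010 is backspace, \377 is byte 0xFF.
WEIRD = b"This is `a\n\nperfectly vali'd.\010! file name\377, despite the weirdness"
```

`is_valid_posix_filename(WEIRD)` is true even though the name contains newlines, a backspace and a byte that is not valid UTF-8, while `b"a/b"` and the empty name are rejected.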

naikrovek 4 years ago

things like this are why the Unix philosophy is so bad.

text processing is hard if you must support Unicode, and that means every Unix command line tool must implement or employ a text processor to handle input. it would be much easier if objects were passed back and forth. PowerShell got this right.