Comment by solatic
14 hours ago
Headline is wrong. I/O wasn't the bottleneck, syscalls were the bottleneck.
Stupid question: why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory? Seems like the simplest solution, no?
One aspect of the question is that "permissions" are mostly checked at open time, and user code should check for failures. This was a driving inspiration for the tiny 27-line C virtual machine in https://github.com/c-blake/batch that allows you to, e.g., synthesize a single call that mmaps a whole file https://github.com/c-blake/batch/blob/64a35b4b35efa8c52afb64... which seems like it would have also helped the article author.
It's not the syscalls. There were only 300,000 syscalls made. Entering and exiting the kernel takes 150 cycles on my (rather beefy) Ryzen machine, or about 50ns per call.
Even assuming it takes 1us per mode switch, which would be insane, you'd be looking at 0.3s out of the 17s for syscall overhead.
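To make the arithmetic above concrete, here is a quick back-of-the-envelope sketch. The 300,000 calls, the 17s runtime and the two per-call costs are the figures quoted in this thread; the function name is only illustrative.

```python
# Figures quoted in the thread; nothing here is measured.
SYSCALLS = 300_000
TOTAL_RUNTIME_S = 17.0


def overhead_seconds(per_call_ns):
    """Total time spent entering/exiting the kernel at a given per-call cost."""
    return SYSCALLS * per_call_ns / 1e9


for label, ns in (("~50ns per call", 50), ("pessimistic 1us per call", 1000)):
    secs = overhead_seconds(ns)
    print(f"{label}: {secs:.3f}s ({secs / TOTAL_RUNTIME_S:.1%} of the 17s)")
```

Either way the mode-switch cost is a small slice of the total, which is the point being made: whatever is slow, it is not the bare kernel entry/exit.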
It's not obvious to me where the overhead is, but random seeks are still expensive, even on SSDs.
Didn't test, but my guess is it's not “syscalls” but “open,” “stat,” etc; “read” would be fine. And something like “openat” might mitigate it.
io_uring supports submitting openat requests, which sounds like what you want. Open the dirfd, extract all the names via readdir and then submit openat SQEs all at once. Admittedly I have not used the io_uring API myself so I can't speak to edge cases in doing so, but it's "on the happy path" as it were.
https://man7.org/linux/man-pages/man3/io_uring_prep_open.3.h...
https://man7.org/linux/man-pages/man2/readdir.2.html
Note that the prep open man page is a (3) page. You could of course construct the SQEs yourself.
You have a limit of 1k simultaneously open files per process - not sure what overhead exists in the kernel that made them impose this, but I guess it exists for a reason. You might run into trouble if you open too many files at once (either the kernel kills your process, or you run into some internal kernel bottleneck that makes the whole endeavor not so worthwhile).
That's mainly for historical reasons (select syscall can only handle fds<1024), modern programs can just set their soft limit to their hard limit and not worry about it anymore: https://0pointer.net/blog/file-descriptor-limits.html
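For what "set the soft limit to the hard limit" looks like in practice, here is a minimal sketch using Python's `resource` module, assuming a Unix-like system. The helper name `raise_nofile_limit` is made up for illustration, and the try/except is my own caution for platforms that reject the request.

```python
import resource


def raise_nofile_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit; return the new (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms refuse this (e.g. an "infinite" hard limit);
            # in that case the old soft limit is left in place.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_nofile_limit())
```

As the linked post explains, this is only safe if the program never uses select() on high-numbered fds.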
Not sure, I'd like that too
You could use io_uring but IMO that API is annoying and I remember hitting limitations. One thing you could do with io_uring is using openat (the op not the syscall) with the dir fd (which you get from the syscall) so you can asynchronously open and read files, however, you couldn't open directories for some reason. There's a chance I may be remembering wrong
You can probably do it with io_uring, as a generic syscall batching mechanism.
>why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory?
You mean like a range of file descriptors you could use if you want to save files in that directory?
io_uring can open multiple files.
If you don't need the security at all then yes. Otherwise you need to check every file for the permissions.
What comes closest is scandir [1], which gives you an iterator of direntries, and can be used to avoid lstat syscalls for each file.
Otherwise you can open a dir and pass its fd to openat together with a relative path to a file, to reduce the kernel overhead of resolving absolute paths for each file.
[1] https://man7.org/linux/man-pages/man3/scandir.3.html
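A rough sketch of the "open the dir once, then openat relative names" idea, assuming Linux and Python's `dir_fd=` support in `os.open` (which maps to openat). The function name `read_all_in_dir` is hypothetical, and skipping anything that is not a regular file is my own simplification.

```python
import os
import stat


def read_all_in_dir(path):
    """Read every regular file in `path`, opening each one relative to a dir fd."""
    dir_fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    contents = {}
    try:
        for name in os.listdir(dir_fd):
            try:
                # openat(dir_fd, name, ...): only a single path component is resolved.
                # O_NOFOLLOW rejects symlinks; O_NONBLOCK avoids hanging on FIFOs.
                fd = os.open(
                    name,
                    os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK,
                    dir_fd=dir_fd,
                )
            except OSError:
                continue  # permission denied, symlink, vanished file, ...
            try:
                if not stat.S_ISREG(os.fstat(fd).st_mode):
                    continue  # skip subdirectories and special files
                chunks = []
                while chunk := os.read(fd, 1 << 16):
                    chunks.append(chunk)
                contents[name] = b"".join(chunks)
            finally:
                os.close(fd)
    finally:
        os.close(dir_fd)
    return contents
```

Note this still makes one open call per file; it only trims the path-resolution work, it does not batch anything.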
This is a (3) man page which means it's not a syscall. Have you checked it doesn't call lstat on each file?
Fair, https://www.man7.org/linux/man-pages/man2/getdents64.2.html is a better link. You'd have to call lstat when d_type is DT_UNKNOWN.
In what way does scandir avoid stat syscalls?
Because you get an iterator over `struct dirent`, which includes `d_type` for popular filesystems.
Notice that this avoids `lstat` calls; for symlinks you may still need to do a stat call if you want to stat the target.
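The same d_type trick is exposed by Python's `os.scandir`, so here is a small sketch of it. The function name `list_regular_files` is just for illustration; per the Python docs, `is_file(follow_symlinks=False)` normally needs no extra system call, and falls back to a stat only when the filesystem reports an unknown type.

```python
import os


def list_regular_files(path):
    """Names of regular files in `path`, classified from the directory entries."""
    names = []
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False: a symlink counts as a symlink, not as its
            # target, so no stat of the target is needed to classify it.
            if entry.is_file(follow_symlinks=False):
                names.append(entry.name)
    return sorted(names)
```

If you do want to know what a symlink points at, call `entry.is_file()` with the default `follow_symlinks=True`, which is where the extra stat comes back in.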