Comment by solatic

1 month ago

Headline is wrong. I/O wasn't the bottleneck, syscalls were the bottleneck.

Stupid question: why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory? Seems like the simplest solution, no?

18 comments

cb321 1 month ago

One aspect of the question is that "permissions" are mostly checked at open time, and user code should check each open for failure. This was a driving inspiration for the tiny, 27-line C virtual machine in https://github.com/c-blake/batch, which lets you, e.g., synthesize a single call that mmaps a whole file https://github.com/c-blake/batch/blob/64a35b4b35efa8c52afb64... and which seems like it would also have helped the article author.

ori_b 1 month ago

It's not the syscalls. There were only 300,000 syscalls made. Entering and exiting the kernel takes 150 cycles on my (rather beefy) Ryzen machine, or about 50ns per call.

Even assuming it takes 1us per mode switch, which would be insane, you'd be looking at 0.3s out of the 17s for syscall overhead.
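
To make that arithmetic easy to check, here is a rough sketch in Python. It only plugs in the figures quoted in this thread (300,000 syscalls, 17s total, 50ns or 1us per call); the function name `overhead_seconds` is made up for illustration and nothing here is measured.

```python
# Back-of-the-envelope syscall overhead, using the figures quoted in this thread.
SYSCALLS = 300_000        # syscalls made, per the comment above
TOTAL_RUNTIME_S = 17.0    # total runtime, per the comment above


def overhead_seconds(per_call_ns: float) -> float:
    """Total time spent entering/exiting the kernel for all syscalls, in seconds."""
    return SYSCALLS * per_call_ns * 1e-9


measured = overhead_seconds(50)       # ~50 ns per call (the Ryzen estimate)
pessimistic = overhead_seconds(1000)  # 1 us per call (the deliberately bad case)

print(f"measured:    {measured:.3f} s ({measured / TOTAL_RUNTIME_S:.2%} of runtime)")
print(f"pessimistic: {pessimistic:.3f} s ({pessimistic / TOTAL_RUNTIME_S:.2%} of runtime)")
```

That works out to roughly 15ms at 50ns per call and 0.3s at 1us per call, so under either assumption mode switches alone are a small slice of the 17s.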

It's not obvious to me where the overhead is, but random seeks are still expensive, even on SSDs.

ncruces 1 month ago

Didn't test, but my guess is it's not “syscalls” but “open,” “stat,” etc; “read” would be fine. And something like “openat” might mitigate it.

levodelellis 1 month ago

Not sure, I'd like that too

You could use io_uring, but IMO that API is annoying and I remember hitting limitations. One thing you can do with io_uring is use openat (the op, not the syscall) with the dir fd (which you get from the syscall), so you can asynchronously open and read files. However, you couldn't open directories for some reason. There's a chance I'm remembering wrong.

king_geedorah 1 month ago

io_uring supports submitting openat requests, which sounds like what you want. Open the dirfd, extract all the names via readdir and then submit openat SQEs all at once. Admittedly I have not used the io_uring API myself, so I can't speak to edge cases in doing so, but it's "on the happy path" as it were.

https://man7.org/linux/man-pages/man3/io_uring_prep_open.3.h...

https://man7.org/linux/man-pages/man2/readdir.2.html

Note that the prep open man page is a (3) page. You could of course construct the SQEs yourself.

torginus 1 month ago
You have a limit of 1k simultaneous open files per process - not sure what overhead exists in the kernel that made them impose this, but I guess it exists for a reason. You might run into trouble if you open too many files at once (either the kernel kills your process, or you run into some internal kernel bottleneck that makes the whole endeavor not so worthwhile).
- dinosaurdynasty 1 month ago
  
  That's mainly for historical reasons (the select syscall can only handle fds<1024); modern programs can just set their soft limit to their hard limit and not worry about it anymore: https://0pointer.net/blog/file-descriptor-limits.html
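
For anyone wondering what "set the soft limit to the hard limit" looks like in practice, here is a minimal sketch using Python's standard `resource` module. It assumes a Unix-like system, and the helper name `raise_nofile_limit` is just for illustration. As the parenthetical above implies, this is only safe if the program never uses select.

```python
import resource


def raise_nofile_limit() -> tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE to the hard limit; return the new (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        # Raising the soft limit up to (not beyond) the hard limit
        # does not require any special privileges.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_nofile_limit())
```

The change applies to the current process and is inherited by any children it starts afterwards.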

arter45 1 month ago

>why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory?

You mean like a range of file descriptors you could use if you want to save files in that directory?

direwolf20 1 month ago

You can probably do it with io_uring, as a generic syscall batching mechanism.

paulddraper 1 month ago

io_uring can open multiple files.

justsomehnguy 1 month ago

If you don't need the security at all then yes. Otherwise you need to check every file for the permissions.

stabbles 1 month ago

What comes closest is scandir [1], which gives you an iterator of direntries, and can be used to avoid lstat syscalls for each file.

Otherwise you can open a dir and pass its fd to openat together with a relative path to a file, to reduce the kernel overhead of resolving absolute paths for each file.

[1] https://man7.org/linux/man-pages/man3/scandir.3.html
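
A rough sketch of the "open the dir once, then open relative to its fd" idea, in Python for brevity. It assumes Linux (or another Unix where `os.O_DIRECTORY` and `dir_fd=` are supported), and the helper name `read_dir_files` is made up for this example. It still issues one open per file; it only avoids re-resolving the directory's path each time.

```python
import os


def read_dir_files(path: str) -> dict[str, bytes]:
    """Read every regular file directly under `path`, opening each relative to a dir fd."""
    contents = {}
    dir_fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # os.scandir accepts a directory fd; collect the names of regular files.
        with os.scandir(dir_fd) as entries:
            names = [e.name for e in entries if e.is_file(follow_symlinks=False)]
        for name in names:
            # dir_fd= makes this an openat-style call: the name is resolved
            # relative to the already-open directory, not from the root.
            fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
            with open(fd, "rb") as f:  # closing f also closes fd
                contents[name] = f.read()
    finally:
        os.close(dir_fd)
    return contents
```

Usage would be something like `read_dir_files("/some/dir")`, which returns a mapping from file name to file contents and skips subdirectories and symlinks.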

direwolf20 1 month ago
This is a (3) man page which means it's not a syscall. Have you checked it doesn't call lstat on each file?
- stabbles 1 month ago
  
  Fair, https://www.man7.org/linux/man-pages/man2/getdents64.2.html is a better link. You'd have to call lstat when d_type is DT_UNKNOWN
zokier 1 month ago
In what way does scandir avoid stat syscalls?
- stabbles 1 month ago
  
  Because you get an iterator over `struct dirent`, which includes `d_type` for popular filesystems.
  Notice that this avoids `lstat` calls; for symlinks you may still need to do a stat call if you want to stat the target.
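
To illustrate the `d_type` point, here is a small sketch using Python's `os.scandir`, whose `DirEntry` objects carry the type information that came back with the directory listing. The function name `classify` is hypothetical. Per the Python docs, these checks normally need no extra system call, except on filesystems that report `DT_UNKNOWN`, where a stat-style call is made behind the scenes.

```python
import os


def classify(path: str) -> dict[str, str]:
    """Map each entry name in `path` to 'file', 'dir', 'symlink', or 'other'."""
    kinds = {}
    with os.scandir(path) as entries:
        for entry in entries:
            # With follow_symlinks=False these checks use the entry type
            # returned with the directory listing; an lstat-style fallback
            # only happens when the filesystem reports DT_UNKNOWN.
            if entry.is_symlink():
                kinds[entry.name] = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kinds[entry.name] = "dir"
            elif entry.is_file(follow_symlinks=False):
                kinds[entry.name] = "file"
            else:
                kinds[entry.name] = "other"
    return kinds
```

Symlinks are reported as such without being followed; finding out what a symlink points to would still take a stat call on the target, as noted above.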