
Comment by thasso

1 day ago

This part is bewildering to me:

> Now, if you try to watch file descriptor 2000, select will loop over fds from 0 to 1999 and will read garbage. The bigger issue is when it tries to set results for a file descriptor past 1024 and tries to set that bit field in say readfds, writefds or errorfds field. At this point it will write something random on the stack eventually crashing the process and making it very hard to debug what happened since your stack is randomized.

I'm not too literate on the Linux kernel code, but I checked, and it looks like the author is right [1].

It would have been so easy to introduce a size check on the array to make sure this can't happen. The man page reads as if FD_SETSIZE differs between platforms: it states that FD_SETSIZE is 1024 in glibc, but that no upper limit is imposed by the Linux kernel. My guess is that the Linux kernel doesn't want to assume a value of FD_SETSIZE, so they leave it unbounded.

It's hard to imagine how anyone came up with this thinking it's a good design. Maybe 1024 FDs was so much at the time when this was designed that nobody considered what would happen if this limit is reached? Or were they working on a system where 1024 was the maximum number of FDs that a process can open?

[1]: The core_sys_select function checks the nfds argument passed to select(2) and modifies the fd_set structures that were passed to the system call. The function ensures that n <= max_fds (as the author of the post stated), but it doesn't compare n to the size of the fd_set structures. The set_fd_set function, which modifies the user-side fd_set structures, calls right into __copy_to_user without additional bounds checks. This means page faults will be caught and return -EFAULT, but out-of-bounds accesses that corrupt the user stack are possible.

> Maybe 1024 FDs was so much at the time when this was designed that nobody considered what would happen if this limit is reached? Or they were working on system where 1024 was the maximum number of FDs that a process can open?

The article says select is from 1983. 1024 FDs is a lot for 1983. At least in current FreeBSD, it's easy to #define the setsize to be larger if you're writing an application that needs it larger. It's not so easy to manage if you're a library that might need to select larger FDs.

Lots of socket syscalls include a size parameter, which would help with this kind of thing. But you still might buffer overflow with FD_SET in userspace.

You (and the author) are misunderstanding. These are all userspace pointers. If the process passes the kernel a buffer and tells it to access it past the end, the kernel will happily do so. It applies all the standard memory protection rules, which means that if your pointer is unmapped or unwritable, the kernel will report the error (the system call fails with EFAULT, where the process touching the memory itself would have gotten a SIGSEGV).

It's no different than creating a 1024 byte buffer and telling read() to read 2048 bytes into it.

To be fair, there's an API bug here in that "fd_set" is a fixed-size thing for historical compatibility reasons, while the kernel accepts arbitrarily large buffers now. So code cut and pasted from historical examples will have an essentially needless 1024 FD limit.

Stated differently: the POSIX select() has a fixed limit of file descriptors, while the Linux implementation is extensible. But no one uses the latter feature (because at that scale poll and epoll are much better fits) and there's no formal API for it in the glibc headers.

  • I don't get where my misunderstanding lies. Didn't I point out that the __copy_to_user call returns EFAULT if the memory is unmapped or unwritable? The problem is that some parts of the user stack may be mapped and writable although they're past the end of the fd_set structure.

    > there's no formal API for it in the glibc headers

    The author claims you can pass nfds > 1024 to select(2). If you use the fd_set structure with its size of 1024, an FD > 1023 becoming ready may then lead to memory corruption, if I understand correctly.

    • Once more, the kernel has never been responsible for managing userspace memory. If the userspace process directs the kernel to write to memory it didn't "intend" the kernel to write to, the kernel will happily do so. Think again on the example of the read() system call I mentioned. How do you propose to fix the problem there?

      The "problem", such as it is here, is that the POSIX behavior for select() (that it supports only a fixed size for fd_set) was extended in the Linux kernel[1] to allow for arbitrary file descriptor counts. But the POSIX API for select() was not equivalently extended; if you want to use this feature, you need to call it with the Linux system call API and not the stuff you find in example code or glibc headers.

      [1] To be perfectly honest I don't know if this is unique to Linux. It's a pretty obvious feature, and I bet various BSDs or OS X or whatnot have probably done it too. But no one cares because at the 1024+ FD level System V poll() is a better API, and event-based polling is better still. It's just Unix history at this point and no one's going to fix it for you.
