Comment by JoshTriplett
2 years ago
> The system ABI of Linux really isn't the syscall interface, its the system libc.
You might have reasons to prefer to use libc; some software has reason to not use libc. Those preferences are in conflict, but one of them is not automatically right and the other wrong in all circumstances.
Many UNIX systems did follow the premise that you must use libc and the syscall interface is unstable. Linux pointedly did not, and decided to have a stable syscall ABI instead. This means it's possible to have multiple C libraries, as well as other libraries, which have different needs or goals and interface with the system differently. That's a useful property of Linux.
There are a couple of established mechanism on Linux for intercepting syscalls: ptrace, and BPF. If you want to intercept all uses of a syscall, intercept the syscall. If you want to intercept a particular glibc function in programs using glibc, or for that matter a musl function in a program using musl, go ahead and use LD_PRELOAD. But the Linux syscall interface is a valid and stable interface to the system, and that's why LD_PRELOAD is not a complete solution.
It's true that Linux has a stable-ish syscall table. What is funny is that this caused a whole series of Samsung Android phones to reboot randomly with some apps because Samsung added a syscall at the same position someone else did in upstream linux and folks staticly linking their own libc to avoid boionc libc were rebooting phones when calling certain functions because the Samsung syscall causing kernel panics when called wrong. Goes back to it being a bad idea to subvert your system libc. Now, distro vendors do give out multiple versions of a libc that all work with your kernel. This generally works. When we had to fix ABI issues this happened a few times. But I wouldn't trust building our libc and assuming that libc is portable to any linux machine to copy it to.
> It's true that Linux has a stable-ish syscall table.
It's not "stable-ish", it's fully stable. Once a syscall is added to the syscall table on a released version of the official Linux kernel, it might later be replaced by a "not implemented" stub (which always returns -ENOSYS), but it will never be reused for anything else. There's even reserved space on some architectures for the STREAMS syscalls, which were AFAIK never on any released version of the Linux kernel.
The exception is when creating a new architecture; for instance, the syscall table for 32-bit x86 and 64-bit x86 has a completely different order.
I think what they meant (judging by the example you ignored) is that the table changes (even if append-only) and you don't know which version you actually have when you statically compile your own version. Thus, your syscalls might be using a newer version of the table but it a) not actually be implemented, or b) implemented with something bespoke.
5 replies →
The complication with the linux syscall interface is that it turns the worse is better up to 11. Like setuid works on a per thread basis, which is seriously not what you want, so every program/runtime must do this fun little thread stop and start and thunk dance.
Yeah, agreed. One of the items on my long TODO list is adding `setuid_process` and `setgid_process` and similar, so that perhaps a decade later when new runtimes can count on the presence of those syscalls, they can stop duplicating that mechanism in userspace.