← Back to context

Comment by zbowling

2 years ago

And go is wrong for doing that, at least on Linux. It bypasses optimizations in the vDSO in some cases. On Fuchsia, we made direct syscalls not through the vDSO illegal and it was funny the hacks to go that required. The system ABI of Linux really isn't the syscall interface, its the system libc. That's because the C ABI (and the behaviors of the triple it was compiled for) and its isms for that platform are the linga franca of that system. Going around that to call syscalls directly, at least for the 90% of useful syscalls on the system that are wrapped by libc, is asinine and creates odd bugs, makes crash reporters heuristical unwinders, debuggers, etc all more painful to write. It also prevents the system vendor from implementing user mode optimizations that avoid mode and context switches when necessary. We tried to solve these issues in Fuchsia, but for Linux, Darwin, and hell, even Windows, if you are making direct syscalls and it's not for something really special and bespoke, you are just flat-out wrong.

> The system ABI of Linux really isn't the syscall interface, its the system libc.

You might have reasons to prefer to use libc; some software has reason to not use libc. Those preferences are in conflict, but one of them is not automatically right and the other wrong in all circumstances.

Many UNIX systems did follow the premise that you must use libc and the syscall interface is unstable. Linux pointedly did not, and decided to have a stable syscall ABI instead. This means it's possible to have multiple C libraries, as well as other libraries, which have different needs or goals and interface with the system differently. That's a useful property of Linux.

There are a couple of established mechanism on Linux for intercepting syscalls: ptrace, and BPF. If you want to intercept all uses of a syscall, intercept the syscall. If you want to intercept a particular glibc function in programs using glibc, or for that matter a musl function in a program using musl, go ahead and use LD_PRELOAD. But the Linux syscall interface is a valid and stable interface to the system, and that's why LD_PRELOAD is not a complete solution.

  • It's true that Linux has a stable-ish syscall table. What is funny is that this caused a whole series of Samsung Android phones to reboot randomly with some apps because Samsung added a syscall at the same position someone else did in upstream linux and folks staticly linking their own libc to avoid boionc libc were rebooting phones when calling certain functions because the Samsung syscall causing kernel panics when called wrong. Goes back to it being a bad idea to subvert your system libc. Now, distro vendors do give out multiple versions of a libc that all work with your kernel. This generally works. When we had to fix ABI issues this happened a few times. But I wouldn't trust building our libc and assuming that libc is portable to any linux machine to copy it to.

    • > It's true that Linux has a stable-ish syscall table.

      It's not "stable-ish", it's fully stable. Once a syscall is added to the syscall table on a released version of the official Linux kernel, it might later be replaced by a "not implemented" stub (which always returns -ENOSYS), but it will never be reused for anything else. There's even reserved space on some architectures for the STREAMS syscalls, which were AFAIK never on any released version of the Linux kernel.

      The exception is when creating a new architecture; for instance, the syscall table for 32-bit x86 and 64-bit x86 has a completely different order.

      6 replies →

  • The complication with the linux syscall interface is that it turns the worse is better up to 11. Like setuid works on a per thread basis, which is seriously not what you want, so every program/runtime must do this fun little thread stop and start and thunk dance.

    • Yeah, agreed. One of the items on my long TODO list is adding `setuid_process` and `setgid_process` and similar, so that perhaps a decade later when new runtimes can count on the presence of those syscalls, they can stop duplicating that mechanism in userspace.

You seem to be saying 'it was incorrect on Fuchsia, so it's incorrect on Linux'. No, it's correct on Linux, and incorrect on every other platform, as each platform's documentation is very clear on. Go did it incorrectly on FreeBSD, but that's Go being Go; they did it in the first place because it's a Linux-first system and it's correct on Linux. And glibc does not have any special privilege, the vdso optimizations it takes advantage of are just as easily taken advantage of by the Go compiler. There's no reason to bucket Linux with Windows on the subject of syscalls when the Linux manpages are very clear that syscalls are there to be used and exhaustively documents them, while MSDN is very clear that the system interface is kernel32.dll and ntdll.dll, and shuffles the syscall numbers every so often so you don't get any funny ideas.

> The system ABI of Linux really isn't the syscall interface, its the system libc.

Which one? The Linux Kernel doesn't provide a libc. What if you're a static executable?

Even on Operating Systems with a libc provided by the kernel, it's almost always allowed to upgrade the kernel without upgrading the userland (including libc); that works because the interface between userland and kernel is syscalls.

That certainly ties something that makes syscalls to a narrow range of kernel versions, but it's not as if dynamically linking libc means your program will be compatible forever either.

  • > That certainly ties something that makes syscalls to a narrow range of kernel versions

    I don't think that's right, wouldn't it be the earliest kernel supporting that call and onwards? The Linux ABI intentionally never breaks userland.

    • In the case where you're running an Operating System that provides a libc and is OK with removing older syscalls, there's a beginning and an end to support.

      Looking at FreeBSD under /usr/include/sys/syscall.h, there's a good number of retired syscalls.

      On Linux under /usr/include/x86_64-linux-gnu/asm/unistd_32.h I see a fair number of missing numbers --- not sure what those are about, but 222, 223, 251, 285, and 387-392 are missing. (on Debian 12.1 with linux-image-6.1.0-12-amd64 version 6.1.52-1, if it matters)

> And go is wrong for doing that, at least on Linux. It bypasses optimizations in the vDSO in some cases.

Go's runtime does go through the vDSO for syscalls that support it, though (e.g., [0]). Of course, it won't magically adapt to new functions added in later kernel versions, but neither will a statically-linked libc. And it's not like it's a regular occurrence for Linux to new functions to the vDSO, in any case.

[0] https://github.com/golang/go/blob/master/src/runtime/time_li...

Linux doesn't even have consensus on what libc to use, and ABI breakage between glibc and musl is not unheard of. (Probably not for syscalls but for other things.)

The proliferation of Docker containers seems to go against that. Those really only work well since the kernel has a stable syscall ABI. So much so that you see Microsoft switching to a stable syscall ABI with Windows 11.

It should be possible to use vDSO without libc, although probably a lot of work.

  • It's not that much work; after all, every libc needs to have its own implementation. The kernel maps the vDSO into memory for you, and gives you the base address as an entry in the auxiliary vector.

    But using it does require some basic knowledge of the ELF format on the current platform, in order to parse the symbol table. (Alongside knowledge of which functions are available in the first place.)

    • It's hard work to NOT have the damn vDSO invade your address space. Only kludge part of Linux, well, apart from Nagle's, dlopen, and that weird zero copy kernel patch that mmap'd -each- socket recv(!) for a while.

      1 reply →