Comment by anarazel

1 day ago

It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...

I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.

The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1, and you need overcommit.

Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.

  • CoW is probably a good idea whether you use fork or not. Or rather, fork is probably a better option than just exec exactly because it can benefit from CoW.

    At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.

    • CoW is probably a good idea regardless, yeah. Overcommit is more questionable. Regardless, both ought to be argued based on their own merits. It's unfortunate that both are necessary as a consequence of fork().

      6 replies →

  • > The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1

    It may not be slow, but for the common case where fork is almost immediately followed by exec in the process where fork returns zero fork increases those refcounts and exec almost immediately decreases them again hand does typically unnecessary checks whether refcounts became zero). A combined fork/exec syscall can avoid that work.

    On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.

    That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.

    I expect either approach should be faster then having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.

    • Another possible design is instead of forking the current process, you create a new empty process, then the parent calls syscalls to set up the new process, and eventually call exec on the child process. That does mean you either need new syscalls for that, or adapt existing syscalls to take a pidfd as an argument. That also solves some other problems with fork/exec where the default is to inherit a lot of things you probably don't want. With this, you can opt in to inheritance instead of having to opt out.

      Or you could create a hybrid between a thread and a process, where it still uses the parent's memory space (unlike fok), but has it's own stack (unlike vfork), and is in its own process (unlike a thread). I think this is technically possible on linux, but there isn't a readily available interface for it. Although it seems like posix_spawn could be implemented that way...

      5 replies →

  • With large enough processes, like say a server JVM process that uses 10s of GBs of RAM, even just copying the page tables for CoW can be slow. And unless you have aggressive overcommit settings you can get an OOM on fork, even if you're just going to exec something small.

    vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.

    • vfork() helps a LOT. The restrictions on what you can do on the child-side of vfork() are pretty much the same ones as for fork() + you must not do anything to damage the stack frame of the vfork() caller (i.e., you can't return).

  • In addition to what you said: forking from a process running on multiple cores is slow once you have mark all pages as read-only and shoot this out to all cores. TLB synchronization is super expensive. Unix originally didn't support threads (want concurrency? just fork!) but with modern multicore that's clearly unsustainable.

  • The nice thing about fork+exec is that's its simple and flexible.

    To avoid the problems, see roc's comment under the article. Esp use of a zygote process.

  • Didn't he just say that fork turns out to be comparatively faster to the non-fork samples we get? Ie Linux spawns processes faster than Microsoft's kernels?

    • Didn't I just say that "the problem with fork isn't really that it's slow"? It's all the other OS design choices it forces on you if you want it to be fast.

      1 reply →

    • We don't have any broadly used non-fork samples. Windows, macOS, and Linux all have fork. So the presence of fork can't be the reason for the performance difference.

      (Windows's fork is called ZwCreateProcess)

      5 replies →

  • One os level thing that is interesting to me is if it would be possible/wise to make an OS based on (concurrent) garbage collection.

  • > The problem with fork isn't really that it's slow.

    Did someone suggest that it was?

    • anarazel's comment focuses entirely on performance, indicating that they have an impression that the discussion about why fork is bad is about performance. I'm not entirely sure where this impression came from, as it's not mentioned in rom1v's quote nor a point in the linked paper, "A fork() in the road".

Because that OS best practices is to use threads.

Traditionally Windows applications that create processes all the time come from UNIX heritage.

Contrary to UNIX, Windows NT was designed with threads first mentality, from the get go.

While on UNIX they were added after fact, and to this day there are gotchas mixing posix threads with signals, fork and exec.

  • A more accurate way to describe this is that Windows' (NT onward) core execution context model is a bunch of threads that by default share memory, whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

    Both systems are implemented using threads as the execution context, but in Unix, the history means that that you fork+exec most of the time, resulting in a two tasks that do not share memory any more. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.

    Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like its 1986 and use fork without exec, or use clone(3) or any of its higher level abstractions like pthreads.

    You're right that POSIX semantics get tangled when using threads.

    • That's actually less accurate, not more. It's a post-hoc revision that conflates Unix with Linux.

      The Unix model was invented over a decade before the idea of multithreading percolated into mainstream operating systems at all.

      The reason that Windows NT started as it did, was that OS/2 had come out in 1987, with kernel threads, and the idea of multithreading had taken root. SunOS 5 gained threading, too.

      Windows NT applications development began with threading available as a mechanism from the start, and with a lot of people in the IBM/Microsoft world already knowing about its use in applications development from OS/2.

      Whereas with the Unices it came in more gradually, as the applications had often already been designed. The whole libthread versus libpthread thing made things interesting on SunOS for a few years, too. As did the first attempt (LinuxThreads) at providing threads on Linux.

      3 replies →

    • Well, Windows before NT isn't the same design as Windows 16 bit, it only shares the name for all practical purposes, and has more influence from OS/2 than Windows 16 bit.

      Which is why I took the effort to explicitly refer to Windows NT on my comment, already expecting some traditional answers from UNIX folks.

      Also due to historical reasons POSIX threads are the outcome of every UNIX going their own way implementing threads, finally coming to an agreement years later, with all the plus and minus of relying in POSIX for portable code.

    • whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

      How are those not simply child processes? I don't understand your use of the word 'threads' here.

      Does the Unix world not distinguish between threads and processes? In Win32, threads exist within processes, and you can create new threads or child processes.

      4 replies →

  • The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.

    • True, but on Windows the approach is then to use COM servers, which have a faster IPC model, and can even serve multiple clients, depending on how the appartement space is configured.

      9 replies →

  • the only difference between a thread and a process on linux is how many structures they share. the function is identical.

  • Windows was designed with threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.

    • This is not true. NT never had fork, was always based on the assumption of an MMU and Dave Cutler was a well known fork hater in the 80s long before this paper came out and made it cool to be so. By the time Windows 95 was out, the baseline was 386 with an MMU. CreateThread was initially designed for NT in 1993 though (which didn’t support pre-386 CPUs).

      5 replies →

    • NT was designed to be platform-agnostic, and its original target was the DEC Alpha. Its process model owes nothing to pre-386 CPUs. The WinAPI CreateProcess function is a layer atop NtCreateProcess, so that is where the pre-386 heritage lives. But even the WinAPI process model changed significantly with 32-bit Windows.

      2 replies →

I suspect it's a long tail sort of thing; it mostly doesn't matter except when it really matters. It's interesting that the stated motivation for the patch is in the context of agentic tools spawning subcommands. There's some related prior art in this area where the payoffs could be much greater, like fuzzing: https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf is an example. It would be very interesting to see this patch applied to e.g. AFL++

That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.

  • Again, NtCreateProcess does not implement fork(). The fundamental characteristic of fork is that the child is an exact replica of the parent, down to the instruction pointer. Windows does not have a way to create a process object with such a configuration.

    Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.

    • Okay but people don't claim that copying the instruction pointer (a single machine register) is the reason for any speed difference. They claim it's due to the memory sharing. And that's easily disproven since you can share pages, just like on Linux, simply by passing null for the section handle, yet there's still a performance difference.

      Why does it matter which prefix I used? They both point to the same routine so my point applies either way.

    • It's a completely uncontroversial fact that NT does implement fork(). Turn to page 183 of Helen Custer's "Inside Windows NT" and you will read about it.