Comment by rom1v

1 day ago

Related to the discussion: "A fork() in the road": https://www.microsoft.com/en-us/research/wp-content/uploads/...

> ABSTRACT

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.

> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.

This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.
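
For reference, the fork-then-exec sequence described above looks roughly like this from user space today. This is a minimal sketch in Python, assuming a POSIX system; `run_via_fork_exec` is a hypothetical helper name, and nothing here reflects how the original PDP-era kernel implemented the swap.

```python
import os
import sys


def run_via_fork_exec(argv):
    """Create a child with fork(), replace its image with exec(), return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: at this point it is still a copy of the parent.
        try:
            # On success exec never returns; the program image is replaced.
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed (for example, program not found).
            os._exit(127)
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Calling `run_via_fork_exec([sys.executable, "-c", "pass"])` runs a fresh interpreter in the child while the parent simply waits.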

QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.

  • I think fork() is more of a PDP-7 mistake than a PDP-11 mistake. On the original UNIX system, memory was so limited that the only sane partitioning was to write the running program's memory image to disk, then reuse the running image as the child. An immediate consequence is the UNIX I/O model, where disk I/O is always synchronous (can't swap processes while waiting for disk I/O because swapping processes requires disk I/O). Anyway, as soon as the UNIX group got a PDP-11, the model broke down, because they had enough memory for multiple processes, but fork() didn't allow them to run concurrently, because their first PDP-11 didn't have an MMU. So they whined until they got one with an MMU instead of fixing their broken design.

  • > It was needed back in the era of really expensive memory.

    Well, it seems we are back in an era with really expensive memory.

  • The QNX approach is also pretty much how the dynamic linker loads shared libraries in Linux today.

    “An era of really expensive memory”. That sounds familiar…

    • I think GP was saying that in QNX the spawning process was responsible for dynamically linking its child process before running it. With Linux, I think it's the spawned process taking care of its own dynamic linking.

  • > > The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

    > No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program.

    Ironically vfork() is even better in this regard. I wish Unix had only ever had vfork().

  • It is almost as if you agree with the authors...

    "In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"

    (But thanks for the good explanation)

  • But why is having a pair of separate independent operations, fork and exec, required to achieve this? A single fexec call could be implemented to work in the way you describe, no?

  • Don’t pretty much all OSes implement process startup in userspace? On macOS, the kernel creates a process with an image of dyld and points it at dyld_start, which actually takes care of parsing the Mach-O header. I assumed ld.so does the same job on Linux.

  • Cygwin's fork() is similar to what you describe for QNX.

    • It's a fairly widespread idea for architectures that try to move things out of kernel mode. The Hurd does program image file loading in userspace, too, in its exec server(s).

      The tricky part is setting up the initial process. The way out for that is static linking and re-use of the fact that the operating system kernel loader has to understand and be able to load (at least a small subset of) program image file formats too.

  • > It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.

    AIUI this is what exec does. The problem outlined here is the split between process creation (expensive, kernel space, has to be done each time even if spawning the same process "template" repeatedly) and loading (cheap and in userspace).

It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...

I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.
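
One non-fork primitive that already exists on POSIX systems is posix_spawn(), which takes the child's setup (file descriptor actions, signal masks and so on) as parameters instead of running code between fork and exec. A rough sketch using Python's `os.posix_spawn` wrapper follows; `spawn_with_stdout_pipe` is a made-up helper name, and whether the underlying libc implements posix_spawn with fork, vfork or clone is platform-dependent, so this says nothing about performance.

```python
import os
import sys


def spawn_with_stdout_pipe(argv):
    """Spawn argv[0] (a full path) without calling fork() directly; capture its stdout."""
    read_fd, write_fd = os.pipe()
    # Child-side setup is declared up front instead of being coded after fork().
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_fd, 1),  # child's stdout becomes the pipe
        (os.POSIX_SPAWN_CLOSE, read_fd),     # child does not need the read end
        (os.POSIX_SPAWN_CLOSE, write_fd),    # original write end no longer needed
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    # Parent: close its copy of the write end so read() sees EOF when the child exits.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        output = reader.read()
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_code, output
```

The list of file actions is the kind of parameter growth discussed elsewhere in this thread: every piece of child setup needs its own declarative knob.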

  • The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1, and you need overcommit.

    Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.

    • CoW is probably a good idea whether you use fork or not. Or rather, fork is probably a better option than just exec exactly because it can benefit from CoW.

      At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.

    • > The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1

      It may not be slow, but in the common case where fork is followed almost immediately by exec in the process where fork returns zero, fork increases those refcounts and exec almost immediately decreases them again (and performs typically unnecessary checks for whether the refcounts became zero). A combined fork/exec syscall can avoid that work.

      On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.

      That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.

      I expect either approach should be faster than having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.

    • With large enough processes, like say a server JVM process that uses 10s of GBs of RAM, even just copying the page tables for CoW can be slow. And unless you have aggressive overcommit settings you can get an OOM on fork, even if you're just going to exec something small.

      vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on Unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.

    • In addition to what you said: forking from a process running on multiple cores is slow, because you have to mark all pages as read-only and shoot this out to all cores. TLB synchronization is super expensive. Unix originally didn't support threads (want concurrency? just fork!) but with modern multicore that's clearly unsustainable.

    • The nice thing about fork+exec is that it's simple and flexible.

      To avoid the problems, see roc's comment under the article, especially the use of a zygote process.

    • Didn't he just say that fork turns out to be faster than the non-fork examples we have? I.e., Linux spawns processes faster than Microsoft's kernels?

    • One OS-level thing that interests me is whether it would be possible or wise to build an OS based on (concurrent) garbage collection.

  • Because that OS's best practice is to use threads.

    Traditionally, Windows applications that create processes all the time come from a UNIX heritage.

    Contrary to UNIX, Windows NT was designed with a threads-first mentality from the get-go.

    On UNIX, threads were added after the fact, and to this day there are gotchas when mixing POSIX threads with signals, fork and exec.

    • A more accurate way to describe this is that Windows' (NT onward) core execution context model is a bunch of threads that by default share memory, whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

      Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that do not share memory any more. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.

      Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher-level abstractions like pthreads.

      You're right that POSIX semantics get tangled when using threads.

    • The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.

    • The only difference between a thread and a process on Linux is how many structures they share. The function is identical.

    • Windows was designed with threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.

  • I suspect it's a long tail sort of thing; it mostly doesn't matter except when it really matters. It's interesting that the stated motivation for the patch is in the context of agentic tools spawning subcommands. There's some related prior art in this area where the payoffs could be much greater, like fuzzing: https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf is an example. It would be very interesting to see this patch applied to e.g. AFL++

  • That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.

    • Again, NtCreateProcess does not implement fork(). The fundamental characteristic of fork is that the child is an exact replica of the parent, down to the instruction pointer. Windows does not have a way to create a process object with such a configuration.

      Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.

Fork is marvelous for the zygote pattern

Hard to come up with an optimization that is equally efficient and elegant
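
For anyone who hasn't seen it, here is a minimal sketch of the idea, assuming "zygote" means a process that pays an expensive start-up cost once and is then forked for each new worker. It is Python on a POSIX system; `expensive_init` and `fork_worker` are hypothetical names, and the lookup table is only a stand-in for real initialization work.

```python
import os


def expensive_init():
    # Stand-in for costly start-up work (loading code, warming caches).
    return {n: n * n for n in range(100_000)}


def fork_worker(table, key):
    """Fork a child that reuses the already-initialized table and reports table[key]."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the table was inherited at fork time, so no re-initialization.
        try:
            os.close(read_fd)
            os.write(write_fd, str(table[key]).encode())
        finally:
            os._exit(0)
    # Parent (the zygote): collect the worker's answer and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        result = reader.read()
    os.waitpid(pid, 0)
    return int(result)


TABLE = expensive_init()  # paid once, before any worker exists
results = [fork_worker(TABLE, k) for k in (3, 12, 99)]
```

Each worker starts with the table already populated; a spawn-based design would have to rebuild it or pass it across explicitly.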

  • The zygote pattern[1] is a great optimization to deal with the cost of forking, but IMHO, being able to inexpensively spawn a carefully tailored process regardless of the size and scope of the current process would be better.

    I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.

    [1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.

    • You're referring to something else, and maybe I'm using the term "zygote" incorrectly.

      In all uses of zygotes that I have seen, here's what's really happening:

      - `fork` is being used to reduce the cost of starting a process that has a high start-up cost. So, you start one process, run it through the expensive initialization, and then fork it from there to start new processes.

      - To make this even faster, you have a pool of pre-forked processes sit around.

      - Having pre-forked processes sitting around ready to be used is not expensive because of the CoW property and the fact that a process that forks and then immediately pauses will not have triggered any significant CoW yet.

      So, the zygote optimization you speak of is in practice only meaningful on top of systems that are using an optimization uniquely enabled by `fork` (avoiding process initialization costs by cloning a process), and that zygote optimization is further optimized by another property of `fork` (memory sharing of forked processes that haven't done anything else yet).

  • The paper explicitly covers this: various memory CoW/snapshot mechanisms are probably faster and safer than the zygote pattern. As it stands, getting the zygote pattern correct and safe is something you have to plan for upfront. You can’t retrofit it, which is why the paper mentions it has poor composability. Also, the advantages of the zygote pattern can be overstated: the memory sharing benefit is minimal because it has to happen so early, and modern OSes already transparently CoW duplicate pages in the background.

  • And so easy to turn into a bottleneck.

    Yes, the zygote pattern makes it easy to turn fork() into a bottleneck: avoiding that requires a lot more discipline and low-level tricks (linker scripts, compiler-specific extensions, custom sections, low-level dependencies on page size that get "fun" on ARM servers).

    If you don't do that, you might wake up with fork() causing latency issues.

  • Unless you want to create a thread in your zygote. Then it breaks down.

    Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.

    • You can create threads in the zygote. It doesn't "break down", but sure, there's a bit more work.

      My trick for that is that the set of threads that I create pre fork have to be suspendable and resumable, preferably lazily (they resume when they are actually needed). So, the zygotes are sitting with those threads suspended. When they become active, they can do work immediately. They might lazily resume those threads as needed.

      There are other idioms for this too.

      > Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.

      Folks have been saying that it's terrible for as long as I can remember. But it's still there, because it's better than the alternatives.

Not sure if fork is outdated or not, but people calling it a “hack” obviously have pretty bad engineering taste.