Comment by cryptonector

4 years ago

fork() is not trivial now. Processes are huge now -- they have huge heaps among other things. Copying all that is expensive. In the 80s we tried COW, but that turns out to be very slow as well. What operating systems do now is immediately copy the resident set, then do COW for the rest of writable memory, but in large, multi-threaded processes, this is still too slow.

Use vfork() or posix_spawn().
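For anyone who hasn't used it, here is a minimal sketch of the posix_spawn() route in C. The helper name spawn_and_wait is made up for illustration, and it assumes the caller only wants the child's exit status; it uses posix_spawnp() so the program is looked up via PATH.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: start argv[0] (looked up via PATH), wait for it,
 * and return its exit status, or -1 on any failure. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid;
    int status;

    /* posix_spawnp() returns an error number directly instead of setting errno.
     * NULL file actions and attributes mean "no extra housekeeping in the child". */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;

    if (waitpid(pid, &status, 0) < 0)
        return -1;

    /* Treat death by signal as a failure in this sketch. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent's address space is never duplicated from the caller's point of view, which is the whole appeal here; how the C library implements that underneath is platform-specific.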

Hrm. Googling "fork linux copy-on-write" seems to find a lot of Stack Overflow answers from 2014-2015 claiming Linux marks pages as copy-on-write when fork() is called. I didn't see anything more recent in the first page of results.

I could see it being worthwhile to immediately copy a few pages, like the top of the stack, but copying the whole resident set seems excessive. Especially since some of that data might not even be written to.

  • So the problem is what happens to the old and new processes after the fork. To CoW, you need to mark all the pages read-only in _both_ old and new, which means that the first write to each page in the caller will now page-fault, since the OS has to lazily copy on both sides. So with true copy-on-write the fixed costs may be low, but the marginal cost per memory write may be high in both parent and child. In this case you can see why the resident set is copied, yes? It’s the smallest amount of memory that guarantees predictable performance subsequent to the call returning.

    • If the parent is threaded and the host has more than one CPU, then fork() == TLB shootdowns, which are slow.

      There's also the cost of all those page faults that the two processes are likely to take to do the copying.

      And lastly there's all sorts of complexity involving multiple parent threads calling fork(), or the child calling fork() again (or vfork()) before calling exec.

      It's just much easier to copy the resident set and mark the address space as being CoW, because now you only have to worry about page faults for pages that are not in core anyways and so were going to fault anyways, and that means you don't have to worry about TLB shootdowns either (if a page is not in core, it's not referenced by any TLB either). You still have the multi-fork issues, but now you can use an atomic reference count on the address space.

    • The classic pattern was that you'd spin up something like Apache, load it full of read-only data, then start forking its children, hoping that they'd share memory with each other.

      With what you describe, you'd share nothing because at the point you've loaded it up, all the data you just loaded is resident. :-(
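Whichever strategy the kernel uses underneath (eager copy, CoW, or a mix), the semantics a program can observe are the same: a write in the child never shows up in the parent. A small C sketch of that, with made-up names (child_write_is_private, looks_shared); note it says nothing about which strategy is actually in use or what it costs.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in writable memory, so it is part of what fork() has to duplicate. */
static volatile int looks_shared = 1;

/* Hypothetical demo: returns 1 if the child's write stayed private to the
 * child, 0 if the parent saw it, and -1 on error. */
int child_write_is_private(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write goes to the child's own copy of the page. */
        looks_shared = 42;
        _exit(looks_shared == 42 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Parent: still sees the value from before the fork. */
    return looks_shared == 1;
}
```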

Hmmm, from <https://www.man7.org/linux/man-pages/man3/posix_spawn.3.html>:

    The posix_spawn() and posix_spawnp() functions provide the
    functionality of a combined fork(2) and exec(3), with some
    optional housekeeping steps in the child process before the
    exec(3).  These functions are not meant to replace the fork(2)
    and execve(2) system calls.  In fact, they provide only a subset
    of the functionality that can be achieved by using the system
    calls.

Also, there's no way to set resource limits in the child process, nor switch user or group ID, using posix_spawn().
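As a sketch of the kind of child-side housekeeping meant here, this is roughly what the fork()/exec route looks like in C when the child needs a lower RLIMIT_NOFILE soft limit before the exec. The helper name run_with_nofile_limit is hypothetical, and exiting with 127 on setup failure is just a convention borrowed from the shell.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv with RLIMIT_NOFILE's soft limit set to
 * soft_limit in the child only. Returns the exit status, or -1 on failure. */
int run_with_nofile_limit(rlim_t soft_limit, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: adjust the limit between fork() and exec. */
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            _exit(127);
        rl.rlim_cur = soft_limit;          /* hard limit left unchanged */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127);                        /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The same slot between fork() and exec is where a setgid()/setuid() call would go; in a multi-threaded parent, only async-signal-safe calls are safe there.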

Using fork() also means you end up with shared ownership of resources like file descriptors, which can have some pretty weird consequences.

  • This is true with all process creation APIs.

    Windows defaults to CLOEXEC semantics and you have to opt in to child processes inheriting open file handles, and that has caused problems.

    Unix defaults to not-CLOEXEC semantics, and that too has caused problems.

    • The Windows default can cause problems because of simple logic bugs.

      The Unix default can cause unsolvable problems because of races between threads.

      You should use CLOEXEC everywhere. Except you can't because you are using libraries.

    • closefrom() comes in handy for this. It's missing on some platforms (notably glibc and mac iirc) but actually not too hard to implement a work-alike.

  • Or more importantly, IPC mechanisms like mutexes. If they're in shared memory, you now have two problems. The runtime of a very, very popular scripting language does this.