Comment by adastra22

4 years ago

...is it?

In its original implementation, fork() was pretty trivial. All it did was create a new process entry in the kernel table, with all the pages and capabilities and such copied from the original process. Then mark all pages as copy-on-write, and return to the caller. Maybe not trivial, but much less complicated than loading an executable file from disk.

My understanding of Linux internals is maybe 20 years out of date, so I am legitimately curious what makes fork() so complicated these days.

fork() is not trivial now. Processes are huge now -- they have huge heaps among other things. Copying all that is expensive. In the 80s we tried COW, but that turned out to be very slow as well. What operating systems do now is immediately copy the resident set, then do COW for the rest of writable memory, but in large, multi-threaded processes, this is still too slow.

Use vfork() or posix_spawn().
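
For the plain "start another program" case, here is a minimal sketch of the posix_spawn() route. It is in Python only because its `os` module wraps the same libc calls (`os.posix_spawnp` needs Python 3.8+, `os.waitstatus_to_exitcode` needs 3.9+). The helper name `spawn_and_wait` is made up for illustration.

```python
import os
import sys


def spawn_and_wait(argv):
    """Start argv[0] with posix_spawnp() and return the child's exit code."""
    # posix_spawnp() looks argv[0] up on PATH, like execvp() would.
    pid = os.posix_spawnp(argv[0], argv, os.environ)
    # Reap the child so it does not linger as a zombie.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "print('hello from the child')"])
    print("child exited with", code)
```

The caller only names the program, its arguments and its environment. There is no window where your own code runs in a copy of the parent, which is the part of fork() that gets expensive and surprising.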

  • Hrm. Googling "fork linux copy-on-write" seems to find a lot of Stack Overflow answers from 2014-2015 claiming Linux marks pages as copy-on-write when fork() is called. I didn't see anything more recent in the first page of results.

    I could see it being worthwhile to immediately copy a few pages, like the top of the stack, but copying the whole resident set seems excessive. Especially since some of that data might not even be written to.

    • So the problem is what happens to the old and new processes after the fork. To CoW, you need to mark all the pages read-only in _both_ old and new, which means the first write to each page in the caller will now page-fault, since the OS has to lazily copy on both sides. So with true copy-on-write the fixed costs may be low, but the marginal cost per memory write may be high in both parent and child. In this case you can see why the resident set is copied, yes? It’s the smallest amount of memory that guarantees predictable performance subsequent to the call returning.


  • Hmmm, from <https://www.man7.org/linux/man-pages/man3/posix_spawn.3.html>:

        The posix_spawn() and posix_spawnp() functions provide the
        functionality of a combined fork(2) and exec(3), with some
        optional housekeeping steps in the child process before the
        exec(3).  These functions are not meant to replace the fork(2)
        and execve(2) system calls.  In fact, they provide only a subset
        of the functionality that can be achieved by using the system
        calls.
    

    Also, there's no way to set resource limits in the child process, nor switch user or group ID, using posix_spawn().

  • Using fork() also means you end up with shared ownership of resources like file descriptors, which can have some pretty weird consequences.

    • This is true with all process creation APIs.

      Windows defaults to CLOEXEC semantics and you have to opt in to child processes inheriting open file handles, and that has caused problems.

      Unix defaults to not-CLOEXEC semantics, and that too has caused problems.


    • Or more importantly, IPC mechanisms like mutexes. If they're in shared memory, you now have two problems. The runtime of a very, very popular scripting language does this.
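
To make the file descriptor inheritance point in this subthread concrete, here is a rough sketch in Python. One caveat: Python itself has created descriptors non-inheritable (close-on-exec) by default since 3.4 (PEP 446), so the sketch sets the flag explicitly in both directions, whereas a plain C `pipe()` gives you inheritable descriptors. `child_can_use_fd` is a hypothetical helper name, not an existing API.

```python
import os
import sys


def child_can_use_fd(inheritable):
    """Return True if an exec'd child can still write to our pipe."""
    r, w = os.pipe()
    # Toggle close-on-exec on the write end: inheritable=True clears it.
    os.set_inheritable(w, inheritable)
    pid = os.fork()
    if pid == 0:
        try:
            # Replace the child with a fresh interpreter that only knows
            # the descriptor *number* and tries to write to it.
            code = f"import os; os.write({w}, b'x')"
            os.execv(sys.executable, [sys.executable, "-c", code])
        finally:
            os._exit(127)  # only reached if exec itself failed
    os.close(w)
    # If the child lost the descriptor at exec, every write end is now
    # closed and read() returns b"" (end of file).
    data = os.read(r, 1)
    os.close(r)
    os.waitpid(pid, 0)
    return data == b"x"


if __name__ == "__main__":
    print("inheritable:", child_can_use_fd(True))
    print("close-on-exec:", child_can_use_fd(False))
```

Either default bites someone: with inheritance on, a long-lived child can keep a pipe or socket open after the parent thinks it closed it; with it off, a child that was supposed to receive a descriptor silently doesn't.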

> Then mark all pages as copy-on-write, and return to the caller.

Unix actually copied the memory over initially.
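
Whether the kernel copies eagerly, as described here for early Unix, or lazily with copy-on-write, the contract the program sees is the same: after fork() each side has its own private view of writable memory. A small sketch of that contract in Python follows (`child_write_is_private` is a made-up name). It only shows the observable behaviour and cannot tell you which strategy the kernel actually used.

```python
import os


def child_write_is_private(size=1 << 20):
    """Fork, let the child modify a buffer, and compare both views."""
    buf = bytearray(size)  # writable memory owned by the parent, all zeros
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(r)
            buf[0] = 1  # the child writes to its view of the page
            os.write(w, bytes([buf[0]]))  # report what the child sees
        finally:
            os._exit(0)  # skip the parent's cleanup handlers in the child
    os.close(w)
    seen_by_child = os.read(r, 1)[0]
    os.close(r)
    os.waitpid(pid, 0)
    # The parent's copy must be untouched by the child's write.
    return seen_by_child, buf[0]


if __name__ == "__main__":
    print(child_write_is_private())
```

The call should return `(1, 0)`: the child sees its own write, the parent still sees zero. The cost debate in this thread is about when the copy behind that guarantee gets paid for, not about whether it happens.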