Comment by swiftcoder

7 years ago

Fork has really weird semantics, and a lot of fun gotchas around managing resources. Good riddance?

Not even just the semantics, the performance is awful. Even when the fork is virtual (as any modern fork is) and there's no memory copying because it's COW, all the kernel page tables still need to be copied and for a multi-GB process that's nontrivial. That's why any sane large service that needs to fork anything will early on start up a slave subprocess whose only job is to fork quickly when the master process needs it.

  • >all the kernel page tables still need to be copied and for a multi-GB process that's nontrivial

    Only in the pathological case where the large process is backed solely by 4 KB pages. The hardware has long supported large pages (on x86 since the Pentium Pro, if memory serves) and huge pages. The popular OSes (Linux 2.6+ and Windows 2003+) also support large and huge pages. A 2 GB process can easily be three pages: r/x code, r/w stack, r/w data (2 GB). Granted, it gets a bit more complex if mmapped I/O or JIT are used, but since both are mature technology now, it's fine to point fingers at any inefficiency and demand better. Another caveat would probably be shared libraries loading at separate address ranges, which, IMO, is another reason to ditch shared libraries for good.

    Contrary to popular wisdom, OS research is still relevant.

    • You want to ditch shared libraries and mmap, and map your big processes using GB pages, just to make fork fast again (even though speed is neither its main nor its only drawback)?

      OS research might be relevant, and it's good that some people have wild ideas, but honestly I doubt this one will go anywhere :P

    • > Contrary to popular wisdom, OS research is still relevant.

      Is it really popular wisdom though, or is it the opinion of one person that got hyped up, much like the subpar programming language that same person worked on?

  • > That's why any sane large service that needs to fork anything will early on start up a slave subprocess whose only job is to fork quickly when the master process needs it.

    I don't think that's (entirely) true. This is more because a large service with some potent master process will have said process Do Stuff(tm) that involves opening files, starting threads, installing signal handlers, or other things that have to be dealt with one way or another when forking a worker (or any other child) process. It's therefore much simpler to fork a child spawner off the master process early on, before it has done anything. You significantly reduce your chances of screwing up if you have nothing to clean up.
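
A minimal sketch of that fork-early pattern, in Python for brevity. The `Spawner` class, its `run`/`close` methods, and the one-JSON-line-per-request protocol are hypothetical names invented for this illustration, not anything from the thread; a real service would also want error reporting and probably file-descriptor passing.

```python
import json
import os
import socket
import sys


def _serve(sock):
    """Helper loop: read one JSON argv per line, fork+exec it, reply with its exit code."""
    reader = sock.makefile("r")
    writer = sock.makefile("w")
    for line in reader:
        argv = json.loads(line)
        pid = os.fork()  # cheap: the helper never grew large
        if pid == 0:
            try:
                os.execv(argv[0], argv)  # returns only by raising
            finally:
                os._exit(127)
        _, status = os.waitpid(pid, 0)
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
        writer.write(json.dumps(code) + "\n")
        writer.flush()


class Spawner:
    """Hypothetical wrapper: create it at startup, before the process does anything else."""

    def __init__(self):
        parent_sock, child_sock = socket.socketpair()
        self._pid = os.fork()
        if self._pid == 0:
            try:
                parent_sock.close()  # otherwise the helper never sees EOF
                _serve(child_sock)
            finally:
                os._exit(0)
        child_sock.close()
        self._sock = parent_sock
        self._reader = parent_sock.makefile("r")
        self._writer = parent_sock.makefile("w")

    def run(self, argv):
        """Ask the helper to run argv (absolute path first) and return its exit code."""
        self._writer.write(json.dumps(argv) + "\n")
        self._writer.flush()
        return json.loads(self._reader.readline())

    def close(self):
        self._writer.close()
        self._reader.close()
        self._sock.close()  # helper sees EOF and exits
        os.waitpid(self._pid, 0)


if __name__ == "__main__":
    spawner = Spawner()  # fork the helper first, while there is nothing to clean up
    print(spawner.run([sys.executable, "-c", "pass"]))
    spawner.close()
```

The point of the design is that the helper is forked while the parent has no threads, buffers, or extra descriptors, so every later fork happens in a small, quiet process.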

    • That's true, it's not the only reason. Dealing with threads and buffers and pthread_atfork and the associated heartbreak is a biggie also. But the performance is nothing to laugh at.

      I just did a quick test: a 100 MB process generally takes >2 ms to fork, while a process of 1 MB or less takes 70 µs. It seems like it's pretty much linear with process size.
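
A rough way to repeat that kind of measurement, as a sketch rather than the commenter's actual test (which was not posted). The `time_fork` helper and its parameters are made up for this example; it touches every page of a ballast buffer so the page tables are populated, then times how long `os.fork()` takes to return in the parent.

```python
import os
import time


def time_fork(ballast_mb, repeats=5):
    """Median seconds for os.fork() to return in a parent holding ~ballast_mb MB."""
    ballast = bytearray(ballast_mb * 1024 * 1024)
    page = os.sysconf("SC_PAGE_SIZE")
    # Touch one byte per page so each page is actually mapped, not just reserved.
    for offset in range(0, len(ballast), page):
        ballast[offset] = 1

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child does nothing; only the fork cost is of interest
        elapsed = time.perf_counter() - start
        os.waitpid(pid, 0)
        samples.append(elapsed)
    samples.sort()
    return samples[len(samples) // 2]


if __name__ == "__main__":
    for size_mb in (1, 10, 100):
        print(f"{size_mb:4d} MB: {time_fork(size_mb) * 1e6:.0f} us")
```

The absolute numbers depend on the kernel, the page size, and whether transparent huge pages are in play, so treat any output as indicative only.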

  • The performance is awful, but in return you get the COW memory you mentioned. That's a pretty huge benefit for a lot of programs with huge, seldom-changing memory state at startup. If those programs want to parallelize themselves without duplicating that memory or paying startup time/CPU overhead, fork() is a pretty handy way to achieve that.

  • These days, you can usually start a process without forking, through posix_spawn/vfork. Although, I gather some servers still fork so they can set the child's current working directory more easily.
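
For comparison, a small sketch of the posix_spawn route using Python's `os.posix_spawn`, which wraps the POSIX call. The `spawn_capture` helper is a hypothetical name for this example. It shows that redirection is expressed as a list of file actions handed to the spawn call instead of code run between fork and exec.

```python
import os
import sys


def spawn_capture(argv):
    """Spawn argv without fork(), capture its stdout, return (exit_code, output)."""
    read_fd, write_fd = os.pipe()  # both ends are close-on-exec by default
    # File action: in the child, make fd 1 (stdout) a copy of the pipe's write end.
    file_actions = [(os.POSIX_SPAWN_DUP2, write_fd, 1)]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_fd)  # the parent keeps only the read end
    with os.fdopen(read_fd, "rb") as reader:
        output = reader.read()
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return code, output


if __name__ == "__main__":
    print(spawn_capture([sys.executable, "-c", "print('spawned')"]))
```

Anything that is not covered by a file action or spawn attribute has to be done some other way, which is where the working-directory complaint comes from.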

Between the `fork()` and an `exec()`, I can:

    * redirect stdin, stdout, and stderr
    * open files that might be needed and close files that aren't
    * change process limits
    * drop privileges 
    * change the root directory
    * change namespaces

And there are a few other things I am probably forgetting.
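
A sketch of what that looks like in practice, in Python, covering only the steps that work without root. The `run_child` helper and its parameters are invented for this example. The privileged steps from the list (dropping privileges, changing the root directory, changing namespaces) are only marked in a comment, since they need root to demonstrate.

```python
import os
import resource
import sys


def run_child(argv, stdout_path):
    """fork(), adjust the child's environment, then exec argv; return its exit code."""
    pid = os.fork()
    if pid == 0:
        try:
            # Redirect stdout to a file.
            fd = os.open(stdout_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            os.dup2(fd, 1)
            os.close(fd)  # close the descriptor that is no longer needed

            # Change a process limit: disable core dumps for the child only.
            _, hard = resource.getrlimit(resource.RLIMIT_CORE)
            resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

            # With root, os.chroot(), os.setgid() and os.setuid() would go here,
            # in that order, before the exec.

            os.execv(argv[0], argv)  # returns only by raising
        finally:
            os._exit(127)  # never fall back into the parent's code
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print(run_child([sys.executable, "-c", "print('hello')"], "child.out"))
```

Each step is an ordinary system call made in the child, so none of it affects the parent.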

  • And ideally all these things become properties of a configuration object which is then used to spawn a process.

    • I'm not sure I agree it's the ideal way to do it. That's a heck of a lot of work for one function to do, and it necessarily duplicates the functionality of a ton of other functions. And that's ignoring the fact that forking without ever exec'ing can be really useful in many cases.

      I haven't yet read the paper, but considering the incredible simplicity from the programmer's PoV that fork provides, and the fact that at least Linux makes it pretty god damn fast, especially compared to Windows' non-forking model, I can't really see myself agreeing with their conclusion.
