Comment by mrkeen

1 day ago

> fork() is a relatively expensive system call; it must copy the entire process state (including memory) for the child process. Many optimizations have been made over the years, but a fork is still a fundamentally costly operation. To make things worse, a fork() call is often immediately followed by an exec(), which will discard all of that memory that was so carefully copied for the child.

It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.

This was left implicit in the article, but what they mean by copying the process state here is the memory management structures. That's mainly the page tables and the VMAs.

That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory they point to is shared. And walking all those structures to make a copy is still costly.

Redis is the kind of process where this matters a lot, and while fork() doesn't copy the memory, it still needs to copy the page table. For a process holding tens of GBs of RAM, fork() can take a long time, and there's one every time Redis dumps its .rdb file or rewrites its binary log ("AOF").
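For anyone who wants to see the effect on their own machine, here is a minimal sketch of timing fork() after touching a chunk of memory. It assumes a POSIX system and a 4 KiB base page size; `touch_memory` and `time_fork` are names made up for this example, not anything from Redis. The child exits immediately, so this only measures the cost of the fork() call itself as seen by the parent (plus some Python interpreter overhead), not any later copy-on-write faults.

```python
import os
import time

PAGE_SIZE = 4096  # assumed base page size; os.sysconf("SC_PAGE_SIZE") reports the real one


def touch_memory(n_bytes):
    """Allocate n_bytes and write one byte per page so every page is really mapped."""
    buf = bytearray(n_bytes)
    for offset in range(0, n_bytes, PAGE_SIZE):
        buf[offset] = 1
    return buf


def time_fork():
    """Fork once and return (seconds spent in fork() in the parent, child exit code)."""
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        # Child: leave right away without running any Python cleanup handlers.
        os._exit(0)
    elapsed = time.perf_counter() - start
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return elapsed, exit_code
```

To try it, keep a reference to the buffer alive while forking, e.g. `buf = touch_memory(1 << 30)` followed by `print(time_fork())`, and compare against a run with a much smaller buffer.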

Even back in 2012 this blog post showed the high cost of this operation: https://redis.io/blog/testing-fork-time-on-awsxen-infrastruc...

On an m2.xlarge using ~25GB of RAM, fork() took 5.67 seconds. That's a long pause when Redis clients typically experience single-digit msec latency for most operations. Yes, that's only the time needed to copy the page table. It's surprising they don't mention huge pages; they seem like a key consideration here.
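A back-of-the-envelope sketch of why page size matters here. It counts only the last-level page table entries needed to map a region, and assumes 8-byte entries (as on x86-64), 4 KiB base pages, 2 MiB huge pages, and that "25GB" means 25 GiB. The function names are made up for this example, and it ignores the upper table levels and the VMAs.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def leaf_entries(mapped_bytes, page_size):
    """Number of last-level page table entries needed to map mapped_bytes."""
    return -(-mapped_bytes // page_size)  # ceiling division


def leaf_table_bytes(mapped_bytes, page_size, entry_size=8):
    """Bytes of last-level page table, assuming entry_size bytes per entry."""
    return leaf_entries(mapped_bytes, page_size) * entry_size


mapped = 25 * GIB
print(leaf_entries(mapped, 4 * KIB), leaf_table_bytes(mapped, 4 * KIB) // MIB)  # 6553600 50
print(leaf_entries(mapped, 2 * MIB), leaf_table_bytes(mapped, 2 * MIB) // KIB)  # 12800 100
```

So under these assumptions it's roughly 6.5 million entries (about 50 MiB of page tables) to walk and copy with 4 KiB pages, versus 12,800 entries (about 100 KiB) with 2 MiB pages. How much of that turns into a shorter fork() in practice depends on the kernel and on whether the memory is actually backed by huge pages.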

No doubt hardware is faster 14 years later, but Redis instances likely use more RAM too. It'd be interesting to see this benchmark revisited.

> It's weird to leave out a mention of copy-on-write

For the intended audience of such a paper this is base knowledge.

Even with copy-on-write, fork() still has to pay the setup cost for COW. If the parent process has a lot of busy threads (e.g. Java), you can end up doing a lot of unnecessary COW before exec() fires.

  • Isn't that what vfork tried to address? No COW, the child starts in its parent's address space and only gets its own after calling exec.

    • Yes, the next sentence in TFA is:

      > Attempts (such as vfork()) have been made over the years to optimize for this case, but the pattern still is more expensive than it could be.

      Basically vfork does a "stop the world": the caller is suspended until the child calls exec() or exits.
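To make the fork-then-exec pattern from the quote concrete, here is a small sketch of the two ways to start a child from Python. `fork_then_exec` and `spawn_directly` are names invented for this example. `os.posix_spawn` is not mentioned in the thread; whether it avoids duplicating the parent's page tables (for instance by using a vfork-like mechanism) depends on the C library and kernel, so treat that part as an assumption about your platform.

```python
import os
import sys


def fork_then_exec(argv):
    """Classic pattern: fork() duplicates the process state, then exec() discards it."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(argv[0], argv)
        finally:
            # Only reached if execv() failed; never fall back into the parent's code.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


def spawn_directly(argv):
    """Ask the C library to create the child and load the new program in one step."""
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Both take a full argv with an absolute path to the program, e.g. `spawn_directly([sys.executable, "-c", "pass"])`, and return the child's exit code.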

It says state. Copy-on-write still means fork() is O(number of page table entries) even if you don't copy the contents. It's a well-known issue that forking a program with a large virtual memory size is slow.

  • It says "(including memory)". It's pretty natural to read this as "(including the contents of allocated pages)".

  • On modern hardware a COW page copy should only take 1-5ms. Redis forks to save the db to disk and it's been a solid design choice.

    I guess it depends on how sensitive your application is to main thread pauses.