The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
Although it does say that vfork() is difficult to use safely, while the gist recommends it? I think there is still some clarity needed around the use cases.
Fork today is a convenient API for a single-threaded process with a small memory footprint and simple memory layout that requires fine-grained control over the execution environment of its children but does not need to be strongly isolated from them. In other words, a shell. It’s no surprise that the Unix shell was the first program to fork [69], nor that defenders of fork point to shells as the prime example of its elegance [4, 7]. However, most modern programs are not shells. Is it still a good idea to optimise the OS API for the shell’s convenience?
As u/amaranth pointed out, my gist predates the MSFT paper, which mostly explains why I didn't reference it. Though, to be fair, I saw that paper posted here back in 2019, and I commented on it plenty (13 comments) then. I could have edited my gist to reference it, and, really, probably should have. Sometime this week I will add a reference to it, as well as this and that HN post, since they are clearly germane and useful threads.
I vehemently disagree with those who say that vfork() is much more difficult to use correctly than fork(). Neither is particularly easy to use though. Both have issues to do with, e.g., signals. posix_spawn() is not exactly trivial to use, but it is easier to use it correctly than fork() or vfork(). And posix_spawn() is extensible -- it is not a dead end.
My main points are that vfork() has been unjustly vilified, fork() is really not good, vfork() is better than fork(), and we can do better than vfork(). That said, posix_spawn() is the better answer whenever it's applicable.
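For anyone who hasn't used it, here is a minimal sketch of the posix_spawn() route for the plain "run a program and wait for it" case. `spawn_and_wait` is a made-up helper name, and the child simply inherits the parent's environment; treat it as an illustration under those assumptions, not a complete wrapper.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper: run argv[0] (looked up on PATH) with the parent's
 * environment and return its exit status. Returns -1 if the spawn or the
 * wait failed, or if the child did not exit normally.
 */
int spawn_and_wait(char *const argv[])
{
    pid_t pid;

    /* posix_spawnp() reports failure through its return value, not errno. */
    int rc = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing NULL for the file actions and attributes means "inherit everything", which is the same default fork() gives you. Both arguments are where you would start being explicit.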
Note that the MSFT paper uncritically accepts the idea that vfork() is dangerous. I suspect that is because their focus was on the fork-is-terrible side of things. Their preference seems to be for spawn-type APIs, which is reasonable enough, so why bother with vfork() anyways, right? But here's the thing: Windows WSL can probably get a vfork() added easily enough, and replacing fork() with vfork() will generally be a much simpler change than replacing fork() with posix_spawn(), so I think there is value in vfork() for Microsoft.
Use cases for vfork() or afork()? Wherever you're using fork() today to then exec, vfork() will make that code more performant and it generally won't take too much effort to replace the call to fork() with vfork(). afork() is for apps that need to spawn lots of processes quickly -- these are rare apps, but uses for them do arise from time to time. But also, afork() should be easier to use safely than vfork(). And, again, for Microsoft there is value in vfork() as a smaller change to Linux apps so they can run well in WSL.
BTW, see @famzah's popen-noshell issue #11 [0] for a high-perf spawn use case. I linked it from my gist, and, in fact, the discussion there led directly to my writing that gist.
You see, an operating system as commonly conceived has at least two major jobs:
- abstract away underlying hardware
- safely multiplex resources
And do the above with as little overhead as possible.
Now the thing is: whenever you have multiple goals, you need to make trade-offs, and you aren't as good at any one goal as you could be.
So the exokernel folks made a suggestion in the 90s: let the OS concentrate on safely multiplexing resources, and do all the abstracting in user level libraries.
Normal application programming would mostly look the same as before, your libraries just do more of the heavy lifting. But it's much easier to swap out different libraries than it is to swap out kernel-level functionality.
That vision never caught on with mainstream OSes. But: widespread virtualisation made it possible. You can see hypervisors like Xen as exokernel OSes that do the bare minimum required to safely multiplex, but don't provide (many) abstractions.
Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
Meanwhile, programs with more complex requirements have to work around these APIs. And many programs call other programs, or otherwise have to do tricky process lifecycle management.
The lowest-level APIs should, in theory, cater to the most complex cases, not to the simplest ones. This doesn't prevent a simpler API from existing, but catering to a simple use case in the primitives does hinder more complex needs.
(I think the more nuanced point is that the OS itself might not have a much better design available in any case. Unixes have a lot of neat stuff, but it's a lot of "design by user feature request", and "standardize 4 slightly different ways of doing things", so there is a lot of weirdness and it's hard to have The Perfect API in that case)
> Yes, but why is this characterized as something negative?
Unfortunately, the text does not provide sufficient context. Shells are not properly supported in any OS (except probably Plan 9), since 1. the OS provides no enforcement or convention for a CLI interface (there is no enforced encoding standard or anything checkable), 2. the OS provides no rules for file names to be shell-friendly, and 3. there are no dedicated communication channels towards shells or between programs and shells.
So all in all, shells remain a hack around the system that is simple to implement initially but annoying to use and write in many corner cases.
> Shells simply developed features that users required of them.
Cross out "simply" and call it convenience+arbitrary complex scripting glue for 4 main goals:
1. piping
2. basic text processing
3. basic job control
4. path hackery
In Ninja, which needs to spawn a lot of subprocesses but is otherwise not especially large in memory and which doesn't use threads, we moved from fork to posix_spawn (which is the "I want fork+exec immediately, please do the smartest thing you can" wrapper) because it performed better on OS X and Solaris:
The issue with posix_spawn is that you can't close all descriptors before exec. This is especially an issue as most libraries are still unaware they need to open every single handle with the close-on-exec flag set.
> Long ago, I, like many Unix fans, thought that fork(2) and the fork-exec process spawning model were the greatest thing, and the Windows sucked for only having exec() and _spawn(), the last being a Windows-ism.
I appreciate this quite a bit. Vocal Unix proponents tend to believe that anything Unix does is automatically better than Windows, sometimes without even knowing what the Windows analogue is. Programming in both is necessary to have an informed opinion on this subject.
The one thing I miss most on Unix: the unified model of HANDLEs that enables you to WaitForMultipleObjects() with almost any system primitive you could want, such as an event with a socket (blocking I/O + a shutdown notification) in one call. On Unix, a flavor of select() tends to be the base primitive for waiting on things to happen, which means you end up writing adapter code for file descriptors to other resources, or need something like eventfd.
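For comparison, this is roughly what the "socket plus shutdown notification" wait looks like on Linux with eventfd and poll(). It is a sketch only: eventfd is Linux-specific, and the enum and function names are made up for the example.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

enum wake_reason {
    WAKE_ERROR = -1,
    WAKE_TIMEOUT = 0,
    WAKE_DATA = 1,
    WAKE_SHUTDOWN = 2
};

/* Create the "event" object: an eventfd with an initial count of zero. */
int make_shutdown_event(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Signal the event by adding 1 to the eventfd counter. */
int signal_shutdown(int event_fd)
{
    uint64_t one = 1;
    return write(event_fd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Wait until data_fd is readable or the shutdown event fires.
 * If both are ready, shutdown wins. */
enum wake_reason wait_for_data_or_shutdown(int data_fd, int event_fd,
                                           int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = data_fd,  .events = POLLIN },
        { .fd = event_fd, .events = POLLIN },
    };

    int n = poll(fds, 2, timeout_ms);
    if (n < 0)
        return WAKE_ERROR;
    if (n == 0)
        return WAKE_TIMEOUT;
    if (fds[1].revents & POLLIN)
        return WAKE_SHUTDOWN;
    if (fds[0].revents & (POLLIN | POLLHUP))
        return WAKE_DATA;
    return WAKE_ERROR;
}
```

It works, but note that the event had to be turned into a file descriptor first, which is the adapter code being complained about.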
Things I don't miss from Windows at all: wchar_t everywhere. :)
- SIDs
- access tokens
(like struct cred / cred_t in Unix kernels,
but exposed as a first-class type to user-land)
- security descriptors
(like owner + group mode_t + ACL in Unix land,
but as a first-class type)
- HANDLEs, as you say
- HANDLEs for processes
Many other things, Windows got wrong. But the above are far superior to what Unix has to offer.
Superficial silliness like allocating 48 bits to encode integers in [0,18] aside, what problem do structured SIDs actually solve? I’ve been trying to figure that out for the last couple of days and I still don’t get it, possibly because the Windows documentation doesn’t seem to actually say it anywhere.
I completely agree with having UUIDs or something in that vein for user and group IDs and will not dismiss IDs for sessions and such in the same namespace (although haven’t actually seen a use case for those), but structured variable-length SIDs as NT defines them just don’t make sense to me.
These decisions here are all older than Windows and weren't in reaction to them. It's in reaction to the awful mainframe ways to spawn processes like using JCL.
We've sort of come back to that with kubernetes yaml files to describe how to launch an executable in a specific env and all of the resources it needs. Like it can be traced explicitly, the Borg paper references mainframes and knowingly calls the language that would be replaced by kubernetes's yaml files 'BCL' instead of z/OS's JCL.
Plan9 is a lot older than Kubernetes and has the same namespacing of all processes. So it's not impossible to have a "*nix like" OS that still has mainframe-like separation of concerns to ease deployment.
Having written server software that had to work in both places, I always loved the simplicity of fork(2) / vfork(2) relative to Windows CreateProcess. Threading models in Win32 were always a pain. Which only got worse with COM (remember apartment threading? rental threading? ugh)
Back in the 90's, processes had smaller memory footprints, and every UNIX my software supported had COW optimizations. So the difference between fork(2) and vfork(2) was not very large in practice. Often, the TCP handshake behind the accept(2) call was of more concern than how long it would take fork(2) to complete. Of course, bandwidth has increased by a factor of 1000 since then, so considerations have changed.
It's how CreateProcess handles command-line arguments that infuriates me - not as an argv array but as one big string. It's so difficult to work around quoting.
The problem with WaitForMultipleObjects (WFMO) is that it's limited to 64 handles, which basically makes it useless for anything where the number of handles is dynamic as opposed to static. There are ways to get around this limitation by grouping handles into trees, but it's tremendously clunky.
UCS-2 seemed like a good(ish) idea at the time when Unicode's scope didn't include every possible human concept represented in icon form and UTF-8 hadn't yet been spec'd on a napkin by the first adults to bother thinking about the problem.
Even in 1989, it should have been clear that 16 bits were not enough to encode all of the Chinese characters, let alone encoding all the human scripts. Unicode today encodes 92,865 Chinese characters (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs).
The only reason anybody would think UCS-2 was a good idea was that they did not consult a single Chinese or Japanese scholar on Chinese characters.
where `ptr` might be an index into a table (much like a file descriptor) or maybe a pointer in kernel-land (dangerous sounding!) and `verifier` is some sort of value that can be used by the kernel to validate the `ptr` before "dereferencing" it.
On Unix the semantics of file descriptors are dangerous. EBADF can be a symptom of a very dangerous bug where some thread closed a still-in-use FD, then an open() gets the same FD number, and now maybe you get file corruption. This particular type of bug doesn't happen with HANDLEs.
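The descriptor-reuse hazard is easy to see in a few lines. This sketch (the function name is invented) relies on the POSIX rule that open() returns the lowest-numbered free descriptor, so in a single-threaded run the second open() gets the number that was just closed.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if the descriptor number released by close() was handed out
 * again by the next open(), 0 if not, -1 on error.
 */
int fd_number_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    if (first == -1)
        return -1;

    /* Any copy of `first` still held elsewhere is now a stale number. */
    close(first);

    int second = open("/dev/null", O_RDONLY);
    if (second == -1)
        return -1;

    /* A write through the stale number would now hit the new file. */
    int reused = (second == first);
    close(second);
    return reused;
}
```

In a threaded program the stale copy lives in another thread, and nothing tells it the number now refers to a different file.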
Since you said anything... This is not strictly related to the article but your expertise seems to be in the right area.
I have a process that executes actions for users, at the moment that process runs as root until it receives a token indicating an accepted user, then it fork()s and the fork changes to the UID of the user before executing the action.
Is there a better way? I hadn't actually heard of vfork() before reading this article. I'm guessing maybe you could do a threaded server model where each thread vfork()s. I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
If the parent is threaded, then yes, vfork() will be better. You could also use posix_spawn().
As to "becoming a user", that's a tough one. There are no standard tools for this on Unix. The most correct way to do it would be to use PAM in the child. See su(1) and sudo(1), and how they do it.
> I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
Yes, fork() only copies the calling thread. The other threads' stacks also get copied (because, well, you might have pointers into them, who knows), but there will only be one thread in the child process.
vfork() also creates only one thread in the child.
There used to be a forkall() on Solaris that created a child with copies of all the threads in the parent. That system call was a spectacularly bad idea that existed only to help daemonize: the parent would do everything to start the service, then it would forkall(), and on the parent side it would exit() (or maybe _exit()). That is, the idea is that the parent would not finish daemonizing (i.e., exit) until the child (or grandchild) was truly ready. However, there's no way to make forkall() remotely safe, and there's a much better way to achieve the same effect of not completing daemonization until the child (or grandchild) is fully ready.
In fact, the daemonization pattern of not exiting the parent until the child (or grandchild) is ready is very important, especially in the SMF / systemd world. I've implemented the correct pattern many times now, starting in 2005 when project Greenline (SMF) delivered into OS/Net. It's this: instead of calling daemon(), you need a function that calls pipe(), then fork() or vfork(), and then on the parent side calls read() on the read end of the pipe, while on the child side, if fork() was used, it returns immediately so the child can do the rest of the setup work. Finally the child should write one byte into the write end of the pipe to tell the parent it's ready so the parent can exit.
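A rough sketch of that readiness handshake, using fork(). To keep it self-contained the child's setup work is passed in as a callback instead of having the function return in the child, and the parent returns to its caller where a real daemonizer would exit. All names here are hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for the service's real setup work. */
int example_setup_ok(void)   { return 0; }
int example_setup_fail(void) { return -1; }

/*
 * Fork a child, let it run setup(), and block the parent until the child
 * writes one byte to say it is ready. Returns the child's pid once it is
 * ready, or -1 if the child went away without reporting readiness.
 */
pid_t start_and_wait_ready(int (*setup)(void))
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        close(fds[0]);
        if (setup() != 0)
            _exit(1);              /* parent sees EOF instead of a byte */
        char ready = 1;
        if (write(fds[1], &ready, 1) != 1)
            _exit(1);
        close(fds[1]);
        /* A real daemon would enter its service loop here. */
        _exit(0);
    }

    /* Parent: close our write end so read() returns EOF if the child dies. */
    close(fds[1]);
    char byte;
    ssize_t n;
    do {
        n = read(fds[0], &byte, 1);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);

    if (n == 1)
        return pid;                /* child is ready; a daemonizer exits now */

    waitpid(pid, NULL, 0);         /* child failed before becoming ready */
    return -1;
}
```

The important detail is closing the write end in the parent before reading, otherwise a dead child never produces EOF.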
What about fork(2) for network servers? I've written parallel network servers two ways; open the socket to listen on and call fork() N times for the desired level of parallelism, and just create N processes and use SO_REUSEPORT. I prefer the former. I suppose there is hidden option C of "have a simple process that opens the listening port and then vfork/execs each worker" I find that to be a bit strange because the code will be split into "things that happen before listening on the port" (which includes, e.g. reading configuration files) and "things that happen after listening on the port" (which includes, e.g. reading configuration files)
It's a bit opinionated. It's meant to get a reaction, but also to have meaningful and thought-provoking content, and I think it's correct in the main too. Anyways, hope you and others enjoy it.
fork(2) makes a lot more sense when you realize its heritage. It came from a land before Unix supported full MMUs. In this model, to still have per process address spaces and preemptive multitasking on what was essentially a PC-DOS level of hardware, the kernel would checkpoint the memory for a process, slurp it all out to DECtape or some such, and load in the memory for whatever the scheduler wanted to run next. Its simplicity of being process checkpoint based wasn't a reaction to Windows-style calls (which wouldn't exist for almost a couple decades), but instead to mainframe process spawning abominations like JCL. The idea "you probably want most of what you have so force a checkpoint, copy the checkpoint into a new slot, and continue separately from both checkpoints" was soooo much better than JCL and its tomes of incantations to do just about anything.
vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec(). All of the bugs that causes are super fun to chase, lemme tell you. AFAIC, about the only valid use for vfork now is nommu systems where fork() is incredibly expensive compared to what is generally expected.
clone(2) is great. Start from a checkpoint like fork, but instead of semantically copying everything, optionally share or not based on a bitmask. Share a tgid, virtual address space, and FD table? You just made a thread. Share nothing? You just made a process. It's the most 'mechanism, not policy' way I've seen to do context creation outside of maybe the L4 variants and the exokernels. This isn't an old holdover, this is how threads work today: processes spawned that happen to share resources. Modern archs on Linux don't even have a fork(2) syscall; it all happens through clone(2). Even vfork is clone set to share virtual address space and nothing else that fork wouldn't share. Namespaces are a way to opt into not sharing resources that normally fork would share.
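To illustrate the bitmask idea, here is a small Linux-only sketch using the glibc clone() wrapper. With CLONE_VM the child's write lands in the parent's memory; with no sharing flags it behaves like fork() and the parent never sees the write. The function names, the 64 KiB stack size, and the assumption that the stack grows downward are all choices made for the example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Child entry point: store a marker through the pointer we were given. */
static int child_fn(void *arg)
{
    *(volatile int *)arg = 42;
    return 0;
}

/*
 * Create a child with the given clone flags, wait for it, and return what
 * the parent then sees in the variable the child wrote to.
 * Returns -1 on failure.
 */
int clone_and_observe(int flags)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    volatile int shared = 0;

    /* Pass the top of the buffer: assumes a downward-growing stack.
     * SIGCHLD as the exit signal lets plain waitpid() reap the child. */
    pid_t pid = clone(child_fn, stack + stack_size, flags | SIGCHLD,
                      (void *)&shared);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);
    if (reaped != pid)
        return -1;

    return shared;
}
```

Calling `clone_and_observe(CLONE_VM)` should yield 42 and `clone_and_observe(0)` should yield 0, since only the first child shares the parent's address space.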
And I don't see what afork gets you that clone doesn't, except afork isn't as general.
> fork(2) makes a lot more sense when you realize its heritage.
I think it only makes sense when you consider its heritage. It has ALL the wrong defaults for what it's almost always used for these days: running a subprocess.
It copies "random" kernel data structures like open FDs, etc. and you have to be very careful about closing the ones you don't want to be inherited, etc. etc. It may copy things that weren't even a relevant concept when you wrote your program.
The correct thing to do is to be very explicit about what you want to pass on to the subprocess and to choose safe defaults for programs compiled against the old API when you extend it. (Off the top of my head, the only thing I'd want to be automatically inherited by default would be the environment and CWD.)
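With today's APIs, the closest you get to "explicit" is posix_spawn() with file actions and a hand-built environment. A sketch, with an invented helper name and an assumed PATH value for the child; it is still opt-out rather than opt-in for every descriptor it doesn't mention, which is the complaint.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv with stdout redirected into a pipe and an
 * explicit, minimal environment. Captures up to cap-1 bytes of output in
 * `out` (NUL-terminated). Returns the exit status, or -1 on failure.
 */
int spawn_capture(char *const argv[], char *out, size_t cap)
{
    int fds[2];
    if (cap == 0 || pipe(fds) == -1)
        return -1;
    out[0] = '\0';

    posix_spawn_file_actions_t fa;
    if (posix_spawn_file_actions_init(&fa) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    /* In the child: stdout becomes the pipe's write end, and both of the
     * original pipe descriptors are closed. */
    posix_spawn_file_actions_adddup2(&fa, fds[1], STDOUT_FILENO);
    posix_spawn_file_actions_addclose(&fa, fds[0]);
    posix_spawn_file_actions_addclose(&fa, fds[1]);

    /* Assumption for the example: the child only needs PATH. */
    char *const envp[] = { "PATH=/usr/bin:/bin", NULL };

    pid_t pid;
    int rc = posix_spawnp(&pid, argv[0], &fa, NULL, argv, envp);
    posix_spawn_file_actions_destroy(&fa);
    close(fds[1]);
    if (rc != 0) {
        close(fds[0]);
        return -1;
    }

    size_t used = 0;
    ssize_t n;
    while (used < cap - 1 &&
           (n = read(fds[0], out + used, cap - 1 - used)) > 0)
        used += (size_t)n;
    out[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```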
It's 100% the wrong API for spawning processes.
Now, I don't think afork() solves any of these problems, AFAICT. But my personal perspective is that fork() and its derivatives are the wrong starting point in the first place for what they are used for in 99% of all cases.
The behaviour of subprocesses inheriting resources like file descriptors is absolutely bizarre. Why on earth would you want this to be the default?! But we're so used to it, we think it's normal.
IMO clone looks a lot better than screwing with that giant struct and all of the kernel bugs that would exist from validating every goofy way those options could be setup wrong by user space.
The PDP-11 had segment base registers and memory protection, so it wasn't necessary to swap out one process to run another one at the same (virtual) address. It didn't have paging, so it couldn't swap out part of a segment. I think it's true that PDP-11 fork() would stop the process to make a copy of the writable segments, but it didn't have to "checkpoint" the process to a disk or tape. Are you talking about the PDP-7? I don't know anything about the PDP-7.
I agree about vfork(), since I haven't seen a system with segment base registers and no paging in a long time, and about clone(). Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
What's the L4 approach? Construct the state of the process you want to run in some memory and then use a launch-new-thread system call, then possibly relinquish access to that memory?
> Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
clone was literally designed to support posix threads.
> What's the L4 approach?
Capabilities over all of the kernel objects so user space can do safe brain surgery on them. Since everything is capability based including the cap tables you end up duping a cap table, allocating a non running thread, setting registers, and attaching duped cap table. Four syscalls in the minimal case, but it's l4 so they're fairly cheap. Total disclosure, one of my side projects is a kernel with caps and a first class VM to do that in one syscall amortized.
> vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec().
What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you are the author of the code calling vfork() and you know not to do that, so why would that happen?
A: It just wouldn't happen.
And as to exec() failing, this is why exec calls must be followed with calls to either exec() or _exit(), and this is true even if you use fork() instead of vfork(). I.e.:
    /* do a bunch of pre-vfork() setup */
    ...
    int status;
    pid_t pid = vfork();
    if (pid == -1)
        err(1, "Couldn't vfork()");
    if (pid == 0) {
        /* do a bunch of child-side setup */
        execve(...);
        /* oops, ENOENT or something */
        _exit(1);
    }
    /* the child either exec'ed or exited */
    if (waitpid(pid, &status, 0) != pid)
        err(1, "...");
    ...
How do you detect if the child exec'ed or exited? Well, you make a pipe before you vfork(), you set its ends to be O_CLOEXEC, then on the child side of vfork() you write one byte into it if the exec call fails. On the parent side you read from the pipe before you reap the child, and if you get EOF then you know the child exec'ed, and if you get one byte then you know the child exited. The one byte could be an errno value.
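Here is a sketch of that close-on-exec pipe trick. Two liberties, both assumptions of the example: it uses fork() rather than vfork() so the child can freely use a local variable, and it sends the whole errno as an int rather than a single byte. `spawn_checked` is a made-up name.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start argv[0] (looked up on PATH). Returns the child's pid if the exec
 * succeeded (the caller must still waitpid() it). Returns -1 if it did
 * not, with *exec_errno set to the error and the child already reaped.
 */
pid_t spawn_checked(char *const argv[], int *exec_errno)
{
    int fds[2];
    *exec_errno = 0;

    if (pipe(fds) == -1) {
        *exec_errno = errno;
        return -1;
    }
    /* Both ends close-on-exec: a successful exec closes the child's write
     * end, so the parent's read() sees EOF. */
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == -1 ||
        fcntl(fds[1], F_SETFD, FD_CLOEXEC) == -1) {
        *exec_errno = errno;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        *exec_errno = errno;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        close(fds[0]);
        execvp(argv[0], argv);
        /* Only reached if exec failed: report why, then leave. */
        int e = errno;
        if (write(fds[1], &e, sizeof e) != (ssize_t)sizeof e) {
            /* Nothing more we can do; the parent will see a short read. */
        }
        _exit(127);
    }

    close(fds[1]);
    int e = 0;
    ssize_t n;
    do {
        n = read(fds[0], &e, sizeof e);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);

    if (n == 0)
        return pid;                /* EOF: the child exec'ed */

    waitpid(pid, NULL, 0);         /* exec failed: reap the child */
    *exec_errno = (n == (ssize_t)sizeof e) ? e : EIO;
    return -1;
}
```

The parent reads before it reaps, as described, so it can tell "exec failed" apart from "the program ran and exited non-zero".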
No, really, what you say about vfork() is lore, and very very wrong.
That said, vfork() blocks a thread in the parent. The point of my gist was to explain why fork() sucks, why vfork() is much better, and what would be better still.
> And I don't see what afork gets you that clone doesn't, except afork isn't as general.
afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
> What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you're the author of the code calling vfork() and you know not to do that
Within a sentence you described the stack modification. 'It's not a footgun, just don't make mistakes' doesn't hold a lot of water with me.
> No, really, what you say about vfork() is lore, and very very wrong.
Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".
> afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
Ok, but I'm not going to hold it against clone for being a more general solution.
> clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
I agree with this, but there's practical reasons why this isn't the case, mainly around how asking user space for every little thing is expensive, and large sparse structs to copy into kernel space covering basically everything in struct task sounds like a special kind of security hell I would not want to be a part of.
A flag to clone to create an empty process and something like a bunch of io_uring calls or a box program to hydrate the new task state would be really neat, and has been kicked around a bunch. There's just a ton of corner cases that haven't been ironed out.
Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about forking. It's allowed to generate some non-obvious code so long as it acts "as if" it has the same behaviour - but only from the point of view of that theoretical C VM.
For example, in theory _exit(1) could be implemented as longjmp(...) up to a point in some compiler-created top-level function that wraps up main(). Then that wrapper function could perform some steps to communicate the return code to the OS that trashes the stack before actually exiting. After all, if the process is about to exit anyway, what difference does it make if a bunch of memory is fiddled with? We know the answer to this but, from the point of view of the C virtual machine, it's irrelevant.
That particular scenario is unlikely but the point is that compiler implementations and optimisations are allowed to do very non-obvious things. You're only safe if you stick to the rules of the C standard, which this 100% does not.
Stack manipulations are a real problem. Say some parameter to exec after vfork uses stack slots created by the compiler for temporary variables. And sure, you compute those before the call to vfork, but then the compiler applies code motion...
I'm still struggling to understand the point of vfork(). The whole point of fork is to offload work to a different part of your program so the original part can continue to do work. The entire idea fails if it halts the original program for the duration of the child's life. How is this better than just doing a regular function call?
vfork halts the parent until the child exits or calls exec, getting its own address space. In the normal case, you vfork and immediately exec, and the parent continues on with what it was doing. The time between vfork and exec is “special” in that the child is temporarily running in the parent’s address space, then it uses exec to separate and do its own thing.
I've seen an argument for immediately execing and not marking the whole mutable process VA space as 'trap on write', including the thread stack that you're about to immediately write to, if you're going to throw that work away and exec(). There's also 'I want to support cheap forks on a nommu system and vforking is easier to retrofit in'.
The code I currently work on actually has a use of `clone` with the `CLONE_VM` flag to create something that isn't a thread. Since `CLONE_VM` will share the entire address space with the child (you know, like a thread does!) a very reasonable response would be "WAT?!"
What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
We achieved this by using `CLONE_VM` (and a handful of other flags) to give the new "thread-like" entity access to the whole address space. But, we omitted `CLONE_THREAD`, as if we were making a new process. The new "thread-like" entity would not technically be part of the same thread group but would live in the same address space.
We also used two chained `clone()` calls (with the intermediate exiting, like when you daemonise) so that the new "thread-like" wouldn't be a child of the original process.
All this existed before I joined, it's just really cool that it works. I've never encountered such a non-standard use of clone before but it was the right tool for this particular job!
> What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
Sure! I'll try to illustrate the general idea, though I'm taking liberties with a few of the details to keep things simple(r).
Our software (see https://undo.io) does record and replay (including the full set of Time Travel Debug stuff - executing backwards, etc) of Linux processes. Conceptually that's similar to `rr` (see https://rr-project.org/) - the differences probably aren't relevant here.
We're using `ptrace` as part of monitoring process behaviour (we also have in-process instrumentation). This reflects our origins in building a debugger - but it's also because `ptrace` is just very powerful for monitoring a process / thread. It is a very challenging API to work with, though.
One feature / quirk of `ptrace` is that you can't really do anything useful with a traced thread that's currently running - including peeking its memory. So if a program we're recording is just getting along with its day we can't just examine it whenever we want.
First choice is just to avoid messing with the process but sometimes we really do need to interact with it. We could just interrupt a thread, use `ptrace` to examine it, then start it up again. But there's a problem - in the corners of Linux kernel behaviour there's a risk that this will have a program-visible side effect. Specifically, you might cause a syscall restart not to happen.
So when we're recording a real process we need something that:
* acts like a thread in the process - so we can peek / poke its memory, etc via ptrace
* is always in a known, quiescent state - so that we can use ptrace on it whenever we want
* doesn't impact the behaviour of the process it's "in" - so we don't affect the process we're trying to record
* doesn't cause SIGCHLD to be sent to the process we're recording when it does stuff - so we don't affect the process we're trying to record
Our solution is double clone + magic flags. There are other points in the solution space (manage without, handle the syscall restarting problem, ...) but this seems to be a pretty good tradeoff.
2. Set it up from the parent process, it just lies on the operating table passively.
3. Submit it to the scheduler.
This is just....obviously correct. Totally flexible. Totally efficient. Hell, if you really want to fork anything, fork those embryonic processes, which have no active threads! Much safer and easier to understand!
When I was first learning about UNIX and similar OSes I just assumed that this is how things worked because this is the obvious way of doing it. Why would you fork a process, then try to determine which of the two processes you are, then fix whatever the parent process messed up in your global state, and only then execute what you actually wanted to do? That seems insane (I guess until you realize that the main use case is creating /bin/sh).
But even when writing /bin/sh, I don't see why this would get in the way? I was once told earlier Unix didn't even have fork, but something more purpose-made for shells instead.
Sounds a bit like Fuchsia's launchpad library, where you create a launchpad object, do all the setup, and then call launchpad_go to actually start the process. Launchpad doesn't allow arbitrary syscalls in the setup, so in that sense it is maybe closer to a "spawn" interface but with better ergonomics.
I was always disappointed by the performance of fork()/clone().
CompSci class told me it was a very cheap operation, because all the actual memory is copy-on-write, so it's a great way to do all kinds of things.
But the reality is that duplicating huge page tables and hundreds of file handles is very slow. Like tens of milliseconds slow for a big process.
And then the process runs slowly for a long time after that because every memory access ends up causing lots of faults and page copying.
I think my CompSci class lied to me... it might seem cheap and a neat thing to do, but the reality is there are very few usecases where it makes sense.
CS classes (and, far too often, professional programmers) talk about computers like they're just faster PDP-11s with fundamentally the same performance characteristics.
Agreed that these costs can be larger than is perhaps implied in compsci classes (though it's possible that they've changed their message since I took them!)
I suppose it is still essentially free for some common uses - e.g. if a shell uses `fork()` rather than one of the alternatives it's unlikely to have a very big address space, so it'll still be fast.
My experience has been that big processes - 100+GB - which are now pretty reasonable in size really do show some human-perceptible latency for forking. At least tens of milliseconds matches my experience (I wouldn't be surprised to see higher). This is really jarring when you're used to thinking of it as cost-free.
The slowdown afterwards, resulting from copy-on-write, is especially noticeable if (for instance) your process has a high memory dirtying rate. Simulators that rapidly write to a large array in memory are a good example here.
When you really need `fork()` semantics this could all still be acceptable - but I think some projects do ban the use of `fork()` within a program to avoid unexpected costs. If you really have a big process that needs to start workers I guess it might be worth having a small daemon specifically for doing that.
Right, shells are not threaded and they tend to have small resident set sizes. Even in shells though, there's no reason not to use vfork(), and if you have a tight loop over starting a bunch of child processes, you might as well use it. Though, in a shell, you do need fork() in order to trivially implement sub-shells.
Also, mandating copy-on-write as an implementation strategy is a huge burden to place on the host. Now you’ve made the amount of memory a process is using unquantifiable.
It's not necessarily unquantifiable -- the kernel can count the not-yet-copied pages pessimistically as allocated memory, triggering OOM allocation failures if the amount of potential memory usage is greater than RAM. IIUC, this is how Linux vm.overcommit_memory[1] mode 2 works, if overcommit_ratio = 100.
However, if an application is written to assume that it can fork a ton and rely on COW to not trigger OOM, it obviously won't work under mode 2.
> 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM.
> Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
> Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.
POSIX doesn't require that fork() be implemented using copy-on-write techniques. An implementation is free to copy all of the parent's writable address space.
Copy-on-write is supposed to be cheap, but in fact it's not. MMU/TLB manipulations are very slow. Page faults are slow. So the common thing now is to just copy the entire resident set size (well, the writable pages in it), and if that is large, that too is slow.
> clone() is stupid ... the clone(2) design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.
IMHO a bigger problem [2] in practice with clone is that (according to glibc maintainers) once your program calls it, you can't call any glibc function anymore. [1] Essentially the raw syscall is a tool for the libc implementation to use. The libc implementation hasn't provided a wrapper for programs to use which maintains the libc's internal invariants about things like (IIUC) thread-local storage for errno.
The author's aforkx implementation is something that glibc maintainers could (and maybe should) provide, but my understanding is that you can get in trouble by implementing it yourself.
[2] editing to add: or at least a more concrete expression of the problem. Wouldn't surprise me if they haven't provided this wrapper in part because the proliferation the author mentioned makes it difficult for them to do so.
It's really unfortunate that the sanctioned way to call Linux syscalls directly is via the syscall() function (previously the _syscallN macros), and both of those methods set errno on error, which fails in a clone() thread.
If only Glibc provided a syscall_r() or something that returns the raw return value whether it's an error or not.
It is possible to make syscall() (and regular libc syscalls like read()) work in a clone() thread. I use this in performance-optimised I/O code in a database engine, so I know it works, but it requires some ugly Glibc-and-architecture-specific things. Doing it portably doesn't seem to be an option.
The problem with this argument is that the set of programs that just fork() and then exec() is fairly small. Sure, shells are small and do this, but then the article argues that shells are a good use of fork().
In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done (maybe you want to create a new pid ns, you need a separate mm because you're going to allocate a bunch, whatever). Maybe the argument is that programs should never do this? I don't buy that. Then there's a lot of string-slinging through exec().
That's backwards from my experience, which is that most users of fork() only do "fork; child does small amount of setup, eg closing file descriptors; exec". Shells are one of the few programs that do serious work in the child, because the POSIX shell semantics surface "create a subshell and do..." to the shell user, and then the natural way to implement that when you're evaluating an expression tree is "fork, and let the child process continue evaluating as a long-lived process continuing to execute as the same shell binary". (Depending on what's in that sub-tree of the expression, it might eventually exec, but it equally might not.)
Many years back I worked on an rtos that had no fork(), only a 'spawn new process' primitive (it didn't use an MMU and all processes shared an address space, so fork would have been hard). Most unixy programs were easy to port, because you could just replace the fork-tweak-exec sequence with an appropriate spawn call. The shells (bash, ash I think were the two I looked at) were practically impossible to port -- at any rate, we never found it worth the effort, though I think with a lot of effort and willingness to carry invasive local patches it could have been done.
The vast majority of programs that fork are doing fork() followed almost immediately by exec(), to the extent that on macOS for example a process is only really considered safe for exec() after fork() happens. Pretty much nothing else is considered safe.
Yeah; that would be my assumption too. I worked one time on a significant project that benefited from fork() without exec() and it was a monstrous pain - only if you own every single line of code in your project, have centralized resource management, and have no significant library dependencies should you ever consider doing this.
Oh no, there's tons of ProcessBuilder type APIs in Java, Python, and... every major language you can think of.
The problems with fork() become very apparent in any Java apps that try to run external programs, especially in apps that have many threads and massive heaps and are very busy.
> In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done
That's usually going to be done with clone() instead, no? You'll likely want to fiddle with the various flags for those usages and are unlikely to be happy with what fork() otherwise does.
That paper smacks of a Chesterton Fence. They haven't come up with a tested replacement for many of the use cases, i.e.:
These designs are not yet general enough to cover all the use-cases outlined above, but perhaps can serve as a starting point...
yet bullet #1 in the next paragraph is
Deprecate Fork
I think this is a case of security guys being upset about fork gumming-up their experiments. I don't really care about their experiments. The security regime for the past 20 years may have bought us a little more security against eastern bloc hackers, but it hasn't done squat to protect us from Apple, Google, & Microsoft! I have never had a virus de-rail my computing life as much as the automatic Windows 10 upgrade. Robert Morris got 400 hours community service for a relatively benign worm. If that's the penalty scale, Redmond should get actual time in the slammer for Cortana, forced Windows Update, and adding telemetry to Calculator.
You fail to address any of the substance of their paper, or of my gist (TFA), then go on a rant about unrelated things. The authors of that paper deserve better treatment even if you hate Microsoft.
I have to disagree that fork is evil. fork is great because of copy-on-write. I guess my particular use case is not very typical/common though.
I'm running powerflow simulations on a power grid model (several GB of memory to store the model). Copy-on-write means I can make small modifications to this model and run simulations in parallel. Thanks to fork/copy-on-write, I can run 32 simulations in parallel, each with small modifications, without requiring 32 times as much memory.
I saw a bug once where an application would get way slower on macOS after calling fork(). Not just temporarily either; many syscalls would continue to run slowly from the call to fork() until the process exited.
Looking on Stack Overflow, I see a few reports of this behavior[0][1].
I don't think containers should be like jails. Containers should be more like chroots than they are now.
Have you ever tried to run a modern X/whatever app with 3D graphics and audio and DBUS and God knows what else in a container and get it to show up on your desktop? It's a fucking nightmare. I spent over a week trying to get 1Password to run in a container. Somebody decided containers had to be "secure", even though they don't actually exist as a single concept and security was never their primary purpose. If instead containers were used only to isolate filesystem dependencies, we could actually pretend containers were like normal applications and treat them with the same lack of security concern that all the rest of our non-containerized programs are.
Firecracker is the correct abstraction for isolation: a micro-VM. That is the model you want if you want to run an app securely (not to mention reliably, as it can come with its own kernel, rather than needing you to run a compatible host kernel).
I... didn't mean that containers have to have a copy of the operating system inside them, systemd and many other things included. I meant only that they should be created in ways like how the BSDs and Illumos do it.
Is it a fair point to implement first with fork() because of memory protection, then optimize by using benchmarks and potentially vfork() for speed? Benchmark areas can look at synchronous locks, copy-on-write memory, stack sharing, etc.
What are the good practices and security tradeoffs of fork() vs. vfork(), especially in terms of ease of writing correct code? I'd thought that fork() + exec() tends to favor thinking about clearer separation/isolation. For example I've written small daemons using fork() + exec() because it seems safe and easy to do at the start.
In short, fork() mixes poorly with multi-threaded code (and has some security footguns like needing to explicitly unshare elements of environment which may be sensitive, such as file descriptors (suddenly you need to know all the file descriptors used in the whole program from a single place in code)). Here is a well-written comment about fork() from David Chisnall: <https://lobste.rs/s/cowy6y/fork_road_2019#c_zec42d>
Additionally, the fork()+exec() idiom practically forces OS designers into a corner where they simply have to implement Copy-on-Write for virtual memory pages, or otherwise the whole userspace using this idiom is going to be terribly slow. Without the fork()+exec() idiom you don't need CoW to be efficient.
Fork mixes so poorly with multithreaded code that a lot of modern languages that are built from the beginning with threads of one sort or another in mind, like Go, simply won't let you do it. There is no binding to fork in the standard library.
I think you could bash it together yourself with raw syscalls, because that can't really be stopped once you have a syscall interface, but basically the Go runtime is built around assuming it won't be forked. I have no idea what would happen to even a "single threaded" Go program if you forked it, and I have no intention of finding out. The lowest level option given in the syscall package is ForkExec: https://pkg.go.dev/syscall#ForkExec And this is a package that will, if you want, create new event loops outside of the Go runtime's control, set up network connections outside of the runtime's control, and go behind the runtime's back in a variety of other ways... but not this one. If you want this, you'll be looking up numbers yourself and using the raw Syscall or RawSyscall functions.
TL;DR: if another thread is holding a lock when you fork, that lock will be stuck locked in the child, but the thread that was using that lock no longer exists.
So if your multi-threaded program uses malloc you may fork while a global allocation lock is being held and you won't be able to use malloc or free in the child (thread-local caches aside).
There are other problems but this is the basic idea. To be fork-safe you need to allow any thread to just disappear (or halt forever) at any point in your program.
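A minimal sketch of the stuck-lock problem described above, using Python's `os.fork()` and `threading.Lock` as stand-ins for fork(2) and a mutex. The function name `child_can_take_lock` is made up for illustration, and the 0.5 second timeout is an arbitrary choice so the child gives up instead of hanging forever.

```python
import os
import threading
import time


def child_can_take_lock():
    """Fork while another thread holds a lock; report whether the child could take it."""
    lock = threading.Lock()
    held = threading.Event()

    def hold_forever():
        lock.acquire()      # taken by the background thread, never released
        held.set()
        time.sleep(3600)

    threading.Thread(target=hold_forever, daemon=True).start()
    held.wait()             # make sure the lock is held before forking

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, so nobody is left to unlock.
        got_it = lock.acquire(timeout=0.5)
        os._exit(0 if got_it else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

On CPython this is expected to return False: the child inherits the lock in its locked state. Recent Python versions may also emit a DeprecationWarning about forking a multi-threaded process, which is the same concern surfacing at the language level.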
Apologies if this is a silly question, but it seems like there's a false dichotomy here:
(1) You have separate fork() (etc.) and exec(), so that in the brief window in between you can set all the properties of the new process using APIs that exist anyway for controlling your own process.
(2) You have a single call to spawn a new process, but you have a million different options to control every aspect of the new process.
Why not do it this other way instead? Perhaps a bit late now but seems like in retrospect it would give the API simplicity of fork+exec without any of the complications.
(3) There are two steps to run a new process. The first fully sets up its memory and returns a PID, but doesn't start running it. The second call, unfreeze(), allows it to begin executing code. All the usual APIs that exist anyway for controlling your own process take an extra parameter specifying the PID of a frozen child (or -1 for the current process).
There is something about fork which I have never understood. Maybe someone here can explain it to me.
Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork). If you know that you're going to call exec immediately after fork, then all the issues of dealing with the (potentially large) address space of the parent just evaporate because the child process is just going to immediately discard it all.
So why is there not a fork-exec combo? And why has it not replaced fork for 99% of use cases?
And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
Because there are many, many use cases where you don't want to call exec() immediately after fork().
Want to constrain memory usage or CPU time of an arbitrary child process? You have to call setrlimit() before exec(). Privilege separation? Call setuid() before exec(). Sandbox an untrusted child process in some way? Call seccomp() (or your OS equivalent) before exec(). And so on and so forth. Any time you want to change what OS resources the child process will have access to, you'll need to do some set-up work before invoking exec().
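As a rough illustration of the "set-up work before exec()" pattern, here is a sketch in Python that lowers the open-file limit in the child only and then execs a command. The helper name `run_with_fd_limit` and the choice of RLIMIT_NOFILE are assumptions for the example; the same shape would apply to setuid() or a sandboxing call.

```python
import os
import resource


def run_with_fd_limit(limit, argv):
    """fork, lower RLIMIT_NOFILE in the child only, then exec argv; return its stdout."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            os.dup2(write_end, 1)  # child's stdout goes to the pipe
            _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)          # only reached if setup or exec failed
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        output = pipe.read()
    os.waitpid(pid, 0)
    return output.strip()
```

For example, `run_with_fd_limit(64, ["sh", "-c", "ulimit -n"])` should report the lowered soft limit while the parent's own limit stays untouched.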
Windows solves this by adding a bunch of optional parameters to CreateProcess, as well as having two more variants (CreateProcessAsUser and CreateProcessWithLogon). Some of the arguments are complicated enough that they have helper functions to construct them.
I like the more composable fork()->modify->exec() approach of unix, but I wouldn't call either of them really elegant.
To me this feels like a call for more powerful language primitives. i.e. a way to specify some action to take to "set up" the child process that's more explicit and readable than one special behaving in a particularly odd way. I'm imagining closures with some kind of Rust-like move semantics, but not entirely sure.
(if we're speaking in terms of greenfield implementation of OS features)
"Process control in its modern form was designed and implemented within a couple of days. It is astonishing how easily it fitted into the existing system; at the same time it is easy to see how some of the slightly unusual features of the design are present precisely because they represented small, easily-coded changes to what existed. A good example is the separation of the fork and exec functions. The most common model for the creation of new processes involves specifying a program for the process to execute; in Unix, a forked process continues to run the same program as its parent until it performs an explicit exec. The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else."
OK, but why has it not been replaced with something better in the intervening 50 years? There have been a lot of improvements to unix since 1970. Why not this?
I think the reason for fork() and exec() as primitives goes back to the design philosophy of the early days of Unix. Unix tends to favour "easy and simple for the OS to implement" rather than "convenient for user processes to use". (For another example of that, see the mess around EINTR.) fork() in early unix was not a lot of code, and splitting into fork/exec means two simple syscalls rather than needing a lot of extra fiddly parameters to set up things like file descriptors for the child.
There's a bit on this in "The Evolution of the UNIX Time-Sharing System" at https://www.bell-labs.com/usr/dmr/www/hist.html -- "The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else." It says the initial fork syscall only needed 27 lines of assembly code...
(Edit: I see while I was typing that other commenters also noted both the existence of posix_spawn and that quote...)
> Unix tends to favour "easy and simple for the OS to implement"
Well, yeah, but the whole problem here, it seems to me, is that fork is not simple to implement precisely because it combines the creation of the kernel data structures required for a process with the actual initiation of the process. Why not mkprocess, which creates a suspended process that has to be started with a separate call to exec? That way you never have to worry about all the hairy issues that arise from having to copy the parent's process memory state.
Long ago in the far away land of UNIX, fork was a primitive because the primary use of fork was to do more work on the system. You likely were one of three or four other people, at any given moment vying for CPU time, and it wasn't uncommon to see loads of 11 on a typical university UNIX system.
> so why is there not a fork-exec combo
you're looking for system(3). Turns out, most people waitpid(fork()). Windows explicitly handles this situation with CreateProcess[0] which does a way better job of it than POSIX does (which, IMO, is the standard for most of the win32 API, but that's a whole can of worms I won't get into).
> why would anyone ever use vfork?
Small shells, tools that need the scheduling weight of "another process" but not for long, etc. See also, waitpid(fork()).
When you have something with MASSIVE page tables, you don't want to spend the time copying the whole thing over. There's a huge overhead to that.
system(3) is not a good alternative because it indirects through the shell, which adds the overhead of launching the shell as well as the danger of misinterpreting shell metacharacters in the command if you aren’t meticulous about escaping them correctly.
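The metacharacter hazard can be seen with a small Python sketch. `subprocess.run(..., shell=True)` is used here as a stand-in for system(3), since both hand the string to `/bin/sh -c`; the file name in `untrusted` is invented for the example.

```python
import subprocess

untrusted = "report.txt; echo INJECTED"

# system(3)-style: the string is parsed by the shell, so ';' starts a second command.
via_shell = subprocess.run("echo " + untrusted, shell=True,
                           capture_output=True, text=True).stdout

# argv-style: echo receives the whole string as one literal argument.
via_argv = subprocess.run(["echo", untrusted],
                          capture_output=True, text=True).stdout

print(repr(via_shell))
print(repr(via_argv))
```

The first call produces two lines of output because the shell ran a second command; the second passes the text through untouched and never starts a shell.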
`fork` is a classic example, as others have mentioned, as something that was implemented because it was [at the time] easy rather than because it was a good design. In the decades since, we've found there are issues that are caused by the semantics of fork, especially if the most common subsequent system call is `exec`.
If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
POSIX specifies a fork+exec combo called posix_spawn. This is actually used a fair amount, but the reason it isn't used more is because it doesn't support all of the eleventy-billion parameters governing all of the attributes of the spawned process. Instead, these parameters are usually set by calling system calls that change these parameters between fork and exec. These system calls might, for example, change the root directory of a process or attach a debugger. Neither of these are supported by posix_spawn, which only allows the common operations of changing the file descriptors or resetting the signal mask in the list of actions to do.
And this suggests why you might want vfork: vfork allows you to write something that looks like posix_spawn: you get to fork, do your new-process-attribute-setting-flags, and then exec to the new process image, all while being able to report errors in the same memory space.
> If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
Or if you happen to be sane you'll have a single, simple system call to create a blank, suspended child process, and all the regular system calls which operate on process state will take a handle or process "file descriptor" to indicate which process to modify rather than assuming the current process as the target.
This was the ultimate flaw of posix_spawn(). As you point out it doesn't support all the things you might want to tweak in the child process—a consequence of trying to capture every aspect of the initial process state in a single process-creation API rather than distributing the work through the normal system calls so that each new interface or state can be adjusted for child processes in the same way that it's adjusted for the current process.
Whatever you do, though, make sure it's possible to emulate fork() reliably with your "better" replacement. Consider the case of Cygwin where emulated fork() calls can (and frequently do) fail in bizarre ways because the "blank" child process was pre-loaded with some unexpected virtual memory mapping by AV software or other system tasks, with the result that a required DLL or private memory space can't be set up at the same address used in the parent.
fork() without exec() can make sense in the context of a process-per-connection application server (like SSH). I've also used it quite effectively as a threading alternative in some scripting languages.
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(). Like a lot of POSIX APIs, it's kind of overcomplicated, but it does solve a lot of the problems with fork/exec.
> And as long as I'm asking stupid questions, why would anyone ever use vfork?
For processes with a very large address space, fork() can be an expensive operation. vfork() avoids that, so long as you can guarantee that it'll immediately be followed by an exec().
fork with copy-on-write semantics avoids copying the whole address space. It does have to copy some data structures that manage virtual memory and maybe the first level of the paging structure (page directory or whatever).
From "Operating Systems: Three Easy Pieces" chapter on "Process API" (section 5.4 "Why? Motivating The API") [1]:
... the separation of fork() and exec() is essential in building a UNIX shell,
because it lets the shell run code after the call to fork() but before the call
to exec(); this code can alter the environment of the about-to-be-run program,
and thus enables a variety of interesting features to be readily built.
...
The separation of fork() and exec() allows the shell to do a whole bunch of
useful things rather easily. For example:
prompt> wc p3.c > newfile.txt
In the example above, the output of the program wc is redirected into the output
file newfile.txt (the greater-than sign is how said redirection is indicated).
The way the shell accomplishes this task is quite simple: when the child is
created, before calling exec(), the shell closes standard output and opens the
file newfile.txt. By doing so, any output from the soon-to-be-running program wc
are sent to the file instead of the screen.
As an explanation it doesn't make much sense, because there are other ways to alter the environment of the about-to-be-run program (see any non-Unix OS for examples).
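For reference, the redirection described in the quoted passage can be sketched in Python roughly as follows. The helper name `run_redirected` is hypothetical, and the `dup2` fallback is an addition for the case where descriptor 0 happens to be closed, which the textbook description does not cover.

```python
import os


def run_redirected(argv, out_path):
    """Run argv with stdout redirected to out_path, the way a shell handles '>'."""
    pid = os.fork()
    if pid == 0:
        try:
            os.close(1)  # free descriptor 1 (standard output)
            # open() returns the lowest free descriptor, normally 1 at this point
            fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            if fd != 1:
                os.dup2(fd, 1)
            # Python opens files close-on-exec; undo that for the new stdout
            os.set_inheritable(1, True)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if setup or exec failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Calling `run_redirected(["wc", "p3.c"], "newfile.txt")` would correspond to the `wc p3.c > newfile.txt` line in the quote, assuming those files exist.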
Because "fork" was easy to implement in UNIX on the PDP-11.
The original implementation was for a machine with very limited memory. So fork worked by swapping out the process. But then, instead of releasing the in-memory copy, the kernel duplicated the process table entry. So there were now two copies of the process, one in memory and one swapped out. Both were runnable, even if there wasn't enough memory for both to fit at once. Both executed onward from there.
And that's why "fork" exists. It was a cram job to fit in a machine with a small address space.
# function1 and function2 are shell functions
$ function1 | grep foo | function2
Here, the shell must fork a process (without exec) to run one of these functions.
For instance function1 might run in a fork, the grep is a fork and exec of course, and function2 could be in the shell's primary process.
In the POSIX shell language, fork is so tightly integrated that you can access it just by parenthesizing commands:
$ (cd /path/to/whatever; command) && other command
Everything in the parentheses is a sub-process; the effect of the cd, and any variable assignments, are lost (whether exported to the environment or not).
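The subshell behaviour can be imitated in a few lines of Python, as a sketch of what `(cd /path; pwd)` amounts to: a fork with no exec, where the directory change dies with the child. The function name `cwd_inside_subshell` is invented for the example.

```python
import os


def cwd_inside_subshell(path):
    """Mimic `(cd path; pwd)`: the chdir happens only in the forked child."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            os.chdir(path)                          # affects the child only
            os.write(write_end, os.getcwd().encode())
        finally:
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        child_cwd = pipe.read()
    os.waitpid(pid, 0)
    return child_cwd
```

After `cwd_inside_subshell("/")` returns, `os.getcwd()` in the caller is unchanged, which is the same loss of state the parenthesized shell commands show.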
In Lisp terms, fork makes everything dynamically scoped, and rebinds it in the child's context: except for inherited resources like signal handlers and file descriptors.
Imagine every memory location having *earmuffs* like a defvar, and being bound to its current value by a giant let, and imagine that being blindingly efficient to do thanks to VM hardware.
I use fork a lot in my Python science programs. It's really great - you can stick it in a loop and get immediate parallelism. It's much better than multiprocessing, etc, as you keep the state from just before the fork happened, so you can share huge data structures between the processes, without having to process the same data again or duplicate them. I've even written a module for processing things in forked processes: https://pypi.org/project/forkqueue/
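A small sketch of that pattern, assuming a plain list stands in for the large data structure; `parallel_variants` is a made-up name and is not taken from the forkqueue module. Each child sees the parent's data as of the fork, changes one element privately, and sends back a single number over a pipe.

```python
import os
import struct


def parallel_variants(model, deltas):
    """Fork one child per delta; each child tweaks its own copy-on-write view."""
    children = []
    for index, delta in enumerate(deltas):
        read_end, write_end = os.pipe()
        pid = os.fork()
        if pid == 0:
            try:
                os.close(read_end)
                model[index] += delta  # visible only inside this child
                os.write(write_end, struct.pack("d", float(sum(model))))
            finally:
                os._exit(0)
        os.close(write_end)
        children.append((pid, read_end))

    results = []
    for pid, read_end in children:
        results.append(struct.unpack("d", os.read(read_end, 8))[0])
        os.close(read_end)
        os.waitpid(pid, 0)
    return results
```

One caveat worth keeping in mind: CPython updates reference counts inside objects even on reads, so pages get copied more often than the same pattern would cause in C, and the memory saving is smaller than the ideal.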
Splitting fork and exec allows you to do stuff before calling exec, for example redirecting file descriptors (like stdin/out/err), creating a pipe, modifying the child's environment, and so on.
There are so many variations to what you can do with fork+exec that designing a suitable "fork-exec combo" API is really difficult, so any attempts tend to yield a fairly limited API or a very difficult-to-use API, and that ends up being very limiting to its consumers.
On the flip side, fork()+exec() made early Unix development very easy by... avoiding the need to design and implement a complex spawn API in kernel-land.
Nowadays there are spawn APIs. On Unix that would be posix_spawn().
> And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
(Not a stupid question.)
You'd use vfork() only to finish setting up the child side before it execs, and the reason you'd use vfork() instead of fork() is that vfork()'s semantics permit a very high performance implementation while fork()'s semantics necessarily preclude a high performance implementation altogether.
I think it's actually a pretty useful primitive for doing multiprocessing. Unlike threading, you have a completely separate memory space both for avoiding data races and performance (memory allocators still aren't perfect and weird stuff can happen with cache lines). Unlike exec after fork or anything equivalent, you still get to share things like file descriptors and read only memory for convenience.
> Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork).
If you eliminate fork, then what do you do for those 1% of cases where you actually do need it? I agree that it's uncommon, but I have written code before that calls fork() but then does not exec().
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(3).
> And why has it not replaced fork for 99% of use cases?
Even though it's been around for about 20 years, it's still newer than fork+exec, so I assume a) many people just don't know about it, or b) people still want to go for maximum compatibility with old systems that may not have it, even if that's a little silly.
Lacking fork(), if you want to multi-process a service, you have to spawn (vfork()+exec() or posix_spawn(), or whatever) the processes and arrange for them to get whatever state and resources they need to start up. It's a pain, but I've done it.
You might want to move around some file descriptors if you don't want the child process to inherit your stdin/stdout/stderr (e.g. if you want to read the stdout of the process you launched, or give it some stdin).
And there does exist such a fork-exec combo - posix_spawn. It allows adding some "commands" of what file descriptor operations to do between the fork & exec before they're ever done, among some other things. But, as the article mentions, using it is annoying - you have to invoke various posix_spawn_file_actions_* functions, instead of the regular C functions you'd use.
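To give a feel for the "commands" style of interface, here is a sketch using Python's `os.posix_spawnp`, which wraps posix_spawnp(3) and takes the file actions as a list of tuples instead of posix_spawn_file_actions_* calls. The helper name `spawn_capture` is hypothetical.

```python
import os


def spawn_capture(argv):
    """Run argv via posix_spawnp with its stdout connected to a pipe."""
    read_end, write_end = os.pipe()
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_end, 1),  # child: the pipe becomes stdout
        (os.POSIX_SPAWN_CLOSE, read_end),     # child: drop the read end
    ]
    pid = os.posix_spawnp(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), output
```

The descriptor shuffling is the same as in a fork/exec version, but it is declared up front as data and performed by the library, so no user code runs in the child.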
The whole idea of fork is strange - the design pattern of "child process is executing exactly where the parent process is executing" is foreign to me. Don't we want to direct where the child process is executing? Like, when creating a thread? Why is fork() so conceptually orthogonal to that? Is there a good reason? A historical reason?
I don't find fork() to be obvious or useful or natural. I work hard to never do it.
Oh I understand how it works. I implemented it, in the first POSIX implementation. I just don't get how anybody wants to do that.
Yes, there's the example right there. But it shows the awkwardness immediately - decoding what the f happened by checking a side effect (is pid == 0? wtf?)
How about spoon(handle_connection, ...) or something like that? See how much better?
If you want the child to start executing some other code but you have fork(), it's easy to do it yourself by calling that function.
But on the other hand, if you do want the child to execute code at the same place as the parent, but a hypothetical fork() asks you to provide a function pointer, it would be a bit more complicated.
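The first direction, building a "run this function in a child" call on top of fork(), might look like the following Python sketch. `spawn_fn` and `handle_connection` are illustrative names in the spirit of the `spoon(handle_connection, ...)` suggestion, not an existing API.

```python
import os


def spawn_fn(func, *args):
    """Run func(*args) in a forked child and return the child's pid to the caller."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            func(*args)
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    return pid


def handle_connection(fd, message):
    # Stand-in for real per-connection work: write a reply to the descriptor.
    os.write(fd, message)
```

A caller would do something like `pid = spawn_fn(handle_connection, fd, b"hello")` followed by `os.waitpid(pid, 0)`. The `pid == 0` check still exists, but it is written once inside the helper rather than at every call site.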
It's a leaky abstraction and everything it does can be done manually, and possibly better. It exists purely because, at some point in the past, threads didn't exist.
If you design your program without fork, you'll probably end up with a cleaner and faster solution. Some things are best forgotten or never learned in the first place.
The beauty of (v)fork(+exec) is that it doesn't need a new interface for configuring the environment in whichever way you want before the other process starts. Instead you get to use the exact same means of modifying the environment to your needs, and once it's done, you can call exec and the new process inherits those things.
I mean, just look at the interface of posix_spawn.
I grant though that this isn't without its problems (including performance) and IMO e.g. FD_CLOEXEC is one example of how those problems can be patched up. It's like the reverse problem: you have too wide implicit interface in it, and then you need to come up with all these ways to be explicit about some things.
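For reference, this is roughly what the FD_CLOEXEC patch-up looks like in practice. The helper names are hypothetical; the `fcntl()` calls are standard POSIX.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark fd close-on-exec so it does not leak into exec'd programs.
 * Returns 0 on success, -1 on failure. */
int set_cloexec(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Report whether fd currently has the close-on-exec flag set. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    return flags != -1 && (flags & FD_CLOEXEC) != 0;
}
```

In a multi-threaded program there is a window between creating a descriptor and calling `set_cloexec()` during which another thread could fork and exec, which is why flags such as `O_CLOEXEC` on `open()` exist to set it atomically at creation time.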
Add to that, fork is (was) very inefficient. You had to duplicate the entire process state (page tables etc). Then the damn program would exec(), and you would tear it all down again. Took 100ms on older computers. Complete waste.
We would resort to making a weak copy, with page tables faulting in only if you used them. A lot of drama, so the user could make a goofy call that they didn't really want most of the time.
Another option is to allow the parent to create an empty child process, and then make arbitrary system calls and execute code in the child, like a debugger does. In most cases the last "remote system call" would be exec.
One use case for fork()--which is used extensively on Android--is to build an expensive template process that can then be replicated for later work, which is exactly what people often want for the behavior with virtual machines. I wrote an article on the history of linking and loading optimizations leading up to how Android handles their "zygote" which touches on this behavior.
We had the case that some library we were using (OpenBLAS) used pthread_atfork. Unfortunately, the atfork handler behaved buggy in certain situations involving multiple threads and caused a crash. This was annoying because we basically did not need fork at all but just fork+exec (for various other libraries spawning sub processes), where those atfork handlers would not be relevant.
Our solution was to override pthread_atfork to ignore any functions, and in case this is not enough, also fork itself to just directly do the syscall without calling the atfork handlers.
posix_spawn() shouldn't call atfork handlers. It's allowed to call them or not call them because implementors can use fork(), which must call them, or they can use vfork(), which must not call them -- or they can make posix_spawn() a proper system call, too, or they can use clone(), or my putative avfork(), or whatever.
If you used vfork(), you wouldn't have had this problem.
Fork-safety issues arise mainly because of the sharing of resources between the parent and child. pthread_atfork() exists mainly to allow libraries to add a measure of fork-safety by letting them disable things on the child-side of fork() or re-set-up things on the child-side of fork(). For example, a PKCS#11 provider might need to create a new connection to the tokens and re-C_Login() to them (except, since it really can't quite do that, most likely it must render every session inoperable on the child-side). (Indeed, PKCS#11 specifically mandates that on the child-side of fork all sessions must be dead and must not be used.)
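A minimal sketch of that "render every session inoperable on the child side" approach, using a made-up library with a single flag standing in for real session state. All `lib_*` names are hypothetical; `pthread_atfork()` is the standard POSIX call.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int session_valid = 1; /* hypothetical library state */
static int registered = 0;

/* Runs in the child right after fork(): invalidate inherited sessions. */
static void child_after_fork(void)
{
    session_valid = 0;
}

/* Register the child-side handler; call once at library init. */
int lib_init(void)
{
    assert(!registered); /* registering twice would run the handler twice */
    registered = 1;
    return pthread_atfork(NULL, NULL, child_after_fork);
}

int lib_session_usable(void)
{
    return session_valid;
}

/* Fork a child that reports lib_session_usable() through its exit status.
 * Returns 1 or 0 as seen by the child, or -1 on error. */
int session_usable_in_child(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(lib_session_usable() ? 1 : 0);
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

After `lib_init()`, the parent still sees a usable session while a forked child sees it as dead. The handler only has a chance to run when the child is created through fork().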
The good/evil/etc. here seem to be defined exclusively around "performance above all else", and - more specifically - performant primitives over performant application architecture.
It strikes me that performance gains associated with sharing address space & stack are similar to many performance gains: trade-offs. So calling them "good" and "evil" when performance is seemingly your sole goal and interest seems a bit forward.
In my world we often say things like "X is the moral equivalent of Y" where X and Y are just technologies and, clearly, are morally-neutral things.
Why do we do this? Well, because it adds emphasis, and a dash of humor.
Clearly fork() is neither Good nor Evil. It's morally neutral. It has no moral value whatsoever. But to say "fork() is evil" is to cause the audience to raise their eyebrows -"what, why would you say fork() is evil?!"- and maybe pay attention.
Yes, there is the risk that the audience might react dismissively because fork() obviously is morally-neutral, so any claim that it is "evil" must be vacuous or hyperbolic. It's a risk I chose to take.
Really, it's a rhetorical device. I think it's pretty standard. I didn't create that device myself -- I've seen it used before and I liked it.
Morally-neutral does not equate to neutral insofar as I think most technologists consider some tech to be "good" and some to be "bad" in a practical sense.
"Good -vs- evil" is obviously hyperbolic - particularly the latter - but outside of morals they still imply a tendency to be technically/practically good or bad in an objective sense. So discounting it as a mere rhetorical device seems overly dismissive.
Fork() is the second worst idea in programming, behind null pointers. Fork() is the reason overcommit exists, which is the reason my web browser crashes if I open too many tabs, and the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library. It's a clear example of "worse is worse", and we should have switched to the Microsoft Windows model decades ago.
Here's a paper from Microsoft Research supporting this point of view:
> the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library
Linux overcommitment is often cited as an argument for the "panic on OOM" design of the allocating parts of the Rust standard library, and it's an important part of the story. But I think even if the Linux defaults were different, Rust would still have gone with the same design. For example, here's Herb Sutter (who works for Microsoft) arguing that C++ would benefit from aborting on allocation failure: https://youtu.be/ARYP83yNAWk?t=3510. The argument is that the vast majority of allocations in the vast majority of programs don't have any reasonable options for handling an alloc failure besides aborting. For languages like C++ and Rust, which want to support large, high-level applications in addition to low-level stuff, making programmers litter their code with explicit aborts next to every allocation would be really painful.
I think it's very interesting that Zig has gone the opposite direction. It could be that writing big applications with lots of allocs ends up feeling cumbersome in Zig, or it could be that they bend the curve. Fingers crossed.
Why is overcommit a problem? A program is unlikely to use all the memory that it allocates, or it may use it only at a later time. It would be a waste not to have overcommit: a ton of RAM would never get used, because a lot of programs allocate more RAM than they will probably ever need. And it would be inefficient, costly and error-prone to use dynamic memory allocation for everything.
The cause of your browser crash is not overcommit; it is simply the fact that you don't have enough memory. If you disable overcommit (something you can do on Linux), you would get the same crash earlier, before you had allocated (not necessarily used) 100% of your RAM (because really no software handles the dynamic memory failure condition, i.e. malloc returning null, which you can't handle reasonably).
Null pointers are not a mistake, how do you signal the absence of a value otherwise? How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)? null makes a perfect sense to be used as a value to signal "this pointer doesn't point to something valid".
Microsoft saying that fork() was a mistake... well, of course, because Windows doesn't have it. fork was a good idea and that is the reason why it's still used these days. Of course there have been evolutions since: in Linux there is the clone system call (fork is deprecated and still there for compatibility reasons; the glibc fork is implemented with the clone system call). But the concept of creating a process by cloning the resources of the parent is something that has always seemed very elegant to me.
In reality fork is something that (if I remember correctly, I don't have that much experience programming on Windows) doesn't exist on Windows, and the only way to create a new process of the same program is to launch the executable and pass the parameters on the command line, which is not great for efficiency and can also have its problems (for example, the executable was deleted, renamed, etc. while the program was running). Also, Windows has no concept of exec either, though I think it can be emulated in software (while fork can't).
To me it makes perfect sense to separate the concept of creating a new process (fork/clone) and loading an executable from disk (exec). It gives a lot of flexibility, at a cost that is not that high (and there are alternatives to avoid it, such as vfork or variations of the clone system call, or directly higher level API such as posix_spawn).
I think much of the confusion around nulls stems from the fact that in mainstream languages pointers are overloaded for two purposes: for passing values by reference, and for optionality.
Nearly every pointer bug is caused by the programmer wanting one of these two properties, and not considering the consequences of the other.
Non-nullable references and pass-by-value optionals can replace many usages of pointers.
>How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)?
Rust does this with the Result and Option "enums", which are internally implemented as tagged unions. From my understanding the only overhead with this implementation is the size taken by the tag and then any padding required for alignment.
It also helps that references in Rust are not nullable and working with pointers is fairly rare, so the type system can do a lot of heavy lifting for you rather than putting null checks all over the place. When you have &T you never have to worry about handling null in the first place!
The inventor, Tony Hoare, famously called them his "billion-dollar mistake". The better way to do it is with nullable types (which could internally represent null as 0 as a performance optimization). This is something Rust gets right.
Windows doesn't have fork as you know it. It has a POSIX-ish fork-alike for compliance, but under the hood it's CreateThread[0] with some Magic.
In Windows, you create the thread with CreateThread, then are passed back a handle to that thread. You can then query the state of the thread using GetExitCodeThread[1] or, if you need to wait for the thread to finish, you call WaitForSingleObject[2] with an infinite timeout.
Aside: WaitForSingleObject is how you track a bunch of stuff: semaphores, mutexes, processes, events, timers, etc.
The flipside of this is that Windows processes are buckets of handles: a Process object maintains a series of handles to (threads, files, sockets, WMI meters, etc), one of which happens to be the main thread. Once the main thread exits, the system goes back and cleans up (as it can) the rest of the threads. This is why sometimes you can get zombie'd processes holding onto a stuck thread.
This is also how it's a very cheap operation to interrogate what's going on in a process ala Process Explorer.
If I had to describe the difference between Windows and Linux at a process model level, I have to back up to the fundamental difference between the Linux and Windows programming models: Linux is a kernel that has to hide its inner workings for its safety and security, passing wrapped versions of structures back and forth through the kernel-userspace boundary; Windows is a kernel that considers each portion of its core separated, isolated through ACLs, and where a handle to something can be passed around without worry. The Windows ABI has been so fundamentally stable over 30 years now because so much of it is built around controlling object handles (which are allowed to change under the hood) rather than manipulation of kernel primitives through syscalls.
Early WinNT was very restrictive and eased up a bit as development continued so that win9x software would run on it under the VDM. Since then, most windows software insecurities are the result of people making assumptions about what will or won't happen with a particular object's ACL.
There's a great overview of windows programming over at [3]. It covers primarily Win32, but gets into the NT kernel primitives and how it works.
A lot of work has gone into making Windows an object-oriented kernel; where Linux has been looking at C11 as a "next step" and considering if Rust makes sense as a kernel component, Windows likely has leftovers of Midori and Singularity [4] lingering in it that have gone on to be used for core functionality where it makes sense.
Overcommits exist any time you can have a debugger anyways.
fork() was a brilliant way to make Unix development easy in the 70s: it made it trivial to move a lot of development activity out of the kernel and into user-land.
But with it came problems that only became apparent much later.
unpopular opinion: null pointers (in at least java and c) are the single greatest metaphor in software development, and are the CS analog to the invention of zero
There was an article about exceptions the other day that lamented that exceptions are high latency because the exceptional path will be paged out. I would assume overcommit is to blame for that too.
That's probably a caching issue, and caching issues are a fact of life for the foreseeable future. (Could also be a disk swap issue, but probably not.)
"I won't bother explaining what fork(2) is -- if you're reading this, I assume you know." If that applied to everything I looked at from HN I'd read precious little.
I didn't write it for HN. It wasn't a paper to publish in some Computer Science journal. It was just a github gist. If you don't get the subject, it's not for you. I might well write a paper now based on it, and then it might be a good read for you, but I still won't be writing it for you, but for people who are interested in the topic. The intended audience is small, expert on the matter, and probably even more opinionated than I am.
I found the article well written and informative even though it's not my area of expertise. I intended my comment as a light-hearted reflection of the fact that a lot of articles on HN go over my head but are still worth a read to me, just like your article.
For those saying to use posix_spawn: What am I supposed to make of the writeup in the posix_spawn manpage though?
"...specified by POSIX to provide a standardized method of creating new processes on machines that lack the capability to support the fork(2) system call. These machines are generally small, embedded systems lacking MMU support"
Is this why no one uses it? It has this gratuitous opinion piece at the beginning that makes people think it's just for embedded systems and my dad's Amiga?
That's just some injected opinion, I assume from someone contributing to glibc who doesn't like posix_spawn. In any case it is wrong.
Don't assume what is written in man pages is the truth. Some of them have a lot of opinion added. It can be useful to cross-check man pages between systems - they don't always call out non-portable options or behavior.
On some kernels posix_spawn is a syscall or specifies flags that make it more efficient than fork+exec. Darwin is one such system, though you can use POSIX_SPAWN_SETEXEC if you still want to replace the current process with a new executable rather than creating a child.
Hah, that's pretty funny. Regardless of the motivation as written, the motivation I surmise is:
- some systems (e.g., Windows) lack fork() for various reasons
- vfork() is baaaad
- I know, let's do something like WIN32's spawn() or CreateProcess(), but, like, better
The middle item I have good reason to think is very likely. vfork() still has a bad rap from that old "vfork() Considered Dangerous" paper. That paper circulated a lot way back when, and was the reason vfork() was removed from some Unixes for a while (well, it was left as an alias of fork()) before it was eventually re-added. The Open Group participants would have been very aware of that paper, and that is almost certainly the reason that POSIX says about vfork():
> Conforming applications are recommended not to depend on vfork(), but to use fork() instead. The vfork() function may be withdrawn in a future version.
So if fork() can't perform well, and the committee won't recommend the use of vfork(), what shall the committee do? Answer: design and specify posix_spawn(). It's not an unreasonable answer. Though, IMO of course, they should have un-obsoleted vfork().
Meta comment: Github Gist seems to be great for blogging. Yeah, the UI is not very blog-specific, but it has all the useful features, and then some: markdown, comments, hosting, an index of all posts, some measure of popularity (stars), a very detailed edit history, etc.
All without having to pay or setup anything yourself.
Unfortunately, there's no way to turn off comments on a Gist, which makes it not a viable replacement for anyone who doesn't want to spend a lot of time processing and moderating comments.
Good point. However, you need a GitHub account to post comments so everyone knows who you are. Your reputation might suffer if you constantly post comments that require moderation.
This avfork implementation is poor. You don't want to make your single threaded programs multi-threaded. I don't really get the big benefit of afork over other existing mechanisms other than handwaving about things being evil.
Also,
> Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via libsocket on top of STREAMS proved to be a very long and costly mistake. Emulating one API from another API with impedance mismatches is difficult at best.
Linux does have a thread creation system call. It's clone(2). It literally creates new threads of execution with various properties. It does not "emulate" threads, it is threads.
You do, but it's not a good implementation for a general API is all I was trying to say.
Do you really need an "asynchronous process creation" call? The rationale is that "blocking is bad", but a thread creation system call blocks the caller too until the thread is created. So it's not just "blocking", it's the amount of blocking if anything. Is posix_spawn or vfork+exec really too slow for your case?
Then multi-process and multi-threading seems like a reasonable solution. Asynchronous system calls are the exception not the rule in unix. So it wouldn't make sense as a traditional afork(2) system call. You could probably do a posix_spawn for io_uring, but do you really need to?
- @famzah's blog about fork vs vfork vs clone performance:
https://blog.famzah.net/tag/fork-vfork-popen-clone-performance/
- A very similar idea to my afork() idea, from 2 years earlier:
https://developers.redhat.com/blog/2015/08/19/launching-helper-process-under-memory-and-latency-constraints-pthread_create-and-vfork
- misc
https://inbox.vuxu.org/tuhs/CAEoi9W6HFL3UcnWkKoqka8Dt16MWskKd6yEJr3HYCcCT9pMTig@mail.gmail.com/T/
https://bugzilla.redhat.com/show_bug.cgi?id=682922 (see attachments)
The intent of fork() is to start a new process in its own address space. That *fork() variations that run in the SAME address space are confusing. A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages. But generally fork() is very specific from my recollection.
> The intent of fork() is to start a new process in its own address space.
True!
> That *fork() variations that run in the SAME address space are confusing.
Why is it confusing? They are distinct and different system calls, with different semantics. They are also sufficiently similar that they are also similarly named. But there's nothing confusing about their semantics. vfork() is not harder to use than fork() -- it's just subtly different.
> A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages.
I wouldn't expect that. Sandboxing is a large and complex topic.
Amusingly vfork semantics differ across OSes. This program prints 42 in Linux but 1 on Mac: https://godbolt.org/z/jn7Gaf5Me because on Linux they share address space.
Unfortunately there was this paper from the 80s titled "vfork() Considered Dangerous", which led to BSDs removing vfork(), and then later it was re-added because that paper was clearly quite wrong. But the news hasn't quite filtered through to Apple, I guess.
I am pretty sure Mac OS doesn't COW fork(), and that the address space is copied. At least it was the last time I looked. FreeBSD and Linux both seem to COW.
My (very possibly wrong) understanding is that xnu does CoW fork but doesn't overcommit, meaning that memory must be reserved (perhaps in swap) in case the pages need to be duplicated.
There are other complications relating to inheriting Mach ports and the mach_task <-> BSD process "duality" in xnu, which Linux doesn't have. I'd love for someone to chime in who knows more about how this stuff works.
I started with DOS, where spawn() is the norm, so I've always considered the fork()-like behaviour to be unusual yet handy for certain use-cases. Perhaps a system call that offers a combination of the two behaviours should be named spork().
- vfork() is O(1)
- copying fork() is O(N) where N is the amount of writable memory in the parent's address space
- copy-on-write fork() is O(N) where N is the resident set size (RSS) of the parent
O(1) beats O(N).
And O(N) is just the complexity of fork() for a single-threaded parent process. Now imagine a very busy, threaded, large-RSS process that forks a lot. You get threads and child processes stepping all over each other's CoW mappings, causing lots of page faults and copies. Ok, that is still O(N), but users will feel the added pain of all those page faults and TLB shootdowns.
Ok but you're just repeating "It's inefficient" and not saying in any way for what use is its inefficiency even noticeable. I want to reason about when I would care. You see?
The first link didn't even have units on its numbers(!) I assume they're milliseconds. When does that scale become something one would care about at all? Not launching a GUI process. Not a shell pipeline. So when is this issue arising at all? What is being done that makes fork inefficiency anything other than of academic interest? Must be something, right? Forking webserver?
It's inherently inefficient because while the child process does its initialization (pre-exec) stuff, the parent gets page faults for every thread writing into the memory due to COW. This will basically stall the parent and can cause funny issues.
In another comment, I observe how Go doesn't even have a binding to fork.
Erlang is another example of that. There is no standard library binding to the fork function. If someone were to bash one into a NIF, I have no idea what would happen to the resulting processes, but there's no good that can come of it. (To use Star Trek, think less good and evil Kirk and more "What we got back, didn't live long... fortunately.") Despite the terminology, all Erlang processes are green threads in a single OS process.
> Despite the terminology, all Erlang processes are green threads in a single OS process.
The main Erlang runtime uses an M:N Erlang:native process model, not an N:1. So Erlang processes are like green threads (they are called processes instead of threads because they are shared-nothing), but not in a single process.
I mentioned this somewhere else but I thought Erlang does NOT share memory.
Doesn’t that make Erlang a bit unique? It was the ability to spawn a new process extremely fast AND also have memory isolation. This combination is what the OP was wanting to achieve.
My reference to “fast” was in the context of creating a new process due to the OP post talking about how long fork/etc can take. Not in reference to executing code itself.
The problem is clone is more of a start phase after vfork but before fork regardless for github. So it's kind of a bit strange that we call vfork first but that is about templates too.
As for templates they need to be in different languages and in different formats for video games consoles, and so many other formats they port systems and games that sort of work digitally to certain things but not playable to certain things too.
The other problem is that clone is part of syscall interfaces and part of apis and part of a lot of other things too.
It's a rhetorical device. I didn't expect this to -years later- become a front-page item on HN. I wrote that to share with certain people.
And yes, clone() has some real problems, and if calling it "stupid" pisses off some people, but maybe also leads others to want to improve clone() or create a better alternative, then that's fine. If I'd wanted to write an alternative to Linux I'd probably have had to deal with the very, very fine language that Linus and others use on the Linux kernel mailing lists -- if you don't like my using the word "stupid", then you really shouldn't look there because you're likely to be very disappointed. Indeed, not only would I have to accept colorful language from reviewers there, I'd probably have to employ some such language myself.
TL;DR: clone() came from Linux, where "stupid" is the least colorful language you'll find, and me calling it "stupid" is just a rhetorical device.
The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
It seems like this 2019 paper covers this point, and the content in the gist? I was expecting to see a reference to it
A fork() in the road
https://news.ycombinator.com/item?id=19621799
Although it does say that vfork() is difficult to use safely, while the gist recommends it? I think there is still some clarity needed around the use cases.
Fork today is a convenient API for a single-threaded process with a small memory footprint and simple memory layout that requires fine-grained control over the execution environment of its children but does not need to be strongly isolated from them. In other words, a shell. It’s no surprise that the Unix shell was the first program to fork [69], nor that defenders of fork point to shells as the prime example of its elegance [4, 7]. However, most modern programs are not shells. Is it still a good idea to optimise the OS API for the shell’s convenience?
As u/amaranth pointed out, my gist predates the MSFT paper, which mostly explains why I didn't reference it. Though, to be fair, I saw that paper posted here back in 2019, and I commented on it plenty (13 comments) then. I could have edited my gist to reference it, and, really, probably should have. Sometime this week I will add a reference to it, as well as this and that HN post, since they are clearly germane and useful threads.
I vehemently disagree with those who say that vfork() is much more difficult to use correctly than fork(). Neither is particularly easy to use though. Both have issues to do with, e.g., signals. posix_spawn() is not exactly trivial to use, but it is easier to use it correctly than fork() or vfork(). And posix_spawn() is extensible -- it is not a dead end.
My main points are that vfork() has been unjustly vilified, fork() is really not good, vfork() is better than fork(), and we can do better than vfork(). That said, posix_spawn() is the better answer whenever it's applicable.
Note that the MSFT paper uncritically accepts the idea that vfork() is dangerous. I suspect that is because their focus was on the fork-is-terrible side of things. Their preference seems to be for spawn-type APIs, which is reasonable enough, so why bother with vfork() anyways, right? But here's the thing: Windows WSL can probably get a vfork() added easily enough, and replacing fork() with vfork() will generally be a much simpler change than replacing fork() with posix_spawn(), so I think there is value in vfork() for Microsoft.
Use cases for vfork() or afork()? Wherever you're using fork() today to then exec, vfork() will make that code more performant and it generally won't take too much effort to replace the call to fork() with vfork(). afork() is for apps that need to spawn lots of processes quickly -- these are rare apps, but uses for them do arise from time to time. But also, afork() should be easier to use safely than vfork(). And, again, for Microsoft there is value in vfork() as a smaller change to Linux apps so they can run well in WSL.
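To show how small the fork()-to-vfork() change usually is, here is a hedged sketch of a spawn-and-wait helper built on vfork()+execv(). The name `run_with_vfork` and the choice of `/bin/sh -c` are assumptions for illustration. The one rule to respect is that the child does nothing except exec or _exit, since it is borrowing the parent's address space.

```c
#define _DEFAULT_SOURCE /* glibc: keep vfork() declared under strict -std modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `/bin/sh -c cmd` via vfork()+execv().
 * Returns the command's exit status, or -1 on error. */
int run_with_vfork(const char *cmd)
{
    assert(cmd != NULL);
    char *const argv[] = { "sh", "-c", (char *)cmd, NULL };

    pid_t pid = vfork(); /* the only line that differs from a fork() version */
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: runs on the parent's memory while the parent is suspended,
         * so touch nothing and go straight to exec. */
        execv("/bin/sh", argv);
        _exit(127); /* exec failed; never call exit() or return here */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, `run_with_vfork("exit 3")` should return 3. Code that does more work between fork() and exec (descriptor shuffling, signal setup) needs a closer look before the same swap is safe.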
BTW, see @famzah's popen-noshell issue #11 [0] for a high-perf spawn use case. I linked it from my gist, and, in fact, the discussion there led directly to my writing that gist.
The gist seems to be from 2017 so it wouldn't have been able to reference that paper.
I've updated the gist to include that, this, and many other links.
I too could use some more clarity around the use cases
> "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells."
Yes, but why is this characterized as something negative?
Isn't that the entire point? Operating systems are there to serve user requests, and shells are an interface between user and OS.
Shells simply developed features that users required of them.
> Isn't that the entire point?
The exokernel people would disagree.
You see, an operating system as commonly conceived has at least two major jobs:
- abstract away underlying hardware
- safely multiplex resources
And do the above with as little overhead as possible.
Now the thing is: whenever you have multiple goals, you need to make trade-offs, and you aren't as good at any one goal as you could be.
So the exokernel folks made a suggestion in the 90s: let the OS concentrate on safely multiplexing resources, and do all the abstracting in user level libraries.
See eg https://www.classes.cs.uchicago.edu/archive/2019/winter/3310... or https://people.eecs.berkeley.edu/~kubitron/cs262/handouts/pa...
Normal application programming would mostly look the same as before, your libraries just do more of the heavy lifting. But it's much easier to swap out different libraries than it is to swap out kernel-level functionality.
That vision never caught on with mainstream OSes. But: widespread virtualisation made it possible. You can see hypervisors like Xen as exokernel OSes that do the bare minimum required to safely multiplex, but don't provide (many) abstractions.
Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
Meanwhile, programs with more complex requirements have to work around these APIs. And many programs call other programs, or otherwise have to do tricky process lifecycle management.
The lowest-level APIs should, in theory, cater to the most complex cases, not to the simplest ones. This doesn't prevent a simpler API from existing, but catering to a simple use case in the primitives does hinder more complex needs.
(I think the more nuanced point is that the OS itself might not have a much better design available in any case. Unixes have a lot of neat stuff, but it's a lot of "design by user feature request", and "standardize 4 slightly different ways of doing things", so there is a lot of weirdness and it's hard to have The Perfect API in that case)
> Yes, but why is this characterized as something negative?
Unfortunately, the text does not provide sufficient context. Shells are not properly supported in any OS (except probably Plan 9), since 1. the OS provides no enforcement or convention of CLI API interface (there is no enforced encoding standard or checkable stuff), 2. the OS provides no rules for file names to be shell-friendly and 3. there are no dedicated communication channels towards shells or in between programs and shells.
So all in all, shells remain a hack around the system that is "simple to implement the initials" and is annoying to use and write at many corner cases.
> Shells simply developed features that users required of them.
Cross out "simply" and call it convenience+arbitrary complex scripting glue for 4 main goals: 1. piping 2. basic text processing 3. basic job control 4. path hackery
Shells haven't been the primary interface between the user and the OS for decades.
That is the most glorious ** that i've read all day.
Larry Wall, creator of Perl, famously wrote that "It is easier to port a shell than a shell script."
https://en.wikipedia.org/wiki/Shell_script
So we can write operating systems easily if it's just an infinite superloop?
Can you elaborate your grievances in more detail than "your comment stinks"?
And would you consider that a good thing or a bad thing?
In Ninja, which needs to spawn a lot of subprocesses but is otherwise not especially large in memory and which doesn't use threads, we moved from fork to posix_spawn (which is the "I want fork+exec immediately, please do the smartest thing you can" wrapper) because it performed better on OS X and Solaris:
https://github.com/ninja-build/ninja/commit/89587196705f54af...
posix_spawn also outperforms fork on Linux under more recent glibc and musl, which can use vfork under the hood. https://twitter.com/ridiculous_fish/status/12328893907639336...
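For readers who have not used it, here is a minimal sketch of the posix_spawn route described above. It is not Ninja's code; the helper name `run_and_wait` is made up for illustration, and it passes no file actions or spawn attributes, which a real build tool would likely need.

```c
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: spawn argv[0] (looked up on PATH) and wait for it.
 * Returns the child's exit status, or -1 if spawning or waiting failed
 * or the child was killed by a signal. */
int run_and_wait(char *const argv[])
{
    pid_t pid;
    int status;

    /* posix_spawnp reports failure through its return value, not errno. */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The two NULL arguments are where `posix_spawn_file_actions_t` and `posix_spawnattr_t` would go if the child needed redirected descriptors or changed signal state.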
The issue with posix_spawn is that you can't close all descriptors before exec. This is especially an issue as most libraries are still unaware they need to open every single handle with the close-on-exec flag set.
Closing all descriptors is next to useless; you usually need to inherit at least standard in/out/error.
What you need is an operation like "close all descriptors >= N", as posix_spawn opcode.
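Lacking such an opcode, one portable workaround is to mark every descriptor at or above N close-on-exec in the parent before spawning. The sketch below is only an illustration: `set_cloexec_from` is a hypothetical name, the 65536 cap is an assumption to keep the scan bounded, and it does nothing about descriptors that other threads open concurrently.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: mark every open descriptor >= lowfd close-on-exec.
 * Returns the number of descriptors that were marked. */
int set_cloexec_from(int lowfd)
{
    long max = sysconf(_SC_OPEN_MAX);
    int marked = 0;

    /* Assumed cap: some systems report a huge or indeterminate limit. */
    if (max < 0 || max > 65536)
        max = 65536;

    for (int fd = lowfd; fd < max; fd++) {
        int flags = fcntl(fd, F_GETFD);
        if (flags == -1)
            continue;               /* not an open descriptor */
        if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == 0)
            marked++;
    }
    return marked;
}
```

Calling `set_cloexec_from(3)` before a spawn leaves standard in/out/error inheritable and hides everything else from the new program.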
Solaris/Illumos has an extension[0] for that.
> Long ago, I, like many Unix fans, thought that fork(2) and the fork-exec process spawning model were the greatest thing, and the Windows sucked for only having exec() and _spawn(), the last being a Windows-ism.
I appreciate this quite a bit. Vocal Unix proponents tend to believe that anything Unix does is automatically better than Windows, sometimes without even knowing what the Windows analogue is. Programming in both is necessary to have an informed opinion on this subject.
The one thing I miss most on Unix: the unified model of HANDLEs that enables you to WaitForMultipleObjects() with almost any system primitive you could want, such as an event with a socket (blocking I/O + a shutdown notification) in one call. On Unix, a flavor of select() tends to be the base primitive for waiting on things to happen, which means you end up writing adapter code for file descriptors to other resources, or need something like eventfd.
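As a rough illustration of the eventfd adapter mentioned above (Linux-specific), the sketch below waits on a data descriptor and a shutdown notification in one poll() call. The function names are hypothetical and error handling is reduced to a single -1 return.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Wait until either data_fd is readable or shutdown_fd (an eventfd) fires.
 * Returns 1 for shutdown, 0 for data, -1 on error or timeout. */
int wait_data_or_shutdown(int data_fd, int shutdown_fd, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = shutdown_fd, .events = POLLIN },
        { .fd = data_fd,     .events = POLLIN },
    };

    if (poll(fds, 2, timeout_ms) <= 0)
        return -1;
    if (fds[0].revents & POLLIN)
        return 1;                   /* shutdown takes priority over data */
    if (fds[1].revents & (POLLIN | POLLHUP))
        return 0;
    return -1;
}

/* Signal shutdown by adding 1 to the eventfd counter. */
int request_shutdown(int shutdown_fd)
{
    uint64_t one = 1;
    return write(shutdown_fd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}
```

The shutdown descriptor would be created once with `eventfd(0, EFD_CLOEXEC)` and shared between the waiting thread and whoever requests the shutdown.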
Things I don't miss from Windows at all: wchar_t everywhere. :)
WIN32 got a few things very right:
Many other things, Windows got wrong. But the above are far superior to what Unix has to offer.
How are SIDs the right thing?
Superficial silliness like allocating 48 bits to encode integers in [0,18] aside, what problem do structured SIDs actually solve? I’ve been trying to figure that out for the last couple of days and I still don’t get it, possibly because the Windows documentation doesn’t seem to actually say it anywhere.
I completely agree with having UUIDs or something in that vein for user and group IDs and will not dismiss IDs for sessions and such in the same namespace (although haven’t actually seen a use case for those), but structured variable-length SIDs as NT defines them just don’t make sense to me.
I’d add an I/O interface to the kernel that was built to be asynchronous from Day 0.
I'd be curious how many of those derive from NT's VMS roots - for instance:
http://lxmi.mi.infn.it/~calcolo/OpenVMS/ssb71/6346/6346p004....
These decisions here are all older than Windows and weren't in reaction to them. It's in reaction to the awful mainframe ways to spawn processes like using JCL.
We've sort of come back to that with kubernetes yaml files to describe how to launch an executable in a specific env and all of the resources it needs. Like it can be traced explicitly, the Borg paper references mainframes and knowingly calls the language that would be replaced by kubernetes's yaml files 'BCL' instead of z/OS's JCL.
Plan9 is a lot older than Kubernetes and has the same namespacing of all processes. So it's not impossible to have a "*nix like" OS that still has mainframe-like separation of concerns to ease deployment.
Having written server software that had to work in both places, I always loved the simplicity of fork(2) / vfork(2) relative to Windows CreateProcess. Threading models in Win32 were always a pain. Which only got worse with COM (remember apartment threading? rental threading? ugh)
Back in the 90's, processes had a smaller memory footprint, and every UNIX my software supported had COW optimizations. So the difference between fork(2) and vfork(2) was not very large in practice. Often, the TCP handshake behind the accept(2) call was of more concern than how long it would take fork(2) to complete. Of course, bandwidth has increased by a factor of 1000 since then, so considerations have changed.
It's how CreateProcess handles command-line arguments that infuriates me - not as an argv array but as one big string. It's so difficult to work around quoting.
The problem with WaitForMultipleObjects (WFMO) is that it's limited to 64 handles, which basically makes it useless for anything where the number of handles is dynamic as opposed to static. There are ways to get around this limitation by grouping handles into trees, but it's tremendously clunky.
UCS-2 seemed like a good(ish) idea at the time when Unicode's scope didn't include every possible human concept represented in icon form and UTF-8 hadn't yet been spec'd on a napkin by the first adults to bother thinking about the problem.
Even in 1989, it should have been clear that 16 bits were not enough to encode all of the Chinese characters, let alone encoding all the human scripts. Unicode today encodes 92,865 Chinese characters (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs).
The only reason anybody would think UCS-2 was a good idea was that they did not consult a single Chinese or Japanese scholar on Chinese characters.
Quite true. One of the things Windows got very wrong was UCS-2 and, later, UTF-16. So did JavaScript.
Is there any difference between Windows HANDLE and Linux file descriptor? Aren't they both just indexes into a table of objects managed by the kernel?
HANDLE values are opaque, and generally not reused. Imagine an implementation like this:
where `ptr` might be an index into a table (much like a file descriptor) or maybe a pointer in kernel-land (dangerous sounding!) and `verifier` is some sort of value that can be used by the kernel to validate the `ptr` before "dereferencing" it.
On Unix the semantics of file descriptors are dangerous. EBADF can be a symptom of a very dangerous bug where some thread closed a still-in-use FD then a open gets the same FD and now maybe you get file corruption. This particular type of bug doesn't happen with HANDLEs.
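To make the `ptr` plus `verifier` idea concrete, here is a purely hypothetical sketch of a handle table with a generation counter. It is not how Windows implements HANDLEs; all names and the fixed table size are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16

/* Hypothetical opaque handle: a table index plus a verifier. */
struct handle {
    uint32_t ptr;       /* index into the object table */
    uint32_t verifier;  /* generation the slot had when the handle was issued */
};

struct slot {
    void    *object;     /* NULL when the slot is free */
    uint32_t generation; /* bumped every time the slot is released */
};

static struct slot table[TABLE_SIZE];

/* Issue a handle for object; returns 0 on success, -1 if the table is full. */
int handle_open(void *object, struct handle *out)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i].object == NULL) {
            table[i].object = object;
            out->ptr = i;
            out->verifier = table[i].generation;
            return 0;
        }
    }
    return -1;
}

/* Resolve a handle; stale or forged handles yield NULL. */
void *handle_lookup(struct handle h)
{
    if (h.ptr >= TABLE_SIZE || table[h.ptr].object == NULL)
        return NULL;
    if (table[h.ptr].generation != h.verifier)
        return NULL;
    return table[h.ptr].object;
}

/* Release a handle; bumping the generation invalidates outstanding copies. */
int handle_close(struct handle h)
{
    if (handle_lookup(h) == NULL)
        return -1;
    table[h.ptr].object = NULL;
    table[h.ptr].generation++;
    return 0;
}
```

With this layout, a slot can be reused right after a close, yet a thread still holding the old handle gets a clean failure instead of silently operating on the new object, which is the contrast with a recycled file descriptor number.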
Isn't HANDLE basically fd?
FD has been gradually turned into HANDLE.
Well, I'm surprised to see this on the front page, let alone as #1. Ask me anything.
EDIT: Also, don't miss @NobodyXu's comment on my gist, and don't miss @NobodyXu's aspawn[1].
Since you said anything... This is not strictly related to the article but your expertise seems to be in the right area.
I have a process that executes actions for users, at the moment that process runs as root until it receives a token indicating an accepted user, then it fork()s and the fork changes to the UID of the user before executing the action.
Is there a better way? I hadn't actually heard of vfork() before reading this article. I'm guessing maybe you could do a threaded server model where each thread vfork()s. I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
If the parent is threaded, then yes, vfork() will be better. You could also use posix_spawn().
As to "becoming a user", that's a tough one. There are no standard tools for this on Unix. The most correct way to do it would be to use PAM in the child. See su(1) and sudo(1), and how they do it.
> I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
Yes, fork() only copies the calling thread. The other threads' stacks also get copied (because, well, you might have pointers into them, who knows), but there will only be one thread in the child process.
vfork() also creates only one thread in the child.
There used to be a forkall() on Solaris that created a child with copies of all the threads in the parent. That system call was a spectacularly bad idea that existed only to help daemonize: the parent would do everything to start the service, then it would forkall(), and on the parent side it would exit() (or maybe _exit()). That is, the idea is that the parent would not finish daemonizing (i.e., exit) until the child (or grandchild) was truly ready. However, there's no way to make forkall() remotely safe, and there's a much better way to achieve the same effect of not completing daemonization until the child (or grandchild) is fully ready.
In fact, the daemonization pattern of not exiting the parent until the child (or grandchild) is ready is very important, especially in the SMF / systemd world. I've implemented the correct pattern many times now, starting in 2005 when project Greenline (SMF) delivered into OS/Net. It's this: instead of calling daemon(), you need a function that calls pipe(), then fork() or vfork(); with fork(), the parent side then calls read() on the read end of the pipe, while the child side returns immediately so the child can do the rest of the setup work. Finally the child writes one byte into the write side of the pipe to tell the parent it's ready, so the parent can exit.
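A minimal sketch of that readiness pipe, using fork() only. The helper names and return codes are invented for illustration, and a real daemon would also call setsid(), redirect the standard descriptors, and retry read() on EINTR.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fork, and make the parent block until the child
 * reports readiness. Returns -1 on error. In the child it returns 0 and
 * stores the write end of the pipe in *ready_fd. In the parent it returns 1
 * if the child signalled readiness, or 2 if the child went away without
 * doing so (read() saw EOF). */
int fork_and_wait_ready(int *ready_fd)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);                /* child keeps only the write end */
        *ready_fd = p[1];
        return 0;
    }

    close(p[1]);                    /* parent keeps only the read end */
    char byte;
    ssize_t n = read(p[0], &byte, 1);
    close(p[0]);
    if (n < 0)
        return -1;
    return n == 1 ? 1 : 2;
}

/* Child side: call once all setup work is complete. */
int signal_ready(int ready_fd)
{
    char byte = 0;
    ssize_t n = write(ready_fd, &byte, 1);
    close(ready_fd);
    return n == 1 ? 0 : -1;
}
```

In the parent, a return of 1 means it is safe to exit with success, so the service manager only sees the daemon as started once the child really is ready.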
What about fork(2) for network servers? I've written parallel network servers two ways; open the socket to listen on and call fork() N times for the desired level of parallelism, and just create N processes and use SO_REUSEPORT. I prefer the former. I suppose there is hidden option C of "have a simple process that opens the listening port and then vfork/execs each worker" I find that to be a bit strange because the code will be split into "things that happen before listening on the port" (which includes, e.g. reading configuration files) and "things that happen after listening on the port" (which includes, e.g. reading configuration files)
No questions yet as I am yet to read ... but I can already comment and say grade A title.
It's a bit opinionated. It's meant to get a reaction, but also to have meaningful and thought-provoking content, and I think it's correct in the main too. Anyways, hope you and others enjoy it.
Hard disagree to most of this.
fork(2) makes a lot more sense when you realize its heritage. It came from a land before Unix supported full MMUs. In this model, to still have per process address spaces and preemptive multitasking on what was essentially a PC-DOS level of hardware, the kernel would checkpoint the memory for a process, slurp it all out to DECtape or some such, and load in the memory for whatever the scheduler wanted to run next. Its simplicity of being process checkpoint based wasn't a reaction to Windows-style calls (which wouldn't exist for almost a couple decades), but instead to mainframe process spawning abominations like JCL. The idea "you probably want most of what you have so force a checkpoint, copy the checkpoint into a new slot, and continue separately from both checkpoints" was soooo much better than JCL and its tomes of incantations to do just about anything.
vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec(). All of the bugs that causes are super fun to chase, lemme tell you. AFAIC, about the only valid use for vfork now is nommu systems where fork() is incredibly expensive compared to what is generally expected.
clone(2) is great. Start from a checkpoint like fork, but instead of semantically copying everything, optionally share or not based on a bitmask. Share a tgid, virtual address space, and FD table? You just made a thread. Share nothing? You just made a process. It's the most 'mechanism, not policy' way I've seen to do context creation outside of maybe the l4 variants and the exokernels. This isn't an old holdover, this is how threads work today, processes spawned that happen to share resources. Modern archs on linux don't even have a fork(2) syscall; it all happens through clone(2). Even vfork is clone set to share virtual address space and nothing else that fork wouldn't share. Namespaces are a way to opt into not sharing resources that normally fork would share.
And I don't see what afork gets you that clone doesn't, except afork isn't as general.
(This is a bit of a tangent, apologies.)
> fork(2) makes a lot more sense when you realize its heritage.
I think it only makes sense when you consider its heritage. It has ALL the wrong defaults for what it's almost always used for these days: running a subprocess.
It copies "random" kernel data structures like open FDs, etc. and you have to be very careful about closing the ones you don't want to be inherited, etc. etc. It may copy things that weren't even a relevant concept when you wrote your program.
The correct thing to do is to be very explicit about what you want to pass on to the subprocess and to choose safe defaults for programs compiled against the old API when you extend it. (Off the top of my head, the only thing I'd want to be automatically inherited by default would be the environment and CWD.)
It's 100% the wrong API for spawning processes.
Now, I don't think afork() solves any of these problems, AFAICT. But my personal perspective is that fork() and its derivatives are the wrong starting point in the first place for what they are used for in 99% of all cases.
The behaviour of subprocesses inheriting resources like file descriptors is absolutely bizarre. Why on earth would you want this to be the default?! But we're so used to it, we think it's normal.
Practically, this is the struct you have to fill in if you don't use clone or fork.
https://github.com/torvalds/linux/blob/719fce7539cd3e186598e...
IMO clone looks a lot better than screwing with that giant struct and all of the kernel bugs that would exist from validating every goofy way those options could be setup wrong by user space.
afork() could do some things differently. The point of afork() is to be able to spawn child processes (that will exec-or-_exit) faster.
The PDP-11 had segment base registers and memory protection, so it wasn't necessary to swap out one process to run another one at the same (virtual) address. It didn't have paging, so it couldn't swap out part of a segment. I think it's true that PDP-11 fork() would stop the process to make a copy of the writable segments, but it didn't have to "checkpoint" the process to a disk or tape. Are you talking about the PDP-7? I don't know anything about the PDP-7.
I agree about vfork(), since I haven't seen a system with segment base registers and no paging in a long time, and about clone(). Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
What's the L4 approach? Construct the state of the process you want to run in some memory and then use a launch-new-thread system call, then possibly relinquish access to that memory?
> Are you talking about the PDP-7?
Yes
> Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
clone was literally designed to support posix threads.
> What's the L4 approach?
Capabilities over all of the kernel objects so user space can do safe brain surgery on them. Since everything is capability based including the cap tables you end up duping a cap table, allocating a non running thread, setting registers, and attaching duped cap table. Four syscalls in the minimal case, but it's l4 so they're fairly cheap. Total disclosure, one of my side projects is a kernel with caps and a first class VM to do that in one syscall amortized.
> vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec().
What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you are the author of the code calling vfork() and you know not to do that, so why would that happen?
A: It just wouldn't happen.
And as to exec() failing, this is why a vfork() call must be followed, in the child, by a call to exec() and then, in case exec() fails, by _exit(), and this is true even if you use fork() instead of vfork(). I.e.:
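The snippet from the original comment is not preserved here; the following is a minimal sketch of the pattern being described, with a hypothetical wrapper name and an assumed exit code of 127 for exec failure.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] via vfork()+exec and wait for it.
 * Returns the exit status (127 if exec failed), or -1 on error. */
int spawn_wait(char *const argv[])
{
    pid_t pid = vfork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: it borrows the parent's address space, so it only execs. */
        execvp(argv[0], argv);
        _exit(127);     /* reached only if exec failed; never return or exit() */
    }

    /* Parent: resumes once the child has exec'ed or exited. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```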
How do you detect if the child exec'ed or exited? Well, you make a pipe before you vfork(), you set its ends to be O_CLOEXEC, then on the child side of vfork() you write one byte into it if the exec call fails. On the parent side you read from the pipe before you reap the child, and if you get EOF then you know the child exec'ed, and if you get one byte then you know the child exited. The one byte could be an errno value.
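A sketch of that pipe trick follows. Two deliberate departures from the comment, both assumptions of this sketch: it uses fork() rather than vfork(), so that the child's write() stays within what POSIX defines, and it sends the whole errno as an int rather than a single byte. The function name is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0]. Returns 0 and stores the pid if exec
 * succeeded; otherwise returns the errno value reported by the child
 * (or the errno from pipe()/fork() in the parent). */
int spawn_checked(char *const argv[], pid_t *pid_out)
{
    int p[2];
    if (pipe(p) == -1)
        return errno;
    /* Close-on-exec: a successful exec closes the child's write end. */
    fcntl(p[0], F_SETFD, FD_CLOEXEC);
    fcntl(p[1], F_SETFD, FD_CLOEXEC);

    pid_t pid = fork();
    if (pid == -1) {
        int e = errno;
        close(p[0]);
        close(p[1]);
        return e;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        int e = errno;                      /* exec failed: report why */
        ssize_t w = write(p[1], &e, sizeof e);
        (void)w;
        _exit(127);
    }

    close(p[1]);    /* parent must drop its write end or read() never sees EOF */
    int child_errno = 0;
    ssize_t n = read(p[0], &child_errno, sizeof child_errno);
    close(p[0]);

    if (n == 0) {                           /* EOF: the child exec'ed */
        *pid_out = pid;
        return 0;
    }
    waitpid(pid, NULL, 0);                  /* exec failed: reap the child */
    return n == (ssize_t)sizeof child_errno ? child_errno : EIO;
}
```

On success the caller still owns the child and has to reap it later with waitpid().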
No, really, what you say about vfork() is lore, and very very wrong.
That said, vfork() blocks a thread in the parent. The point of my gist was to explain why fork() sucks, why vfork() is much better, and what would be better still.
> And I don't see what afork gets you that clone doesn't, except afork isn't as general.
afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
> What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you're the author of the code calling vfork() and you know not to do that
Within a sentence you described the stack modification. 'It's not a footgun, just don't make mistakes' doesn't hold a lot of water with me.
> No, really, what you say about vfork() is lore, and very very wrong.
Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".
> afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
Ok, but I'm not going to hold it against clone for being a more general solution.
> clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
I agree with this, but there's practical reasons why this isn't the case, mainly around how asking user space for every little thing is expensive, and large sparse structs to copy into kernel space covering basically everything in struct task sounds like a special kind of security hell I would not want to be a part of.
A flag to clone to create an empty process and something like a bunch of io_uring calls or a box program to hydrate the new task state would be really neat, and has been kicked around a bunch. There's just a ton of corner cases that haven't been ironed out.
Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about forking. It's allowed to generate some non-obvious code so long as it acts "as if" it has the same behaviour - but only from the point of view of that theoretical C VM.
For example, in theory _exit(1) could be implemented as longjmp(...) up to a point in some compiler-created top-level function that wraps up main(). Then that wrapper function could perform some steps to communicate the return code to the OS that trashes the stack before actually exiting. After all, if the process is about to exit anyway, what difference does it make if a bunch of memory is fiddled with? We know the answer to this but, from the point of view of the C virtual machine, it's irrelevant.
That particular scenario is unlikely but the point is that compiler implementations and optimisations are allowed to do very non-obvious things. You're only safe if you stick to the rules of the C standard, which this 100% does not.
Stack manipulations are a real problem. Say some parameter to exec after vfork uses stack slots created by the compiler for temporary variables. Sure, you compute those before the call to vfork, but then the compiler applies code motion.
I'm still struggling to understand the point of vfork(). The whole point of fork is to offload work to a different part of your program so the original part can continue to do work. The entire idea fails if it halts the original program for the duration of the child's life. How is this better than just doing a regular function call?
vfork halts the parent until the child exits or calls exec, getting its own address space. In the normal case, you vfork and immediately exec, and the parent continues on with what it was doing. The time between vfork and exec is “special” in that the child is temporarily running in the parent’s address space, then it uses exec to separate and do its own thing.
I've seen an argument for immediately execing and not marking the whole mutable process VA space as 'trap on write', including the thread stack that you're about to immediately write to, if you're going to throw that work away and exec(). There's also 'I want to support cheap forks on a nommu system and vforking is easier to retrofit in'.
If you really think vfork() is hard to use because of the stack sharing, then avfork() should be good for you!
The code I currently work on actually has a use of `clone` with the `CLONE_VM` flag to create something that isn't a thread. Since `CLONE_VM` will share the entire address space with the child (you know, like a thread does!) a very reasonable response would be "WAT?!"
What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
We achieved this by using `CLONE_VM` (and a handful of other flags) to give the new "thread-like" entity access to the whole address space. But, we omitted `CLONE_THREAD`, as if we were making a new process. The new "thread-like" entity would not technically be part of the same thread group but would live in the same address space.
We also used two chained `clone()` calls (with the intermediate exiting, like when you daemonise) so that the new "thread-like" wouldn't be a child of the original process.
All this existed before I joined, it's just really cool that it works. I've never encountered such a non-standard use of clone before but it was the right tool for this particular job!
> What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
I'm curious to hear more. What's its purpose?
> I'm curious to hear more. What's its purpose?
Sure! I'll try to illustrate the general idea, though I'm taking liberties with a few of the details to keep things simple(r).
Our software (see https://undo.io) does record and replay (including the full set of Time Travel Debug stuff - executing backwards, etc) of Linux processes. Conceptually that's similar to `rr` (see https://rr-project.org/) - the differences probably aren't relevant here.
We're using `ptrace` as part of monitoring process behaviour (we also have in-process instrumentation). This reflects our origins in building a debugger - but it's also because `ptrace` is just very powerful for monitoring a process / thread. It is a very challenging API to work with, though.
One feature / quirk of `ptrace` is that you can't really do anything useful with a traced thread that's currently running - including peeking its memory. So if a program we're recording is just getting along with its day we can't just examine it whenever we want.
First choice is just to avoid messing with the process but sometimes we really do need to interact with it. We could just interrupt a thread, use `ptrace` to examine it, then start it up again. But there's a problem - in the corners of Linux kernel behaviour there's a risk that this will have a program-visible side effect. Specifically, you might cause a syscall restart not to happen.
So when we're recording a real process we need something that:
* acts like a thread in the process - so we can peek / poke its memory, etc via ptrace
* is always in a known, quiescent state - so that we can use ptrace on it whenever we want
* doesn't impact the behaviour of the process it's "in" - so we don't affect the process we're trying to record
* doesn't cause SIGCHLD to be sent to the process we're recording when it does stuff - so we don't affect the process we're trying to record
Our solution is double clone + magic flags. There are other points in the solution space (manage without, handle the syscall restarting problem, ...) but this seems to be a pretty good tradeoff.
[edit: fixed a typo]
Maybe some kind of snapshotting for an in-memory database?
This stuff is still all confused
Read http://catern.com/rsys21.pdf
What you want is:
1. create "embryonic" unscheduled process
2. Set it up from the parent process, it just lies on the operating table passively.
3. Submit it to the scheduler.
This is just....obviously correct. Totally flexible. Totally efficient. Hell, if you really want to fork anything, fork those embryonic processes which have no active threads! Much safer and easier to understand!
I did not write the paper above, but I did write
https://lore.kernel.org/lkml/f8457e20-c3cc-6e56-96a4-3090d7d...
https://lists.freebsd.org/archives/freebsd-arch/2022-January...
I hope I or someone else will have time to make it happen!
When I was first learning about UNIX and similar OSes I just assumed that this is how things worked because this is the obvious way of doing it. Why would you fork a process, then try to determine which of the two processes you are, then fix whatever the parent process messed up in your global state, and only then execute what you actually wanted to do? That seems insane (I guess until you realize that the main use case is creating /bin/sh).
Me too!
But even when writing /bin/sh, I don't see why this would get in the way? I was once told earlier Unix didn't even have fork, but something more purpose-made for shells instead.
Sounds a bit like Fuchsia's launchpad library, where you create a launchpad object, do all the setup, and then call launchpad_go to actually start the process. Launchpad doesn't allow arbitrary syscalls in the setup, so in that sense it is maybe closer to a "spawn" interface but with better ergonomics.
https://cs.opensource.google/fuchsia/fuchsia/+/main:zircon/s...
Yes, it is basically the same thing. Fuchsia has the capabilities mindset that would lead one here.
Yes, I like the larval process idea. No doubt it's good.
I was always disappointed by the performance of fork()/clone().
CompSci class told me it was a very cheap operation, because all the actual memory is copy-on-write, so it's a great way to do all kinds of things.
But the reality is that duplicating huge page tables and hundreds of file handles is very slow. Like tens of milliseconds slow for a big process.
And then the process runs slowly for a long time after that because every memory access ends up causing lots of faults and page copying.
I think my CompSci class lied to me... it might seem cheap and a neat thing to do, but the reality is there are very few usecases where it makes sense.
CS classes (and, far too often, professional programmers) talk about computers like they're just faster PDP-11s with fundamentally the same performance characteristics.
Agreed that these costs can be larger than is perhaps implied in compsci classes (though it's possible that they've changed their message since I took them!)
I suppose it is still essentially free for some common uses - e.g. if a shell uses `fork()` rather than one of the alternatives it's unlikely to have a very big address space, so it'll still be fast.
My experience has been that big processes - 100+GB - which are now pretty reasonable in size really do show some human-perceptible latency for forking. At least tens of milliseconds matches my experience (I wouldn't be surprised to see higher). This is really jarring when you're used to thinking of it as cost-free.
The slowdown afterwards, resulting from copy-on-write, is especially noticeable if (for instance) your process has a high memory dirtying rate. Simulators that rapidly write to a large array in memory are a good example here.
When you really need `fork()` semantics this could all still be acceptable - but I think some projects do ban the use of `fork()` within a program to avoid unexpected costs. If you really have a big process that needs to start workers I guess it might be worth having a small daemon specifically for doing that.
Right, shells are not threaded and they tend to have small resident set sizes. Even in shells though, there's no reason not to use vfork(), and if you have a tight loop over starting a bunch of child processes, you might as well use it. Though, in a shell, you do need fork() in order to trivially implement sub-shells.
fork() is most problematic for things like Java.
Also, mandating copy-on-write as an implementation strategy is a huge burden to place on the host. Now you've made the amount of memory a process is using unquantifiable.
It's not necessarily unquantifiable -- the kernel can count the not-yet-copied pages pessimistically as allocated memory, triggering OOM allocation failures if the amount of potential memory usage is greater than RAM. IIUC, this is how Linux vm.overcommit_memory[1] mode 2 works, if overcommit_ratio = 100.
However, if an application is written to assume that it can fork a ton and rely on COW to not trigger OOM, it obviously won't work under mode 2.
[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accou...
> 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM.
> Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
> Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.
POSIX doesn't require that fork() be implemented using copy-on-write techniques. An implementation is free to copy all of the parent's writable address space.
You also mandate a system complex enough to have an MMU.
Copy-on-write is supposed to be cheap, but in fact it's not. MMU/TLB manipulations are very slow. Page faults are slow. So the common thing now is to just copy the entire resident set size (well, the writable pages in it), and if that is large, that too is slow.
> clone() is stupid ... the clone(2) design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.
IMHO a bigger problem [2] in practice with clone is that (according to glibc maintainers) once your program calls it, you can't call any glibc function anymore. [1] Essentially the raw syscall is a tool for the libc implementation to use. The libc implementation hasn't provided a wrapper for programs to use which maintains the libc's internal invariants about things like (IIUC) thread-local storage for errno.
The author's aforkx implementation is something that glibc maintainers could (and maybe should) provide, but my understanding is that you can get in trouble by implementing it yourself.
[1] https://github.com/rust-lang/rust/issues/89522#issuecomment-...
[2] editing to add: or at least a more concrete expression of the problem. Wouldn't surprise me if they haven't provided this wrapper in part because the proliferation the author mentioned makes it difficult for them to do so.
It's really unfortunate that the sanctioned way to call Linux syscalls directly is via the syscall() function (previously the _syscallN macros), and both of those methods set errno on error, which fails in a clone() thread.
If only Glibc provided a syscall_r() or something that returns the raw return value whether it's an error or not.
It is possible to make syscall() (and regular libc syscalls like read()) work in a clone() thread. I use this in performance-optimised I/O code in a database engine, so I know it works, but it requires some ugly Glibc-and-architecture-specific things. Doing it portably doesn't seem to be an option.
The problem with this argument is that the set of programs that just fork() and then exec() is fairly small. Sure, shells are small and do this, but then the article argues that shells are a good use of fork().
In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done (maybe you want to create a new pid ns, you need a separate mm because you're going to allocate a bunch, whatever). Maybe the argument is that programs should never do this? I don't buy that. Then there's a lot of string-slinging through exec().
That's backwards from my experience, which is that most users of fork() only do "fork; child does small amount of setup, eg closing file descriptors; exec". Shells are one of the few programs that do serious work in the child, because the POSIX shell semantics surface "create a subshell and do..." to the shell user, and then the natural way to implement that when you're evaluating an expression tree is "fork, and let the child process continue evaluating as a long-lived process continuing to execute as the same shell binary". (Depending on what's in that sub-tree of the expression, it might eventually exec, but it equally might not.)
Many years back I worked on an rtos that had no fork(), only a 'spawn new process' primitive (it didn't use an MMU and all processes shared an address space, so fork would have been hard). Most unixy programs were easy to port, because you could just replace the fork-tweak-exec sequence with an appropriate spawn call. The shells (bash, ash I think were the two I looked at) were practically impossible to port -- at any rate, we never found it worth the effort, though I think with a lot of effort and willingness to carry invasive local patches it could have been done.
The vast majority of programs that fork are doing fork() followed almost immediately by exec(), to the extent that on macOS for example a process is only really considered safe for exec() after fork() happens. Pretty much nothing else is considered safe.
Yeah; that would be my assumption too. I once worked on a significant project that benefited from fork() without exec() and it was a monstrous pain - only if you own every single line of code in your project, have centralized resource management, and have no significant library dependencies should you ever consider doing this.
Oh no, there's tons of ProcessBuilder type APIs in Java, Python, and... every major language you can think of.
The problems with fork() become very apparent in any Java apps that try to run external programs, especially in apps that have many threads and massive heaps and are very busy.
> In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done
That's usually going to be done with clone() instead, no? You'll likely want to fiddle with the various flags for those usages and are unlikely to be happy with what fork() otherwise does.
Microsoft Research has a paper about the very same issue (2019): https://www.microsoft.com/en-us/research/publication/a-fork-...
It's a very good paper, yeah. I will link it from the gist.
That paper smacks of a Chesterton Fence. They haven't come up with a tested replacement for many of the use cases, i.e.:
yet bullet #1 in the next paragraph is
I think this is a case of security guys being upset about fork gumming-up their experiments. I don't really care about their experiments. The security regime for the past 20 years may have bought us a little more security against eastern bloc hackers, but it hasn't done squat to protect us from Apple, Google, & Microsoft! I have never had a virus de-rail my computing life as much as the automatic Windows 10 upgrade. Robert Morris got 400 hours community service for a relatively benign worm. If that's the penalty scale, Redmond should get actual time in the slammer for Cortana, forced Windows Update, and adding telemetry to Calculator.
You fail to address any of the substance of their paper, or of my gist (TFA), then go on a rant about unrelated things. The authors of that paper deserve better treatment even if you hate Microsoft.
I have to disagree that fork is evil. fork is great because of copy-on-write. I guess my particular use case is not very typical/common though.
I'm running powerflow simulations on a power grid model (several GB of memory to store the model). Copy-on-write means I can make small modifications to this model and run simulations in parallel. Thanks to fork/copy-on-write, I can run 32 simulations in parallel, each with small modifications, without requiring 32 times as much memory.
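A minimal Python sketch of that pattern, assuming a Unix-like system where `os.fork()` is available. The `simulate` function, the list-based model, and the `(index, value)` tweaks are hypothetical stand-ins, not my actual code: the parent builds the model once, each forked child modifies only its own copy-on-write view, and the result comes back through a pipe.

```python
import os
import struct

def simulate(model):
    # Hypothetical stand-in for an expensive simulation.
    return sum(model)

def run_variants(model, tweaks):
    """Fork one child per (index, value) tweak and collect each result."""
    children = []
    for index, value in tweaks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: this write lands in the child's private copy only.
            os.close(r)
            model[index] = value
            os.write(w, struct.pack("<q", simulate(model)))
            os._exit(0)
        # Parent: keep the read end, drop the write end.
        os.close(w)
        children.append((pid, r))

    results = []
    for pid, r in children:
        results.append(struct.unpack("<q", os.read(r, 8))[0])
        os.close(r)
        os.waitpid(pid, 0)
    return results
```

The parent's model is untouched after the call. One caveat for Python specifically: reference counting writes to object headers, so even reads dirty some pages, and the sharing is less complete than it would be for a plain array in C.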
Neat!
I saw a bug once where an application would get way slower on MacOS after calling fork(). Not just temporarily either; many syscalls would continue to run slowly from the call to fork() until the process exited.
Looking on Stack Overflow, I see a few reports of this behavior[0][1].
[0]: https://stackoverflow.com/questions/4411840/memory-access-af...
[1]: https://stackoverflow.com/questions/27932330/why-is-tzset-a-...
I don't think containers should be like jails. Containers should be more like chroots than they are now.
Have you ever tried to run a modern X/whatever app with 3D graphics and audio and DBUS and God knows what else in a container and get it to show up on your desktop? It's a fucking nightmare. I spent over a week trying to get 1Password to run in a container. Somebody decided containers had to be "secure", even though they don't actually exist as a single concept and security was never their primary purpose. If instead containers were used only to isolate filesystem dependencies, we could actually pretend containers were like normal applications and treat them with the same lack of security concern that all the rest of our non-containerized programs are.
Firecracker is the correct abstraction for isolation: a micro-VM. That is the model you want if you want to run an app securely (not to mention reliably, as it can come with its own kernel, rather than needing you to run a compatible host kernel).
I... didn't mean that containers have to have a copy of the operating system inside them, systemd and many other things included. I meant only that they should be created in ways like how the BSDs and Illumos do it.
Is it a fair point to implement first with fork() because of memory protection, then optimize by using benchmarks and potentially vfork() for speed? Benchmark areas can look at synchronous locks, copy-on-write memory, stack sharing, etc.
What are the good practices of security tradeoffs of fork() vs. vfork() especially in terms of ease of writing correct code? I'd thought that fork() + exec() tends to favor thinking about clearer separation/isolation. For example I've written small daemons using fork() + exec() because it seems safe and easy to do at the start.
In short, fork() mixes poorly with multi-threaded code (and has some security footguns like needing to explicitly unshare elements of environment which may be sensitive, such as file descriptors (suddenly you need to know all the file descriptors used in the whole program from a single place in code)). Here is a well-written comment about fork() from David Chisnall: <https://lobste.rs/s/cowy6y/fork_road_2019#c_zec42d>
Additionally, the fork()+exec() idiom practically forces OS designers into a corner where they simply have to implement Copy-on-Write for virtual memory pages, or otherwise the whole userspace using this idiom is going to be terribly slow. Without the fork()+exec() idiom you don't need CoW to be efficient.
Fork mixes so poorly with multithreaded code that a lot of modern languages that are built from the beginning with threads of one sort or another in mind, like Go, simply won't let you do it. There is no binding to fork in the standard library.
I think you could bash it together yourself with raw syscalls, because that can't really be stopped once you have a syscall interface, but basically the Go runtime is built around assuming it won't be forked. I have no idea what would happen to even a "single threaded" Go program if you forked it, and I have no intention of finding out. The lowest level option given in the syscall package is ForkExec: https://pkg.go.dev/syscall#ForkExec And this is a package that will, if you want, create new event loops outside of the Go runtime's control, set up network connections outside of the runtime's control, and go behind the runtime's back in a variety of other ways... but not this one. If you want this, you'll be looking up numbers yourself and using the raw Syscall or RawSyscall functions.
TL;DR if another thread is holding a lock when you fork that lock will be stuck locked in the child, but that thread that was using that lock no longer exists.
So if your multi-threaded program uses malloc you may fork while a global allocation lock is being held and you won't be able to use malloc or free in the child (thread-local caches aside).
There are other problems but this is the basic idea. To be fork-safe you need to allow any thread to just disappear (or halt forever) at any point in your program.
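Here is a small Python sketch of the stuck-lock case, assuming a Unix-like system with `os.fork()`. The function name is made up for illustration, and the timeout exists only so the demo terminates; a plain blocking acquire in the child would hang forever. Recent Python versions emit a DeprecationWarning when forking a multi-threaded process, which the sketch silences on purpose.

```python
import os
import threading
import warnings

def lock_is_stuck_in_child(timeout=0.5):
    """Return True if a forked child cannot acquire a lock held by another thread."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        # This thread owns the lock at the moment of the fork.
        with lock:
            held.set()
            release.wait()

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: the holder thread does not exist here, but the lock's
        # locked state was copied, so nobody will ever release it.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 1
```

In the parent the lock is released normally once the holder thread finishes; only the child's copy stays locked.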
fork came first; it's POSIX threads that is a bolted on piece of clunk that mixes badly with fork, signal handlers, chdir, ...
Apologies if this is a silly question, but it seems like there's a false dichotomy here:
(1) You have separate fork() (etc.) and exec(), so that in the brief window in between you can set all the properties of the new process using APIs that exist anyway for controlling your own process.
(2) You have a single call to spawn a new process, but you have a million different options to control every aspect of the new process.
Why not do it this other way instead? Perhaps a bit late now but seems like in retrospect it would give the API simplicity of fork+exec without any of the complications.
(3) There are two steps to run a new process. The first fully sets up its memory and returns a PID, but doesn't start running it. The second call, unfreeze(), allows it to begin executing code. All the usual APIs that exist anyway for controlling your own process take an extra parameter specifying the PID of a frozen child (or -1 for the current process).
There is something about fork which I have never understood. Maybe someone here can explain it to me.
Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork). If you know that you're going to call exec immediately after fork, then all the issues of dealing with the (potentially large) address space of the parent just evaporate because the child process is just going to immediately discard it all.
So why is there not a fork-exec combo? And why has it not replaced fork for 99% of use cases?
And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
None of this makes sense to me.
Because there are many, many use cases where you don't want to call exec() immediately after fork().
Want to constrain memory usage or CPU time of an arbitrary child process? You have to call setrlimit() before exec(). Privilege separation? Call setuid() before exec(). Sandbox an untrusted child process in some way? Call seccomp() (or your OS equivalent) before exec(). And so on and so forth. Any time you want to change what OS resources the child process will have access to, you'll need to do some set-up work before invoking exec().
Windows solves this by adding a bunch of optional parameters to CreateProcess, as well as having two more variants (CreateProcessAsUser and CreateProcessWithLogon). Some of the arguments are complicated enough that they have helper functions to construct them.
I like the more composable fork()->modify->exec() approach of unix, but I wouldn't call either of them really elegant.
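To make the fork, modify, exec sequence concrete, here is a hedged Python sketch (Unix only). The helper name `run_with_fd_limit` is made up for illustration; it lowers the child's open-file soft limit with `resource.setrlimit()` after the fork and before the exec, and captures the child's stdout through a pipe.

```python
import os
import resource

def run_with_fd_limit(argv, soft_limit):
    """Run argv with a lowered RLIMIT_NOFILE soft limit; return (exit code, stdout)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # Child: adjust our own process state, then replace the image.
            os.close(r)
            os.dup2(w, 1)
            _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            if hard != resource.RLIM_INFINITY:
                soft_limit = min(soft_limit, hard)
            resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
            os.execvp(argv[0], argv)
        finally:
            # Only reached if the setup or the exec failed.
            os._exit(127)
    os.close(w)
    with os.fdopen(r) as out:
        data = out.read()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), data
```

The parent's own limits are unchanged, because the `setrlimit()` call only ever runs in the child.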
To me this feels like a call for more powerful language primitives. i.e. a way to specify some action to take to "set up" the child process that's more explicit and readable than one special behaving in a particularly odd way. I'm imagining closures with some kind of Rust-like move semantics, but not entirely sure.
(if we're speaking in terms of greenfield implementation of OS features)
But my child processes are not arbitrary or untrusted, they're hard-coded and written by me!
I'm not writing a shell, I'm writing an application!
Dennis Ritchie addresses this in a history of early Unix: https://www.bell-labs.com/usr/dmr/www/hist.html
"Process control in its modern form was designed and implemented within a couple of days. It is astonishing how easily it fitted into the existing system; at the same time it is easy to see how some of the slightly unusual features of the design are present precisely because they represented small, easily-coded changes to what existed. A good example is the separation of the fork and exec functions. The most common model for the creation of new processes involves specifying a program for the process to execute; in Unix, a forked process continues to run the same program as its parent until it performs an explicit exec. The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else."
OK, but why has it not been replaced with something better in the intervening 50 years? There have been a lot of improvements to unix since 1970. Why not this?
There is exactly a fork-exec combo like that: it's called posix_spawn(): https://man7.org/linux/man-pages/man3/posix_spawn.3.html
I think the reason for fork() and exec() as primitives goes back to the early days of Unix design philosophy. Unix tends to favour "easy and simple for the OS to implement" rather than "convenient for user processes to use". (For another example of that, see the mess around EINTR.) fork() in early unix was not a lot of code, and splitting into fork/exec means two simple syscalls rather than needing a lot of extra fiddly parameters to set up things like file descriptors for the child.
There's a bit on this in "The Evolution of the UNIX Time-Sharing System" at https://www.bell-labs.com/usr/dmr/www/hist.html -- "The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else." It says the initial fork syscall only needed 27 lines of assembly code...
(Edit: I see while I was typing that other commenters also noted both the existence of posix_spawn and that quote...)
> Unix tends to favour "easy and simple for the OS to implement"
Well, yeah, but the whole problem here, it seems to me, is that fork is not simple to implement precisely because it combines the creation of the kernel data structures required for a process with the actual initiation of the process. Why not mkprocess, which creates a suspended process that has to be started with a separate call to exec? That way you never have to worry about all the hairy issues that arise from having to copy the parent's process memory state.
> why would anyone ever want fork as a primitive
Long ago in the far away land of UNIX, fork was a primitive because the primary use of fork was to do more work on the system. You likely were one of three or four other people, at any given moment vying for CPU time, and it wasn't uncommon to see loads of 11 on a typical university UNIX system.
> so why is there not a fork-exec combo
you're looking for system(3). Turns out, most people waitpid(fork()). Windows explicitly handles this situation with CreateProcess[0] which does a way better job of it than POSIX does (which, IMO, is the standard for most of the win32 API, but that's a whole can of worms I won't get into).
> why would anyone ever use vfork?
Small shells, tools that need the scheduling weight of "another process" but not for long, etc. See also, waitpid(fork()).
When you have something with MASSIVE page tables, you don't want to spend the time copying the whole thing over. There's a huge overhead to that.
[0] https://docs.microsoft.com/en-us/windows/win32/api/processth...
system(3) is not a good alternative because it indirects through the shell, which adds the overhead of launching the shell as well as the danger of misinterpreting shell metacharacters in the command if you aren’t meticulous about escaping them correctly.
`fork` is a classic example, as others have mentioned, as something that was implemented because it was [at the time] easy rather than because it was a good design. In the decades since, we've found there are issues that are caused by the semantics of fork, especially if the most common subsequent system call is `exec`.
If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
POSIX specifies a fork+exec combo called posix_spawn. This is actually used a fair amount, but the reason it isn't used more is because it doesn't support all of the eleventy-billion parameters governing all of the attributes of the spawned process. Instead, these parameters are usually set by calling system calls that change these parameters between fork and exec. These system calls might, for example, change the root directory of a process or attach a debugger. Neither of these are supported by posix_spawn, which only allows the common operations of changing the file descriptors or resetting the signal mask in the list of actions to do.
And this suggests why you might want vfork: vfork allows you to write something that looks like posix_spawn: you get to fork, do your new-process-attribute-setting-flags, and then exec to the new process image, all while being able to report errors in the same memory space.
> If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
Or if you happen to be sane you'll have a single, simple system call to create a blank, suspended child process, and all the regular system calls which operate on process state will take a handle or process "file descriptor" to indicate which process to modify rather than assuming the current process as the target.
This was the ultimate flaw of posix_spawn(). As you point out it doesn't support all the things you might want to tweak in the child process—a consequence of trying to capture every aspect of the initial process state in a single process-creation API rather than distributing the work through the normal system calls so that each new interface or state can be adjusted for child processes in the same way that it's adjusted for the current process.
Whatever you do, though, make sure it's possible to emulate fork() reliably with your "better" replacement. Consider the case of Cygwin where emulated fork() calls can (and frequently do) fail in bizarre ways because the "blank" child process was pre-loaded with some unexpected virtual memory mapping by AV software or other system tasks, with the result that a required DLL or private memory space can't be set up at the same address used in the parent.
> Why would anyone ever want fork as a primitive?
fork() without exec() can make sense in the context of a process-per-connection application server (like SSH). I've also used it quite effectively as a threading alternative in some scripting languages.
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(). Like a lot of POSIX APIs, it's kind of overcomplicated, but it does solve a lot of the problems with fork/exec.
> And as long as I'm asking stupid questions, why would anyone ever use vfork?
For processes with a very large address space, fork() can be an expensive operation. vfork() avoids that, so long as you can guarantee that it'll immediately be followed by an exec().
fork with copy-on-write semantics avoids copying the whole address space. It does have to copy some data structures that manage virtual memory and maybe the first level of the paging structure (page directory or whatever).
From "Operating Systems: Three Easy Pieces" chapter on "Process API" (section 5.4 "Why? Motivating The API") [1]:
[1] https://pages.cs.wisc.edu/~remzi/OSTEP/cpu-api.pdf
As an explanation it doesn't make much sense, because there are other ways to alter the environment of the about-to-be-run program (see any non-Unix OS for examples).
Because "fork" was easy to implement in UNIX on the PDP-11.
The original implementation was for a machine with very limited memory. So fork worked by swapping out the process. But then, instead of releasing the in-memory copy, the kernel duplicated the process table entry. So there were now two copies of the process, one in memory and one swapped out. Both were runnable, even if there wasn't enough memory for both to fit at once. Both executed onward from there.
And that's why "fork" exists. It was a cram job to fit in a machine with a small address space.
> So why is there not a fork-exec combo?
posix_spawn
> Why would anyone ever want fork as a primitive?
With fork you can very easily write a server like mini_httpd:
https://acme.com/software/mini_httpd/
Or, in Unix shells:
here, the shell must fork a process (without exec) to run one of these functions.
For instance function1 might run in a fork, the grep is a fork and exec of course, and function2 could be in the shell's primary process.
In the POSIX shell language, fork is so tightly integrated that you can access it just by parenthesizing commands:
Everything in the parentheses is a sub-process; the effect of the cd, and any variable assignments, are lost (whether exported to the environment or not).
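The same scoping effect can be sketched outside the shell. This is a hypothetical Python illustration (Unix only), not the shell example itself: the child changes directory and sets an environment variable, and neither change is visible in the parent afterwards. The name `FORK_SCOPE_DEMO_VAR` is made up for the demo.

```python
import os

def changes_stay_in_child(new_dir="/"):
    """Return True if a child's chdir and environment change do not leak into the parent."""
    before = os.getcwd()
    pid = os.fork()
    if pid == 0:
        # Child: like the parenthesized commands, these changes die with us.
        os.chdir(new_dir)
        os.environ["FORK_SCOPE_DEMO_VAR"] = "set-in-child"
        os._exit(0)
    os.waitpid(pid, 0)
    return os.getcwd() == before and "FORK_SCOPE_DEMO_VAR" not in os.environ
```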
In Lisp terms, fork makes everything dynamically scoped, and rebinds it in the child's context: except for inherited resources like signal handlers and file descriptors.
Imagine every memory location having *earmuffs* like a defvar, and being bound to its current value by a giant let, and imagine that being blindingly efficient to do thanks to VM hardware.
I use fork a lot in my Python science programs. It's really great - you can stick it in a loop and get immediate parallelism. It's much better than multiprocessing, etc, as you keep the state from just before the fork happened, so you can share huge data structures between the processes, without having to process the same data again or duplicate them. I've even written a module for processing things in forked processes: https://pypi.org/project/forkqueue/
Splitting fork and exec allows you to do stuff before calling exec, for example redirecting file descriptors (like stdin/out/err), creating a pipe, modifying the child's environment, and so on.
(This is particularly useful for shells.)
These can all be made a part of the combined fork+exec API.
> Why would anyone ever want fork as a primitive?
> So why is there not a fork-exec combo?
There are so many variations to what you can do with fork+exec that designing a suitable "fork-exec combo" API is really difficult, so any attempts tend to yield a fairly limited API or a very difficult-to-use API, and that ends up being very limiting to its consumers.
On the flip side, fork()+exec() made early Unix development very easy by... avoiding the need to design and implement a complex spawn API in kernel-land.
Nowadays there are spawn APIs. On Unix that would be posix_spawn().
> And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
(Not a stupid question.)
You'd use vfork() only to finish setting up the child side before it execs, and the reason you'd use vfork() instead of fork() is that vfork()'s semantics permit a very high performance implementation while fork()'s semantics necessarily preclude a high performance implementation altogether.
Well, fork() is simple. No args, simple semantics.
Flexibility; you can set up pipes.
> why is there not a fork-exec combo
There is, the spawn calls mentioned.
I think it's actually a pretty useful primitive for doing multiprocessing. Unlike threading, you have a completely separate memory space both for avoiding data races and performance (memory allocators still aren't perfect and weird stuff can happen with cache lines). Unlike exec after fork or anything equivalent, you still get to share things like file descriptors and read only memory for convenience.
> Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork).
If you eliminate fork, then what do you do for those 1% of cases where you actually do need it? I agree that it's uncommon, but I have written code before that calls fork() but then does not exec().
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(3).
> And why has it not replaced fork for 99% of use cases?
Even though it's been around for about 20 years, it's still newer than fork+exec, so I assume a) many people just don't know about it, or b) people still want to go for maximum compatibility with old systems that may not have it, even if that's a little silly.
Lacking fork(), if you want to multi-process a service, you have to spawn (vfork()+exec() or posix_spawn(), or whatever) the processes and arrange for them to get whatever state and resources they need to start up. It's a pain, but I've done it.
You might want to move around some file descriptors if you don't want the child process to inherit your stdin/stdout/stderr (e.g. if you want to read the stdout of the process you launched, or give it some stdin).
And there does exist such a fork-exec combo - posix_spawn. It allows adding some "commands" of what file descriptor operations to do between the fork & exec before they're ever done, among some other things. But, as the article mentions, using it is annoying - you have to invoke various posix_spawn_file_actions_* functions, instead of the regular C functions you'd use.
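For comparison, here is roughly what the file-actions style looks like through Python's `os.posix_spawn()` wrapper (Python 3.8+, Unix only). This is a sketch; the helper name `spawn_capture_stdout` is made up. The redirections are described up front as a list of actions instead of being performed with ordinary calls between fork and exec.

```python
import os

def spawn_capture_stdout(argv):
    """Spawn argv (argv[0] must be a path) and return (exit code, stdout bytes)."""
    r, w = os.pipe()
    file_actions = [
        # Performed in the child before the new program starts:
        (os.POSIX_SPAWN_DUP2, w, 1),   # make the pipe's write end stdout
        (os.POSIX_SPAWN_CLOSE, r),     # the child does not need the read end
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(w)
    with os.fdopen(r, "rb") as out:
        data = out.read()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), data
```

Anything that has no corresponding file action or spawn attribute cannot be expressed this way, which is the limitation discussed above.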
> 99% of the time you immediately call exec after fork
What about forking servers? listen() and then immediately fork() to handle the inbound connection? Those don't need exec.
Also daemons. It's a common pattern to ditch permissions and then fork(), as per the old "Linux Daemon Writing HOWTO".
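A minimal sketch of the fork-per-connection pattern in Python, assuming a Unix-like system. `handle_connection` and `serve_one` are hypothetical names, and the handler just uppercases whatever it receives. A real server would loop over `accept()` and reap children asynchronously instead of waiting for each one.

```python
import os
import socket

def handle_connection(conn):
    # Hypothetical per-connection work: echo the request back in upper case.
    data = conn.recv(1024)
    conn.sendall(data.upper())

def serve_one(listener):
    """Accept one connection and handle it in a forked child, with no exec."""
    conn, _ = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: it inherited both sockets and only needs the connection.
        listener.close()
        try:
            handle_connection(conn)
        finally:
            conn.close()
            os._exit(0)
    # Parent: the child owns the connection now.
    conn.close()
    os.waitpid(pid, 0)
```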
You can vfork()+exec(), why not? Exec too expensive? You can prefork[0].
Do people really do that? It sounds like a huge DoS vulnerability to me.
>So why is there not a fork-exec combo?
There is, posix_spawn.
The whole idea of fork is strange - the design pattern of "child process is executing exactly where the parent process is executing" is foreign to me. Don't we want to direct where the child process is executing? Like, when creating a thread? Why is fork() so conceptually orthogonal to that? Is there a good reason? A historical reason?
I don't find fork() to be obvious or useful or natural. I work hard to never do it.
fork()–exec() separation indeed exists for historical reasons: https://www.bell-labs.com/usr/dmr/www/hist.html
Search for the phrase "Process control in its modern form was designed and implemented within a couple of days."
It makes creating processes easy for me, once you understand how it works:
No need to do complex things to start a new process, having to pass arguments to it in some way, etc.
Oh I understand how it works. I implemented it, in the first POSIX implementation. I just don't get how anybody wants to do that.
Yes, there's the example right there. But it shows the awkwardness immediately - decoding what the f happened by checking a side effect (is pid == 0? wtf?)
How about spoon(handle_connection, ...) or something like that? See how much better?
If you want the child to start executing some other code but you have fork(), it's easy to do it yourself by calling that function.
But on the other hand, if you do want the child to execute code at the same place as the parent, but a hypothetical fork() asks you to provide a function pointer, it would be a bit more complicated.
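A rough sketch of that point: a spoon()-style call (the name comes from the comment above; no such system call exists) can be layered on top of fork() in a few lines. This uses Python's os.fork and a made-up handle_connection function as a stand-in.

```python
import os


def spoon(func, *args):
    """Hypothetical helper: run func(*args) in a forked child; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: run the requested function, then exit without ever
        # returning into the parent's call stack.
        status = 1
        try:
            func(*args)
            status = 0
        finally:
            os._exit(status)
    return pid  # Parent: only ever sees the child's pid.


def handle_connection(fd, payload):
    """Stand-in for real per-connection work (illustrative only)."""
    os.write(fd, payload)


if __name__ == "__main__":
    r, w = os.pipe()
    child = spoon(handle_connection, w, b"hi")
    os.close(w)
    print(os.read(r, 16))
    os.waitpid(child, 0)
```

Going the other way, recovering "continue from the same place" from a function-pointer API, has no equally short wrapper, which is the asymmetry being described.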
It's a leaky abstraction and everything it does can be done manually, and possibly better. It exists purely because, at some point in the past, threads didn't exist.
If you design your program without fork, you'll probably end up with a cleaner and faster solution. Some things are best forgotten or never learned in the first place.
Can it though?
The beauty of (v)fork(+exec) is that it doesn't need a new interface for configuring the environment in whichever way you want before the other process starts. Instead you get to use the exact same means of modifying the environment to your needs, and once it's done, you can call exec and the new process inherits those things.
I mean, just look at the interface of posix_spawn.
I grant though that this isn't without its problems (including performance) and IMO e.g. FD_CLOEXEC is one example of how those problems can be patched up. It's like the reverse problem: you have too wide implicit interface in it, and then you need to come up with all these ways to be explicit about some things.
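To make the contrast concrete, here is a small sketch of the fork-then-exec style in Python: the child changes its own working directory and stdout with the same calls any process would use on itself, then execs. The function name run_in_dir_capture is made up for illustration.

```python
import os
import sys


def run_in_dir_capture(argv, cwd):
    """Fork, set up the child with ordinary calls, exec argv; return (exit_code, stdout)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: no special "spawn attributes" API, just the usual calls.
        try:
            os.chdir(cwd)
            os.dup2(write_fd, 1)      # stdout now points at the pipe
            os.close(read_fd)
            os.close(write_fd)
            os.execvp(argv[0], argv)  # replaces the process; raises only on failure
        finally:
            os._exit(127)             # reached only if setup or exec failed
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return code, b"".join(chunks)


if __name__ == "__main__":
    show_cwd = [sys.executable, "-c", "import os; print(os.getcwd())"]
    print(run_in_dir_capture(show_cwd, "/"))
```

The flip side is visible too: the pipe descriptors have to be closed by hand in the child, which is the "too wide implicit interface" problem that FD_CLOEXEC patches over.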
Add to that, fork is (was) very inefficient. You had to duplicate the entire process state (page tables etc). Then the damn program would exec(), and you would tear it all down again. Took 100ms on older computers. Complete waste.
We would resort to making a weak copy, with page tables faulting in only if you used them. A lot of drama, so the user could make a goofy call that they didn't really want most of the time.
A thread is not the same thing as a process. There are situations where a thread is fine, and others where you need a process.
Think of it as the CS equivalent of cell division and differentiation in biology.
Another option is to allow the parent to create an empty child process, and then make arbitrary system calls and execute code in the child, like a debugger does. In most cases the last "remote system call" would be exec.
posix_spawn() essentially is like that, or can be, as an implementation detail.
One use case for fork()--which is used extensively on Android--is to build an expensive template process that can then be replicated for later work, which is exactly what people often want for the behavior with virtual machines. I wrote an article on the history of linking and loading optimizations leading up to how Android handles their "zygote" which touches on this behavior.
http://www.cydiasubstrate.com/id/727f62ed-69d3-4956-86b2-bc0...
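The template-process idea can be sketched in a few lines. This is not how Android's zygote is implemented; it only illustrates the pattern, with a made-up build_template() standing in for the expensive initialization and a made-up fork_worker() that replicates the process for one piece of work.

```python
import os


def build_template():
    """Stand-in for expensive one-time initialization (hypothetical workload)."""
    return {n: n * n for n in range(100_000)}


def fork_worker(template, key):
    """Fork a child that answers one lookup using the inherited template."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the template is already in memory; nothing is rebuilt.
        try:
            os.close(read_fd)
            os.write(write_fd, str(template[key]).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    data = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return int(data)


if __name__ == "__main__":
    template = build_template()       # paid once, in the parent
    print(fork_worker(template, 12))  # each worker starts with it in place
```

In CPython specifically, reference counting writes to object headers, so fewer pages stay shared than in an equivalent C program; the pattern itself is the same.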
We had the case that some library we were using (OpenBLAS) used pthread_atfork. Unfortunately, the atfork handler was buggy in certain situations involving multiple threads and caused a crash. This was annoying because we basically did not need fork at all but just fork+exec (for various other libraries spawning subprocesses), where those atfork handlers would not be relevant.
Our solution was to override pthread_atfork to ignore any functions, and in case this is not enough, also fork itself to just directly do the syscall without calling the atfork handlers.
https://github.com/tensorflow/tensorflow/issues/13802 https://github.com/xianyi/OpenBLAS/issues/240 https://trac.sagemath.org/ticket/22021 https://bugs.python.org/issue31814 https://stackoverflow.com/questions/46845496/ld-preload-and-... https://stackoverflow.com/questions/46810597/forkexec-withou...
posix_spawn() shouldn't call atfork handlers. It's allowed to call them or not call them because implementors can use fork(), which must call them, or they can use vfork(), which must not call them -- or they can make posix_spawn() a proper system call, too, or they can use clone(), or my putative avfork(), or whatever.
If you used vfork(), you wouldn't have had this problem.
Fork-safety issues arise mainly because of the sharing of resources between the parent and child. pthread_atfork() exists mainly to allow libraries to add a measure of fork-safety by letting them disable things on the child-side of fork() or re-set-up things on the child-side of fork(). For example, a PKCS#11 provider might need to create a new connection to the tokens and re-C_Login() to them (except, since it really can't quite do that, most likely it must render every session inoperable on the child-side). (Indeed, PKCS#11 specifically mandates that on the child-side of fork all sessions must be dead and must not be used.)
I left a comment on TF #13802.
The good/evil/etc. here seem to be defined exclusively around "performance above all else", and - more specifically - performant primitives over performant application architecture.
It strikes me that performance gains associated with sharing address space & stack are similar to many performance gains: trade-offs. So calling them "good" and "evil" when performance is seemingly your sole goal and interest seems a bit forward.
In my world we often say things like "X is the moral equivalent of Y" where X and Y are just technologies and, clearly, are morally-neutral things.
Why do we do this? Well, because it adds emphasis, and a dash of humor.
Clearly fork() is neither Good nor Evil. It's morally neutral. It has no moral value whatsoever. But to say "fork() is evil" is to cause the audience to raise their eyebrows -"what, why would you say fork() is evil?!"- and maybe pay attention.
Yes, there is the risk that the audience might react dismissively because fork() obviously is morally-neutral, so any claim that it is "evil" must be vacuous or hyperbolic. It's a risk I chose to take.
Really, it's a rhetorical device. I think it's pretty standard. I didn't create that device myself -- I've seen it used before and I liked it.
Morally-neutral does not equate to neutral insofar as I think most technologists consider some tech to be "good" and some to be "bad" in a practical sense.
"Good -vs- evil" is obviously hyperbolic - particularly the latter - but outside of morals they still imply a tendency to be technically/practically good or bad in an objective sense. So discounting it as a mere rhetorical device seems overly dismissive.
Fork() is the second worst idea in programming, behind null pointers. Fork() is the reason overcommit exists, which is the reason my web browser crashes if I open too many tabs, and the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library. It's a clear example of "worse is worse", and we should have switched to the Microsoft Windows model decades ago.
Here's a paper from Microsoft Research supporting this point of view:
https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
> the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library
Linux overcommitment is often cited as an argument for the "panic on OOM" design of the allocating parts of the Rust standard library, and it's an important part of the story. But I think even if the Linux defaults were different, Rust would still have gone with the same design. For example, here's Herb Sutter (who works for Microsoft) arguing that C++ would benefit from aborting on allocation failure: https://youtu.be/ARYP83yNAWk?t=3510. The argument is that the vast majority of allocations in the vast majority of programs don't have any reasonable options for handling an alloc failure besides aborting. For languages like C++ and Rust, which want to support large, high-level applications in addition to low-level stuff, making programmers litter their code with explicit aborts next to every allocation would be really painful.
I think it's very interesting that Zig has gone the opposite direction. It could be that writing big applications with lots of allocs ends up feeling cumbersome in Zig, or it could be that they bend the curve. Fingers crossed.
Why is overcommit a problem? A program is unlikely to use all the memory that it allocates, or it uses it only at a later time. It would be a waste not to have it: it would mean having a ton of RAM that never gets used, because a lot of programs allocate more RAM than they will probably ever need. And it would be inefficient, costly and error-prone to use dynamic memory allocation for everything.
The cause of your browser crash is not overcommit; it is simply that you don't have enough memory. If you disable overcommit (something you can do on Linux) you would get the same crash earlier, before you had allocated (not necessarily used) 100% of your RAM, because almost no software handles the dynamic allocation failure condition (i.e. malloc returning null), which you can't handle reasonably anyway.
Null pointers are not a mistake. How do you signal the absence of a value otherwise? How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)? null makes perfect sense as a value to signal "this pointer doesn't point to something valid".
Microsoft saying that fork() was a mistake... well, of course, because Windows doesn't have it. fork was a good idea, and that is why it's still used these days. Of course there has been evolution since: in Linux there is the clone system call (fork is deprecated and still there for compatibility reasons; the glibc fork is implemented with the clone system call). But the concept of creating a process by cloning the resources of the parent has always seemed very elegant to me.
In reality fork is something that (if I remember correctly; I don't have that much experience programming on Windows) doesn't exist on Windows, and the only way to create a new process of the same program is to launch the executable and pass the parameters on the command line. That is not great for efficiency, and it can also have its problems (for example, the executable was deleted, renamed, etc. while the program was running). Windows also has no concept of exec, though I think it can be emulated in software (while fork can't).
To me it makes perfect sense to separate the concept of creating a new process (fork/clone) and loading an executable from disk (exec). It gives a lot of flexibility, at a cost that is not that high (and there are alternatives to avoid it, such as vfork or variations of the clone system call, or directly higher level API such as posix_spawn).
I think much of the confusion around nulls stems from the fact that in mainstream languages pointers are overloaded for two purposes: for passing values by reference, and for optionality.
Nearly every pointer bug is caused by the programmer wanting one of these two properties, and not considering the consequences of the other.
Non-nullable references and pass-by-value optionals can replace many usages of pointers.
>How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)?
Rust does this with the Result and Option "enums", which are internally implemented as tagged unions. From my understanding the only overhead with this implementation is the size taken by the tag and then any padding required for alignment.
It also helps that references in Rust are not nullable and working with pointers is fairly rare, so the type system can do a lot of heavy lifting for you rather than putting null checks all over the place. When you have &T you never have to worry about handling null in the first place!
>Null pointers are not a mistake
The inventor, Tony Hoare, famously called them his "billion-dollar mistake". The better way to do it is with nullable types (which could internally represent null as 0 as a performance optimization). This is something Rust gets right.
Interesting take. If you don't mind explaining, what is the MS Windows model in this context?
You opt into inheriting specific contexts from the parent, instead of copying everything by default:
https://docs.microsoft.com/en-us/windows/win32/api/processth...
Windows doesn't have fork as you know it. It has a POSIX-ish fork-alike for compliance, but under the hood it's CreateThread[0] with some Magic.
In Windows, you create the thread with CreateThread and are passed back a handle to that thread. You can then query the state of the thread using GetExitCodeThread[1] or, if you need to wait for the thread to finish, call WaitForSingleObject [2] with an infinite timeout.
Aside: WaitForSingleObject is how you track a bunch of stuff: semaphores, mutexes, processes, events, timers, etc.
The flipside of this is that Windows processes are buckets of handles: a Process object maintains a series of handles to (threads, files, sockets, WMI meters, etc), one of which happens to be the main thread. Once the main thread exits, the system goes back and cleans up (as it can) the rest of the threads. This is why sometimes you can get zombie'd processes holding onto a stuck thread.
This is also how it's a very cheap operation to interrogate what's going on in a process ala Process Explorer.
If I had to describe the difference between Windows and Linux at a process model level, I have to back up to the fundamental difference between the Linux and Windows programming models: Linux is a kernel that has to hide its inner workings for its safety and security, passing wrapped versions of structures back and forth through the kernel-userspace boundary; Windows is a kernel that considers each portion of its core separated, isolated through ACLs, and where a handle to something can be passed around without worry. The Windows ABI has been so fundamentally stable over 30 years now because so much of it is built around controlling object handles (which are allowed to change under the hood) rather than manipulation of kernel primitives through syscalls.
Early WinNT was very restrictive and eased up a bit as development continued so that win9x software would run on it under the VDM. Since then, most windows software insecurities are the result of people making assumptions about what will or won't happen with a particular object's ACL.
There's a great overview of windows programming over at [3]. It covers primarily Win32, but gets into the NT kernel primitives and how it works.
A lot of work has gone into making Windows an object-oriented kernel; where Linux has been looking at C11 as a "next step" and considering if Rust makes sense as a kernel component, Windows likely has leftovers of Midori and Singularity [4] lingering in it that have gone onto be used for core functionality where it makes sense.
[0] https://docs.microsoft.com/en-us/windows/win32/api/processth... [1] https://docs.microsoft.com/en-us/windows/win32/api/processth... [2] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/... [3] https://www.tenouk.com/cnwin32tutorials.html [4] https://www.microsoft.com/en-us/research/project/singularity...
Overcommits exist any time you can have a debugger anyways.
fork() was a brilliant way to make Unix development easy in the 70s: it made it trivial to move a lot of development activity out of the kernel and into user-land.
But with it came problems that only became apparent much later.
Agreed about overcommit and resulting mess.
unpopular opinion: null pointers (in at least java and c) are the single greatest metaphor in software development, and are the CS analog to the invention of zero
There was an article about exceptions the other day that lamented that exceptions are high latency because the exceptional path will be paged out. I would assume overcommit is to blame for that too.
That's probably a caching issue, and caching issues are a fact of life for the foreseeable future. (Could also be a disk swap issue, but probably not.)
Why would you assume that..?
"I won't bother explaining what fork(2) is -- if you're reading this, I assume you know.", If that applied to everything I looked at from HN I'd read precious little.
I didn't write it for HN. It wasn't a paper to publish in some Computer Science journal. It was just a github gist. If you don't get the subject, it's not for you. I might well write a paper now based on it, and then it might be a good read for you, but I still won't be writing it for you, but for people who are interested in the topic. The intended audience is small, expert on the matter, and probably even more opinionated than I am.
I found the article well written and informative even though it's not my area of expertise, I intended my comment as a light hearted reflection of the fact that a lot of articles on HN go over my head but are still worth a read to me, just like your article.
For those saying to use posix_spawn: What am I supposed to make of the writeup in the posix_spawn manpage though?
"...specified by POSIX to provide a standardized method of creating new processes on machines that lack the capability to support the fork(2) system call. These machines are generally small, embedded systems lacking MMU support"
Is this why no one uses it? It has this gratuitous opinion piece at the beginning that makes people think it's just for embedded systems and my dad's Amiga?
That's just some injected opinion, I assume from someone contributing to glibc who doesn't like posix_spawn I guess? In any case it is wrong.
Don't assume what is written in man pages is the truth. Some of them have a lot of opinion added. It can be useful to cross-check man pages between systems - they don't always call out non-portable options or behavior.
On some kernels posix_spawn is a syscall or specifies flags that make it more efficient than fork+exec. Darwin is one such system, though you can use POSIX_SPAWN_SETEXEC if you still want to replace the current process with a new executable rather than creating a child.
Hah, that's pretty funny. Regardless of the motivation as written, the motivation I surmise is:
- some systems (e.g., Windows) lack fork() for various reasons
- vfork() is baaaad
- I know, let's do something like WIN32's spawn() or CreateProcess(), but, like, better
The middle item I have good reason to think is very likely. vfork() still has a bad rap from that old "vfork() Considered Dangerous" paper. That paper circulated a lot way back when, and was the reason vfork() was removed from some Unixes for a while (well, it was left as an alias of fork()) before it was eventually re-added. The Open Group participants would have been very aware of that paper, and that is almost certainly the reason that POSIX says about vfork():
So if fork() can't perform well, and the committee won't recommend the use of vfork(), what shall the committee do? Answer: design and specify posix_spawn(). It's not an unreasonable answer. Though, IMO of course, they should have un-obsoleted vfork().
Meta comment: Github Gist seems to be great for blogging. Yeah, the UI is not very blog-specific, but it has all the useful features, and then some: markdown, comments, hosting, an index of all posts, some measure of popularity (stars), a very detailed edit history, etc.
All without having to pay or setup anything yourself.
Unfortunately, there's no way to turn off comments on a Gist, which makes it not a viable replacement for anyone who doesn't want to spend a lot of time processing and moderating comments.
Good point. However, you need a GitHub account to post comments so everyone knows who you are. Your reputation might suffer if you constantly post comments that require moderation.
It's great because Chinese GFW won't block.
This avfork implementation is poor. You don't want to make your single threaded programs multi-threaded. I don't really get the big benefit of afork over other existing mechanisms other than handwaving about things being evil.
Also,
> Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via libsocket on top of STREAMS proved to be a very long and costly mistake. Emulating one API from another API with impedance mismatches is difficult at best.
Linux does have a thread creation system call. It's clone(2). It literally creates new threads of execution with various properties. It does not "emulate" threads, it is threads.
> You don't want to make your single threaded programs multi-threaded.
Correct. I want to spawn processes faster from already-multi-threaded programs.
You do, but it's not a good implementation for a general API is all I was trying to say.
Do you really need an "asynchronous process creation" call? The rationale is that "blocking is bad", but a thread creation system call blocks the caller too until the thread is created. So it's not just "blocking", it's the amount of blocking if anything. Is posix_spawn or vfork+exec really too slow for your case?
Then multi-process and multi-threading seems like a reasonable solution. Asynchronous system calls are the exception not the rule in unix. So it wouldn't make sense as a traditional afork(2) system call. You could probably do a posix_spawn for io_uring, but do you really need to?
Some links I found today researching this:
- @famzah's blog about fork vs vfork vs clone performance:
Concurrently running dupe currently on front page: https://news.ycombinator.com/item?id=30499169
:) :) :)
Ha!
The intent of fork() is to start a new process in its own address space. That *fork() variations that run in the SAME address space are confusing. A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages. But generally fork() is very specific from my recollection.
> The intent of fork() is to start a new process in its own address space.
True!
> That *fork() variations that run in the SAME address space are confusing.
Why is it confusing? They are distinct and different system calls, with different semantics. They are also sufficiently similar that they are also similarly named. But there's nothing confusing about their semantics. vfork() is not harder to use than fork() -- it's just subtly different.
> A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages.
I wouldn't expect that. Sandboxing is a large and complex topic.
Amusingly vfork semantics differ across OSes. This program prints 42 in Linux but 1 on Mac: https://godbolt.org/z/jn7Gaf5Me because on Linux they share address space.
Unfortunately there was this paper from the 80s titled "vfork() Considered Dangerous", which led to BSDs removing vfork(), and then later it was re-added because that paper was clearly quite wrong. But the news hasn't quite filtered through to Apple, I guess.
I am pretty sure Mac OS doesn't COW fork(), and that the address space is copied. At least it was the last time I looked. FreeBSD and Linux both seem to COW.
Perhaps there's a reason vfork is different too.
My (very possibly wrong) understanding is that xnu does CoW fork but doesn't overcommit, meaning that memory must be reserved (perhaps in swap) in case the pages need to be duplicated.
There are other complications relating to inheriting Mach ports and the mach_task <-> BSD process "duality" in xnu, which Linux doesn't have. I'd love for someone to chime in who knows more about how this stuff works.
Hopefully someone at Apple will see this post and be convinced to restore vfork() and un-obsolete it.
I started with DOS, where spawn() is the norm, so I've always considered the fork()-like behaviour to be unusual yet handy for certain use-cases. Perhaps a system call that offers a combination of the two behaviours should be named spork().
Fork with cow is inefficient.
Compared to what? In what dimension? Any numbers on that? Where is the trade-off? To what extent does anyone need to care, and under what circumstances?
> Fork with cow is inefficient.
> Compared to what?
vfork()
> Any numbers on that?
I added links to the gist, some of which discuss performance in detail. E.g., https://blog.famzah.net/tag/fork-vfork-popen-clone-performan... and https://bugzilla.redhat.com/show_bug.cgi?id=682922
But you can just reason about this:
O(1) beats O(N).
And O(N) is just the complexity of fork() for a single-threaded parent process. Now imagine a very busy, threaded, large-RSS process that forks a lot. You get threads and child processes stepping all over each other's CoW mappings, causing lots of page faults and copies. Ok, that is still O(N), but users will feel the added pain of all those page faults and TLB shootdowns.
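For anyone who wants their own numbers, here is a rough measurement sketch in Python. It compares fork()+exec() against os.posix_spawn() from a parent carrying some resident memory ("ballast"). The function names and the 64 MiB ballast size are arbitrary choices for illustration, and the result depends on how the platform's libc implements posix_spawn (it only helps where that is vfork-like), so treat any output as indicative at best.

```python
import os
import shutil
import time


def time_fork_exec(n, argv):
    """Average seconds per fork() + exec() + wait cycle."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            try:
                os.execv(argv[0], argv)
            finally:
                os._exit(127)  # reached only if exec failed
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / n


def time_spawn(n, argv):
    """Average seconds per posix_spawn() + wait cycle."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.posix_spawn(argv[0], argv, os.environ)
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    # Assumption: a `true` binary exists; /bin/true is the fallback guess.
    prog = shutil.which("true") or "/bin/true"
    # Non-zero fill so the pages are actually resident, not lazily mapped.
    ballast = bytearray(b"\x01") * (64 * 1024 * 1024)
    print("fork+exec  :", time_fork_exec(20, [prog]))
    print("posix_spawn:", time_spawn(20, [prog]))
```

Varying the ballast size is the interesting experiment: if the fork() cost scales with the parent's address space as claimed, the first number should grow with it while the second stays roughly flat.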
OK, but you're just repeating "it's inefficient" without saying for what uses its inefficiency is even noticeable. I want to reason about when I would care. You see?
The first link didn't even have units on its numbers(!) I assume they're milliseconds. When does that scale become something one would care about at all? Not launching a GUI process. Not a shell pipeline. So when is this issue arising at all? What is being done that makes fork inefficiency anything other than of academic interest? Must be something, right? Forking webserver?
It's inherently inefficient because while the child process does its initialization (pre-exec) stuff, the parent gets page faults for every thread writing into the memory due to COW. This will basically stall the parent and can cause funny issues.
Slightly off topic: how does Erlang handle this? Isn’t it known for having extremely fast & cheap process spawning baked in (with isolation)?
In another comment, I observe how Go doesn't even have a binding to fork.
Erlang is another example of that. There is no standard library binding to the fork function. If someone were to bash one into a NIF, I have no idea what would happen to the resulting processes, but there's no good that can come of it. (To use Star Trek, think less good and evil Kirk and more "What we got back, didn't live long... fortunately.") Despite the terminology, all Erlang processes are green threads in a single OS process.
> Despite the terminology, all Erlang processes are green threads in a single OS process.
The main Erlang runtime uses an M:N Erlang:native process model, not an N:1. So Erlang processes are like green threads (they are called processes instead of threads because they are shared-nothing), but not in a single process.
I mentioned this somewhere else but I thought Erlang does NOT share memory.
Doesn’t that make Erlang a bit unique? It was the ability to spawn a new process extremely fast AND also have memory isolation. This combination is what the OP wanted to achieve.
Client space scheduler and processes. The isolation is a property of the VM and language primitives (you just don’t get any way to share stuff, kinda).
Also Erlang is known for cheap and plentiful processes, not for being fast. It’s fast enough but it’s no speed demon.
My reference to “fast” was in the context of creating a new process due to the OP post talking about how long fork/etc can take. Not in reference to executing code itself.
m:n threads (aka green threads) https://twitter.com/joeerl/status/1010485913393254401
I realize that tweet is from the authority himself but am I mistaken in my understanding …
I thought green threads share memory but Erlang processes do NOT share memory, which is what makes Erlang so unique.
Did Erlang create a so called “green process”? If so, why can’t this model be implemented in the kernel?
Erlang processes aren't unix processes. They're more like coroutines.
The problem is clone is more of a start phase after vfork but before fork regardless for github. So it's kind of a bit strange that we call vfork first but that is about templates too.
As for templates they need to be in different languages and in different formats for video games consoles, and so many other formats they port systems and games that sort of work digitally to certain things but not playable to certain things too.
The other problem is that clone is part of syscall interfaces and part of apis and part of a lot of other things too.
Your idea good
Your idea stupid
I’m not woke by any means, idk what it is about low level programming but calling someone’s idea “stupid” is a really shitty thing to say.
“He chose to take it personally” is the type of lazy, pseudo-stoic argument I have no interest in reading.
Yes I’m having a morning, lol.
I answered this here: https://news.ycombinator.com/item?id=30504804
It's a rhetorical device. I didn't expect this to -years later- become a front-page item on HN. I wrote that to share with certain people.
And yes, clone() has some real problems, and if calling it "stupid" pisses off some people, but maybe also leads others to want to improve clone() or create a better alternative, then that's fine. If I'd wanted to write an alternative to Linux I'd probably have had to deal with the very, very fine language that Linus and others use on the Linux kernel mailing lists -- if you don't like my using the word "stupid", then you really shouldn't look there because you're likely to be very disappointed. Indeed, not only would I have to accept colorful language from reviewers there, I'd probably have to employ some such language myself.
TL;DR: clone() came from Linux, where "stupid" is the least colorful language you'll find, and me calling it "stupid" is just a rhetorical device.