
Comment by monocasa

4 years ago

Hard disagree to most of this.

fork(2) makes a lot more sense when you realize its heritage. It came from a land before Unix supported full MMUs. In this model, to still have per-process address spaces and preemptive multitasking on what was essentially a PC-DOS level of hardware, the kernel would checkpoint the memory for a process, slurp it all out to DECtape or some such, and load in the memory for whatever the scheduler wanted to run next. Its simplicity of being process-checkpoint based wasn't a reaction to Windows-style calls (which wouldn't exist for almost a couple of decades), but instead to mainframe process-spawning abominations like JCL. The idea "you probably want most of what you have, so force a checkpoint, copy the checkpoint into a new slot, and continue separately from both checkpoints" was soooo much better than JCL and its tomes of incantations to do just about anything.

vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec(). All of the bugs that causes are super fun to chase, lemme tell you. AFAIC, about the only valid use for vfork now is nommu systems, where fork() is incredibly expensive compared to what is generally expected.

clone(2) is great. Start from a checkpoint like fork, but instead of semantically copying everything, optionally share or not based on a bitmask. Share a tgid, virtual address space, and FD table? You just made a thread. Share nothing? You just made a process. It's the most 'mechanism, not policy' way I've seen to do context creation outside of maybe the L4 variants and the exokernels. This isn't an old holdover; this is how threads work today: processes spawned that happen to share resources. Modern archs on Linux don't even have a fork(2) syscall; it all happens through clone(2). Even vfork is clone set to share virtual address space and nothing else that fork wouldn't share. Namespaces are a way to opt into not sharing resources that normally fork would share.
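To make the "share or not based on a bitmask" point concrete, here is a minimal Linux-only sketch using the glibc clone() wrapper. The names `counter_after_clone`, `child_fn`, `shared_counter` and `CHILD_STACK_SIZE` are made up for illustration. It only toggles CLONE_VM; a real thread would also pass flags such as CLONE_THREAD, CLONE_SIGHAND and CLONE_FILES.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (256 * 1024)   /* arbitrary size for this sketch */

static volatile int shared_counter = 0;

/* Runs in the context created by clone(); writes to a global and exits. */
static int child_fn(void *arg)
{
    (void)arg;
    shared_counter = 42;
    return 0;
}

/*
 * Hypothetical helper: create a child with the given sharing flags, wait
 * for it, and return the parent's view of shared_counter (-1 on error).
 */
int counter_after_clone(int share_flags)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    shared_counter = 0;

    /* The stack grows down on common architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      share_flags | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    if (waitpid(pid, &status, 0) != pid) {
        free(stack);
        return -1;
    }

    free(stack);   /* safe: the child has already exited */
    return shared_counter;
}
```

With `CLONE_VM` the child's write lands in the parent's memory, so the parent should see the new value; with no sharing flags the child works on its own copy, as after fork(), and the parent's value stays untouched.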

And I don't see what afork gets you that clone doesn't, except afork isn't as general.

(This is a bit of a tangent, apologies.)

> fork(2) makes a lot more sense when you realize its heritage.

I think it only makes sense when you consider its heritage. It has ALL the wrong defaults for what it's almost always used for these days: running a subprocess.

It copies "random" kernel data structures like open FDs, etc. and you have to be very careful about closing the ones you don't want to be inherited, etc. etc. It may copy things that weren't even a relevant concept when you wrote your program.
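For what it's worth, the usual way to opt a descriptor out of this inheritance today is the close-on-exec flag. A minimal sketch, with `set_cloexec` and `survives_exec` as made-up helper names:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Mark fd close-on-exec so it is not passed on to any program this
 * process later exec()s. Returns 0 on success, -1 on error.
 */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}

/* Returns 1 if fd would survive an exec(), 0 if not, -1 on error. */
int survives_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 0 : 1;
}
```

Note that this is still opt-out: a descriptor created without the flag (or without O_CLOEXEC at open time) is inherited by default, which is exactly the complaint here. In multithreaded programs, setting O_CLOEXEC atomically at open time avoids the window between creating the descriptor and calling fcntl().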

The correct thing to do is to be very explicit about what you want to pass on to the subprocess and to choose safe defaults for programs compiled against the old API when you extend it. (Off the top of my head, the only thing I'd want to be automatically inherited by default would be the environment and CWD.)

It's 100% the wrong API for spawning processes.

Now, I don't think afork() solves any of these problems, AFAICT. But my personal perspective is that fork() and its derivatives are the wrong starting point in the first place for what they are used for in 99% of all cases.

  • The behaviour of subprocesses inheriting resources like file descriptors is absolutely bizarre. Why on earth would you want this to be the default?! But we're so used to it, we think it's normal.

  • afork() could do some things differently. The point of afork() is to be able to spawn child processes (that will exec-or-_exit) faster.

The PDP-11 had segment base registers and memory protection, so it wasn't necessary to swap out one process to run another one at the same (virtual) address. It didn't have paging, so it couldn't swap out part of a segment. I think it's true that PDP-11 fork() would stop the process to make a copy of the writable segments, but it didn't have to "checkpoint" the process to a disk or tape. Are you talking about the PDP-7? I don't know anything about the PDP-7.

I agree about vfork(), since I haven't seen a system with segment base registers and no paging in a long time, and about clone(). Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.

What's the L4 approach? Construct the state of the process you want to run in some memory and then use a launch-new-thread system call, then possibly relinquish access to that memory?

  • > Are you talking about the PDP-7?

    Yes

    > Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.

    clone was literally designed to support posix threads.

    > What's the L4 approach?

    Capabilities over all of the kernel objects so user space can do safe brain surgery on them. Since everything is capability based, including the cap tables, you end up duping a cap table, allocating a non-running thread, setting registers, and attaching the duped cap table. Four syscalls in the minimal case, but it's L4 so they're fairly cheap. Total disclosure, one of my side projects is a kernel with caps and a first class VM to do that in one syscall amortized.

    • I see. Maybe that explains why on PDP-7 Unix programs would exec the shell instead of terminating the process; swapping your process out to disk or tape can't have been very fast. But without an MMU what else could you do?

      Plan9 clone() was not designed to support POSIX threads; IIRC they didn't exist and Plan9 didn't support POSIX. Wasn't Linux clone() mostly a copy of it?

      The L4 approach sounds pretty reasonable; not as convenient as fork() in the common case but not as much of a pain as, I don't know, opening a pty or opening an X11 window. I guess L4 syscalls are a bit pricier post-Spectre. How are you going to handle atomicity in your one syscall?


> vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec().

What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you are the author of the code calling vfork() and you know not to do that, so why would that happen?

A: It just wouldn't happen.

And as to exec() failing, this is why a failed exec call must be followed by a call to either another exec() or _exit(), and this is true even if you use fork() instead of vfork(). I.e.:

    /* do a bunch of pre-vfork() setup */
    ...
    
    pid_t pid = vfork();
    
    if (pid == -1) err(1, "Couldn't vfork()");
    
    if (pid == 0) {
      /* do a bunch of child-side setup */
      execve(...);
      /* oops, ENOENT or something */
      _exit(1);
    }
    
    /* the child either exec'ed or exited */
    if (waitpid(pid, &status, 0) != pid) err(1, "...");
    
    ...

How do you detect if the child exec'ed or exited? Well, you make a pipe before you vfork(), you set its ends to be O_CLOEXEC, then on the child side of vfork() you write one byte into it if the exec call fails. On the parent side you read from the pipe before you reap the child, and if you get EOF then you know the child exec'ed, and if you get one byte then you know the child exited. The one byte could be an errno value.
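A sketch of that pipe trick, assuming a POSIX system with vfork(). The helper name `spawn_and_check` and its return convention are made up for illustration; it uses pipe() plus fcntl(FD_CLOEXEC) rather than pipe2() to stay portable.

```c
#define _DEFAULT_SOURCE   /* exposes vfork() on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] via vfork() + execv().
 * Returns 0 if the exec succeeded, the child's errno if the exec failed,
 * or -1 if parent-side setup failed. The child is reaped either way.
 */
int spawn_and_check(char *const argv[])
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    /* Close-on-exec: a successful exec closes the child's write end. */
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == -1 ||
        fcntl(fds[1], F_SETFD, FD_CLOEXEC) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = vfork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: only exec or _exit from here, never return. */
        execv(argv[0], argv);
        unsigned char err = (unsigned char)errno;   /* exec failed */
        ssize_t w = write(fds[1], &err, 1);
        (void)w;
        _exit(127);
    }

    /* Parent: drop our write end so that EOF becomes possible. */
    close(fds[1]);

    unsigned char code = 0;
    ssize_t n;
    do {
        n = read(fds[0], &code, 1);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);

    int status;
    while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
        ;

    if (n == 0)
        return 0;      /* EOF: the exec happened */
    if (n == 1)
        return code;   /* one byte: the child's errno */
    return -1;
}
```

The parent reads before it reaps, as described above: EOF means the write end vanished through a successful exec, one byte means the child reported its errno and exited.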

No, really, what you say about vfork() is lore, and very very wrong.

That said, vfork() blocks a thread in the parent. The point of my gist was to explain why fork() sucks, why vfork() is much better, and what would be better still.

> And I don't see what afork gets you that clone doesn't, except afork isn't as general.

afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.

clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.

  • > What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you're the author of the code calling vfork() and you know not to do that

    Within a sentence you described the stack modification. 'It's not a footgun, just don't make mistakes' doesn't hold a lot of water with me.

    > No, really, what you say about vfork() is lore, and very very wrong.

    Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".

    > afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.

    Ok, but I'm not going to hold it against clone for being a more general solution.

    > clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.

    I agree with this, but there are practical reasons why this isn't the case, mainly around how asking user space for every little thing is expensive, and large sparse structs to copy into kernel space covering basically everything in struct task sounds like a special kind of security hell I would not want to be a part of.

    A flag to clone to create an empty process and something like a bunch of io_uring calls or a box program to hydrate the new task state would be really neat, and has been kicked around a bunch. There's just a ton of corner cases that haven't been ironed out.

    • > 'It's not a footgun, just don't make mistakes.'

      fork() -> fork bombs -> fork() is a footgun!

      You have to know how to use it. Yes. So what?

      > Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".

      Links or it didn't happen :)


  • Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about forking. It's allowed to generate some non-obvious code so long as it acts "as if" it has the same behaviour - but only from the point of view of that theoretical C VM.

    For example, in theory _exit(1) could be implemented as longjmp(...) up to a point in some compiler-created top-level function that wraps up main(). Then that wrapper function could perform some steps to communicate the return code to the OS that trashes the stack before actually exiting. After all, if the process is about to exit anyway, what difference does it make if a bunch of memory is fiddled with? We know the answer to this but, from the point of view of the C virtual machine, it's irrelevant.

    That particular scenario is unlikely, but the point is that compiler implementations and optimisations are allowed to do very non-obvious things. You're only safe if you stick to the rules of the C standard, which this 100% does not.

    • > Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about forking.

      Luckily a C compiler that doesn't know about concepts outside of the C Virtual machine will not be able to compile a Linux executable or even dynamically load a library that exposes the vfork call (let alone try to execute the underlying system call directly).


  • Stack manipulations are a real problem. Say some parameter to exec after vfork uses stack slots created by the compiler for temporary variables. Sure, you compute those before the call to vfork, but then the compiler applies code motion...

    • This is bad:

          int exec_failed = 0;
          
          {
            some_type some_var;
          
            pid = vfork();
            if (pid == -1) err(1, "vfork() failed");
          
            if (pid == 0)        
              execve(...);
          
            /* oops, execve() failed */
            exec_failed = 1;
          }
          
          if (exec_failed)
            cleanup_code; /* bad! */
          
          /* parent */
      

      But, it's hard to write code like that instead of:

          pid = vfork();
          if (pid == -1) err(1, "vfork() failed");
          
          if (pid == 0) {
            execve(...);
            
            /* oops, execve failed */
            some_cleanup;
            _exit(1);
          }
          
          /* parent */
      

      You have to really try.


    • If "[s]tack manipulations are a real problem" (I say there are none if you're writing the code and know not to add any problematic stack manipulations) then avfork() should satisfy that concern.

I'm still struggling to understand the point of vfork(). The whole point of fork is to offload work to a different part of your program so the original part can continue to do work. The entire idea fails if it halts the original program for the duration of the child's life. How is this better than just doing a regular function call?

  • vfork halts the parent until the child exits or calls exec, getting its own address space. In the normal case, you vfork and immediately exec, and the parent continues on with what it was doing. The time between vfork and exec is “special” in that the child is temporarily running in the parent’s address space, then it uses exec to separate and do its own thing.

  • I've seen an argument for immediately execing and not marking the whole mutable process VA space as 'trap on write', including the thread stack that you're about to immediately write to, if you're going to throw that work away and exec(). There's also 'I want to support cheap forks on a nommu system and vforking is easier to retrofit in'.

    • That is the argument for vfork(), and it's been the argument for it since it was incepted, decades ago.

If you really think vfork() is hard to use because of the stack sharing, then avfork() should be good for you!