Comment by Strilanc

4 years ago

Another danger of using fork is that it duplicates the internal state of pseudorandom number generators. It's a great way to accidentally take the same random samples in every process, utterly trashing any statistics you were intending to do. Bonus: the Python multiprocessing module silently uses fork by default. Person A writes a "make multiprocessing convenient" library, Person B writes a sampling library, you put them together and... whoops!
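To make the failure concrete, here is a minimal sketch (POSIX only, since it calls `os.fork()` directly). The helper name `sample_in_children` is made up for illustration. It uses an explicit `random.Random()` instance because, as far as I know, recent CPython versions reseed only the module-level generator in forked children, not generators you create yourself.

```python
import os
import random


def sample_in_children(rng, n_children=3):
    """Fork n_children processes; each draws one number from the inherited rng."""
    results = []
    for _ in range(n_children):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: rng is an exact copy of the parent's generator state.
            os.close(read_fd)
            os.write(write_fd, repr(rng.random()).encode())
            os._exit(0)
        # Parent: collect the child's sample. The parent never advances rng,
        # so every child starts from the same state.
        os.close(write_fd)
        with os.fdopen(read_fd) as pipe:
            results.append(float(pipe.read()))
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    rng = random.Random()  # seeded from OS entropy, not a fixed seed
    print(sample_in_children(rng))  # every child reports the same value
```

Note that the generator here was seeded from system entropy, so nobody asked for repeatable output; the duplication comes purely from the fork.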

Libraries like that should use pthread_atfork() to automatically reset/reseed/whatever state as needed at fork() time.
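In Python the closest equivalent to `pthread_atfork()` is `os.register_at_fork()` (available since 3.7). A rough sketch of what a sampling library could do, assuming a library-level generator named `_rng` (hypothetical) that the user has not asked to be deterministic:

```python
import os
import random

# Hypothetical library-level generator, seeded from OS entropy.
_rng = random.Random()

# After every fork(), reseed the child's copy from OS entropy.
# Calling seed() with no argument pulls fresh entropy from the OS.
os.register_at_fork(after_in_child=_rng.seed)


def draw_in_children(n_children=3):
    """Fork n_children processes and return one sample from each."""
    results = []
    for _ in range(n_children):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: the after_in_child hook has already reseeded _rng.
            os.close(read_fd)
            os.write(write_fd, repr(_rng.random()).encode())
            os._exit(0)
        os.close(write_fd)
        with os.fdopen(read_fd) as pipe:
            results.append(float(pipe.read()))
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    print(draw_in_children())  # children now report different values
```

This only covers forks that go through Python's `os.fork()` path; a fork performed by a C extension behind the interpreter's back would presumably not trigger the hook.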

  • I don't think that's really a viable strategy in practice in an ecosystem as complex as Python's. There are too many libraries and too many little corner cases and interactions around what the behavior should be.

    For example, suppose I am using library A and I initialized the random number generator with a fixed seed. Clearly when I fork it's not appropriate for A to reseed, because I wanted fixed behavior. Something is very wrong, so there should probably be an exception. But now suppose I was using library B, which uses A and handles getting system entropy to seed A. Now it is clear that when I fork I probably want B to reseed A, but alas, A has already raised an exception because it was given a (from its perspective) fixed seed. So now A needs to be redesigned to take a seed plus some sort of intent about what should happen when forking, and oh my god wow, this is creating a lot of work for everyone everywhere. This is not actually going to be done consistently and cannot be trusted.

    • If you're writing a simulation or a test, then you'll want the PRNG to stay unchanged, and you'll want to be in control of any reseeding.

      For all other RNG uses, you really do want it to reseed.

      A cryptographic PRNG and a simulation PRNG are very different things, and should be separate libraries.
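Python's standard library already draws roughly this line, which gives a small illustration of the split described in the reply above. The function names `simulation_rng` and `session_token` are made up for this sketch:

```python
import random
import secrets


def simulation_rng(seed):
    """Deterministic generator for simulations and tests.

    The caller owns the seed and decides if and when to reseed,
    including what to do after a fork.
    """
    return random.Random(seed)


def session_token():
    """Unpredictable token for security purposes.

    secrets draws from the OS entropy source on every call, so there is
    no user-space generator state for fork() to duplicate.
    """
    return secrets.token_hex(16)


if __name__ == "__main__":
    a = simulation_rng(42)
    b = simulation_rng(42)
    print(a.random() == b.random())  # same seed, same sequence
    print(session_token())           # different on every call
```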

Reading up in the Python documentation, it seems to seed once from `/dev/urandom`, and then use its own generator to produce further random bits.

What's the purpose of this strategy, as opposed to deriving every single random value from `/dev/urandom`? Simply performance?

  • Reading from `/dev/urandom` requires a syscall, which can be extremely slow compared to running your own PRNG in-process.
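If you want to measure the gap yourself, a rough micro-benchmark might look like the sketch below. The function name `compare_sources` is made up, and `os.urandom()` stands in for a read from `/dev/urandom` (on Linux it typically goes through the `getrandom()` syscall rather than opening the device file). The actual numbers depend heavily on OS, kernel, and Python version, so none are quoted here.

```python
import os
import random
import timeit


def compare_sources(n=100_000):
    """Time n draws of 64 random bits from the OS versus an in-process PRNG.

    Returns (seconds_for_os_urandom, seconds_for_in_process_prng).
    """
    rng = random.Random()  # Mersenne Twister, seeded once from OS entropy
    t_os = timeit.timeit(lambda: os.urandom(8), number=n)
    t_prng = timeit.timeit(lambda: rng.getrandbits(64), number=n)
    return t_os, t_prng


if __name__ == "__main__":
    t_os, t_prng = compare_sources()
    print(f"os.urandom(8):    {t_os:.3f} s")
    print(f"getrandbits(64):  {t_prng:.3f} s")
```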