← Back to context

Comment by BigTTYGothGF

4 days ago

"If N = 300, even a 256-bit seed arbitrarily precludes all but an unknown, haphazardly selected, non-random, and infinitesimally small fraction of permissible assignments. This introduces enormous bias into the assignment process and makes total nonsense of the p-value computed by a randomization test."

The first sentence is obviously true, but I'm going to need to see some evidence for "enormous bias" and "total nonsense". Let's leave aside lousy/little/badly-seeded PRNGs. Are there any non-cryptographic examples in which a well-designed PRNG with 256 bits of well-seeded random state produces results different enough from a TRNG to be visible to a user?

The argument against PRNGs this paper makes isn't that the PRNG produces results that can be distinguished from TRNG, but that the 256-bit seed deterministically chooses a single shuffling. If you need 300 bits to truly shuffle the assignment but you only have 256 bits, then that's a lot of potential assignments that can never actually happen. With this argument it doesn't matter what the PRNG is, the fact that it's deterministic is all that matters. And this invalidates the p-value because the p-value assumes that all possible assignments are equiprobable, when in fact a lot of possible assignments have a probability of zero.

I imagine you could change the p-value test to randomly sample assignments generated via the exact same process that was used to generate the assignment used by the experiment, and as you run more and more iterations of this the calculated p-value should converge to the correct value, but then the question becomes is the p-value calculated this way the same as the p-value you'd get if you actually went ahead and used equiprobable assignment to begin with?

Ultimately, this all comes down to the fact that it's not hard to use true randomness for the whole thing, and true randomness produces statistically valid results, if you use true randomness for assignment then you can't screw up the p-value test, and so there's no reason at all to even consider how to safely use a PRNG here, all that does is open the door to messing up.

  • If you have 300 bits of shuffling entropy, you have a lot of potential assignments that can never happen because you won't test them before the universe runs out. No matter how you pick them.

    Of course a PRNG generates the same sequence every time with the same seed, but that's true of every RNG, even a TRNG where the "seed" is your current space and time coordinates. To get more results from the distribution you have to use more seeds. You can't just run an RNG once, get some value, and then declare the RNG is biased towards the value you got. That's not a useful definition of bias.

    • The number of possible assignments has to be effectively close to an integer multiple of the number of shuffles.

      It doesn't matter how many universes it would take to generate all of them, there are some assignments that are less likely.

      5 replies →

  • And why does it matter in the context of randomly assigning participants in an experiment into groups? It is not plausible that any theoretical "gaps" in the pseudorandomness are related to the effect you are trying to measure, and unlikely that there is a "pattern" created in how the participants get assigned. You just do one assignment. You do not need to pick a true random configuration, just one random enough.

    I assume that as long as p-values are concerned, the issue raised could very well be measured with simulations and permutations. I really doubt though that the distribution of p-values from pseudorandom assignments with gaps would not converge very fast to the "real" distribution you would get from all permuations due to some version of a law of large numbers. A lot of resampling/permutation techniques work by permuting a negligible fraction of all possible permutations, and the distribution of the statistics extracted converges pretty fast. As long as the way the gaps are formed are independent of the effects measured, it sounds implausible that the p-values one gets are problematic because of them.

  • P-values assume something weaker than "all assignments are equiprobable." If the subset of possible assignments is nice in the right ways (which any good PRNG will provide) then the resulting value will be approximately the same.

  • Gallant always uses TRNGs. Goofus always uses a high-quality PRNG (CSPRNG if you like) that's seeded with a TRNG. Everything else they do is identical. What are circumstances under which Goofus's conclusions would be meaningfully different than Gallant's?

    Suppose I'm doing something where I need N(0,1) random variates. I sample from U(0,1) being sure to use a TRNG, do my transformations, and everything's good, right? But my sample isn't U(0,1), I'm only able to get float64s (or float32s), and my transform isn't N(0,1) as there's going to be some value x above which P(z>x)=0. The theory behind what I'm trying to do assumes N(0,1) and so all my p-values are invalid.

    Nobody cares about that because we know that our methods are robust to this kind of discretization. Similarly I think nobody (most people) should care (too much) about having "only" 256 bits of entropy in their PRNG because our methods appear to be robust to that.

  • > then you can't screw up the p-value test

    Bakker and Wicherts (2011) would like to disagree! Apparently 15 % screw up the calculation of the p-value.

So here's how I would think about it intuitively:

We can create a balanced partitioning of the 300 turkeys with a 300 bit random number having an equal number of 1's and 0's.

Now suppose I randomly pick 300 bit number, still with equal 0's and 1's, but this time the first 20 bits are always 0's and the last 20 bits are always 1's. In this scenario, only the middle 260 bits (turkeys) are randomly assigned, and the remaining 40 are deterministic.

We can quibble over what constitutes an "enormous" bias, but the scenario above feels like an inadequate experiment design to me.

As it happens, log2(260 choose 130) ~= 256.

> Are there any non-cryptographic examples in which a well-designed PRNG with 256 bits of well-seeded random state produces results different enough from a TRNG to be visible to a user?

One example that comes to mind is shuffling a deck of playing cards. You need approximately 225 bits of entropy to ensure that every possible 52 card ordering can be represented. Suppose you wanted to simulate a game of blackjack with more than one deck or some other card game with more than 58 cards. 256 bits is not enough there.

  • It's an interesting observation and that's a nice example you provided but does it actually matter? Just because certain sequences can't occur doesn't necessarily mean the bias has any practical impact. It's bias in the theoretical sense but not, I would argue, in the practical sense that is actually relevant. At least it seems to me at first glance, but I would be interested to learn more if anyone thinks otherwise.

    For example. Suppose I have 2^128 unique playing cards. I randomly select 2^64 of them and place them in a deck. Someone proceeds to draw 2^8 cards from that deck, replacing and reshuffling between each draw. Does it really matter that those draws weren't technically independent with respect to the larger set? In a sense they are independent so long as you view what happened as a single instance of a procedure that has multiple phases as opposed to multiple independent instances. And in practice with a state space so much larger than the sample set the theoretical aspect simply doesn't matter one way or the other.

    We can take this even farther. Don't replace and reshuffle after each card is drawn. Since we are only drawing 2^8 of 2^64 total cards this lack of independence won't actually matter in practice. You would need to replicate the experiment a truly absurd number of times in order to notice the issue.

    • If it had a practical impact, then it would imply that such statistical tests could be used as a distinguisher to attack the RNG. They fail as distinguishers, even with absolutely enormous amounts of data, so the bias is too small to have any influence in any practical experiment. You'd expect to need to observe 2^128 states to detect bias in a 256-bit CSPRNG, which means you'll have to store 2^128 observed states. That's around 10^20 EiB of storage needed. Good luck affording that with drive prices these days!

    • At a certain point a bias in the prng just has to become significant? This will be a function of the experiment. So I don’t think it’s possible to talk about a general lack of “practical impact” without specifying a particular experiment. Thinking abstractly - where an “experiment” is a deterministic function that takes the output of a prng and returns a result - an experiment that can be represented by a constant function will be immune to bias, while one which returns the nth bit of the random number will be susceptible to bias.

      1 reply →

  • > 256 bits is not enough there

    Yeah, but the question is: who cares?

    Suppose you and I are both simulating card shuffling. We have the exact same setup, and use a 256-bit well-behaved PRNG for randomness. We both re-seed every game from a TRNG. The difference is that you use all 256 bits for your seed, while I use just 128 and zero-pad the rest. The set of all shuffles that can be generated by your method is obviously much larger than the set that can be generated by mine.

    But again: who cares? What observable effect could there possibly be for anybody to take action if they know they're in a 128-bit world vs a 256-bit one?

    The analogy obviously doesn't generalize downwards, I'd be singing a different tune if it was, say, 32 bits instead of 128.

By the definition of a cryptographically secure PRNG, no. They, with overwhelming probability, produce results indistinguishable from truly random numbers no matter what procedure you use to tell them apart.

I think your intuition comes from the assumption that the experimental subjects are already coming to you in a random order. If that's the case, then you might as well assign the first half to control and the second half to treatment. To see the problem with poor randomization, you have to think about situations where there is (often unknown) bias or correlations in the order of the list that you're drawing from to randomize. Say you have an ordered list of 10 numbers, assigned 5 and 5 to control and (null) treatment groups. There are 252 assignments, which in theory should be equally likely. Assuming they all give different values of your statistic, you'll have 12 assignments with p <= .0476. If, say, you do the assignment from ~~a 256~~ an 8 bit random number such that 4 of the possible assignments are twice as likely as the others under your randomization procedure, the probability of getting one of those 12 assignments something between .0469 and .0625, depending whether the more-likely assignments happen to be among the 12 most extreme statistics, which is a difference of about 1/3 and could easily be the difference between "p>.05" and "p<.05". Again, if you start with your numbers in a random order, then this doesn't matter -- the biased assignment procedure will still give you a random assignment, because each initial number will be equally likely to be among the over-sampled or under-sampled ones.

Also worth noting that the situations where this matters are usually where your effect size is fairly small compared to the unexplained variation, so a few percent error in your p-value can make a difference.

  • > If, say, you do the assignment from a 256 bit random number such that 4 of the possible assignments are twice as likely as the others under your randomization procedure

    Your numbers don't make sense. Your number of assignments is way fewer than 2^256, so the problem the author is (mistakenly) concerned about doesn't arise--no sane method would result in any measurable deviation from equiprobable, certainly not "twice as likely".

    With a larger number of turkeys and thus assignments, the author is correct that some assignments must be impossible by a counting argument. They are incorrect that it matters--as long as the process of winnowing our set to 2^256 candidates isn't measurably biased (i.e., correlated with turkey weight ex television effects), it changes nothing. There is no difference between discarding a possible assignment because the CSPRNG algorithm choice excludes it (as we do for all but 2^256) and discarding it because the seed excludes it (as we do for all but one), as long as both processes are unbiased.

    • typo -- meant to say 8 bit random number i.e. having 256 possibilities, convenient just because the number of assignments was close to a power of 2. If instead you use a 248-sided die and have equal probabilities for all but 4 of the assignments, the result is similar but in the other direction. Of course there are many other more subtle ways that your distribution over assignments could go wrong, I was just picking one that was easy to analyze.

      4 replies →

MCMC can be difficult for this reason. There are concepts like "k-dimensional equidistribution" etc. etc... where in some ways the requirements of a PRNG are far, far, higher than a cryptographically sound PRNG, but also in many ways less so, because you don't care about adversarial issues, and would prefer speed.

If you can't generate all possible assignments, you care about second and third order properties etc. of the sequence.

  • Does there exist a single MCMC example that performs poorer when fed by a CSPRNG (with any fixed seed, including all zeroes; no state reuse within the simulation) as opposed to any other RNG source?

    • If there did, it'd be a distinguisher attack on that CSPRNG. So for a non-broken CSPRNG, the answer is "no", by the definition of "non-broken CSPRNG".

  • > There are concepts like "k-dimensional equidistribution" etc. etc... where in some ways the requirements of a PRNG are far, far, higher than a cryptographically sound PRNG

    Huh? If you can chew through however many gigabytes of the supposed CSPRNG’s output, do some statistics, and with a non-negligible probability tell if the bytes in fact came from the CSPRNG in question or an actual iid random source, then you’ve got a distinguisher and the CSPRNG is broken.

    • It all comes down to actual specific statistical tests, and how hard they are to break in specific applications.

      No CSPRNG is absolutely perfect, no CSPRNG has ever absolutely passed every statistical test thrown at it.

      In MCMC, it stresses very different statistical tests than the typical CSPRNG tests.

      Every PRNG is absolutely broken if you want to be absolute about it. MCMC and crypto applications push on different aspects where statistical issues will cause application level failures.

      See e.g. this paper https://www.cs.hmc.edu/tr/hmc-cs-2014-0905.pdf

      (it's not the end all be all, but it's a good survey of why this stuff matters and why it's different)

      15 replies →