Comment by jmalicki

1 day ago

The number of possible shuffles has to be effectively close to an integer multiple of the number of possible assignments; otherwise the assignments cannot all be equally likely.

It doesn't matter how many universes it would take to generate all of them; some assignments are simply less likely than others.
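A toy illustration with made-up sizes: eight equally likely generator outputs mapped onto three assignments by taking the remainder.

```python
from collections import Counter

# 8 is not a multiple of 3, so assignments 0 and 1 each receive three
# of the eight equally likely outputs while assignment 2 receives only
# two: a small but real bias toward the smaller remainders.
print(Counter(x % 3 for x in range(8)))  # Counter({0: 3, 1: 3, 2: 2})
```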

Everyone agrees that most of the possible shuffles become impossible when a CSPRNG with 256 bits of state is used. The question is just whether that matters practically. The original author seems to imply it does, but I believe they're mistaken.

Perhaps it would help to think of the randomization in two stages. In the first, we select 2^256 members from the set of all possible permutations. (This happens when we select our CSPRNG algorithm.) In the second, we select a single member from that set of 2^256. (This happens when we select our seed and run the CSPRNG.) I believe that measurable structure in either selection would imply a practical attack on the CSPRNG's underlying cryptographic algorithm, and no such attack is known for any common algorithm.
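As a minimal sketch of the two stages, here in Python with the standard library's Mersenne Twister standing in for the CSPRNG (so 2^256 counts the distinct 256-bit seeds, not anything about the generator's internal state):

```python
import random
import secrets

deck = list(range(52))

# Stage one happened at design time: fixing the algorithm determines
# which subset of the 52! permutations is reachable at all.
# Stage two happens at run time: drawing a 256-bit seed selects one
# member of that subset, so at most 2^256 distinct shuffles can occur.
seed = secrets.randbits(256)
random.Random(seed).shuffle(deck)
```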

  • Yeah, you're discarding almost all permutations, but in an unbiased manner. It seems to imply not only an attack, but also that your experimental results rely strongly and precisely on some extremely esoteric property of the randomization algorithm (otherwise it would have been found already). If you can only detect the effect of television on turkeys when using a PRNG whose output is appropriately likely to span a high-dimensional vector space when formatted as a binary square matrix, then I think you should probably go back to the drawing board.

The cases where the count is not close to a multiple are handled by rejecting a portion of the generated random numbers.

Let's say that you have a uniform random number generator, which generates any one of N numbers with equal probability. Then you want to choose one of M choices with equal probability.

If M divides N, then you can choose 1 of M either by multiplication and taking the integer part (i.e., floor(x*M/N)) or by division and taking the remainder (i.e., x mod M).

When M does not divide N, unbiased choice requires rejecting a part of the generated numbers, either before the arithmetic operation (equivalent to shrinking N down to a multiple of M) or after it (capping the maximum value of the integer part of the product, or of the division remainder, so that it matches M).

This is enough for handling the case when M < N.
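As a minimal sketch of the reject-before-the-arithmetic variant, here in Python with a hypothetical callable rand_n standing in for whatever generator is actually in use (it returns a uniform integer in [0, N)):

```python
def unbiased_choice(rand_n, n, m):
    # rand_n is a hypothetical stand-in for the generator in use: a
    # callable returning a uniform integer in [0, n). Assumes m <= n.
    limit = n - (n % m)  # the largest multiple of m not exceeding n
    while True:
        x = rand_n()
        # Reject the top n % m values; keeping them would make the
        # smallest results slightly more likely than the others.
        if x < limit:
            return x % m
```

Each attempt is rejected with probability (N mod M)/N, which is below M/N, so when N is much larger than M the loop almost never runs more than once.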

When M is greater than N, you can use a power of N that is greater than M (i.e., you use a tuple of generated numbers to make the choice) and then proceed exactly as before.

However, in this case you must trust that your RNG's output sequence is not autocorrelated.
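Continuing the hypothetical sketch above, the tuple construction just combines the k draws as base-N digits and then reuses the same rejection step:

```python
def unbiased_choice_tuple(rand_n, n, m, k):
    # Combine k draws from [0, n) as base-n digits to get a single
    # uniform value in [0, n**k); choose k so that n**k > m. This is
    # where the no-autocorrelation assumption enters: the k draws are
    # treated as independent digits.
    def rand_tuple():
        value = 0
        for _ in range(k):
            value = value * n + rand_n()  # append one base-n digit
        return value
    return unbiased_choice(rand_tuple, n ** k, m)
```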

If possible, it is preferable to use a bigger N from the start, but even when that is impossible, in most cases the unreachable parts of the space of random-number tuples will not make any statistical difference.

To be more certain of this, you may want to repeat the experiment with several generators that have the largest available N, taking care that they really have different structures, so that whichever part of the tuple space is inaccessible can be expected to differ from one generator to the next.

  • This is correct, but for the author's example of randomizing turkeys I wouldn't bother. A modern CSPRNG is fast enough that it's usually easier just to generate lots of excess randomness (so that the remainder is nonzero but tiny compared to the quotient and thus negligible) than to reject for exactly zero remainder.

    For example, the turkeys could be randomized by generating 256 bits of randomness per turkey, then sorting by that value and taking the first half of the list (a sketch follows at the end of this comment). By a counting argument this must be biased (since the number of possible assignments isn't usually a power of two), but the bias is negligible.

    The rejection methods may be faster, and thus beneficial in something like a Monte Carlo simulation that executes many times. Rejection methods are also often the simplest way to get distributions other than uniform. Otherwise, though, the additional complexity doesn't seem worthwhile to me: more effort and more risk of a coding mistake for no meaningful gain.
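    A minimal sketch of that split, with a made-up helper name and Python's secrets module standing in for the CSPRNG:

    ```python
    import secrets

    def random_split(subjects):
        # Tag each subject with 256 bits of randomness, sort on the
        # tag, and take the first half as the treatment group. Ties
        # fall back to list index, but a 256-bit collision is about
        # as likely as guessing the CSPRNG's seed outright.
        keyed = sorted((secrets.randbits(256), i) for i, _ in enumerate(subjects))
        half = len(subjects) // 2
        treatment = [subjects[i] for _, i in keyed[:half]]
        control = [subjects[i] for _, i in keyed[half:]]
        return treatment, control
    ```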