
Comment by adrian_b

20 hours ago

The cases where the range is not close to a multiple are handled by rejecting a part of the generated random numbers.

Let's say that you have a uniform random number generator, which generates any one of N numbers with equal probability. You then want to choose one of M options, also with equal probability.

If M divides N, then you can choose 1 of M either by multiplying by M/N and taking the integer part, or by dividing by M and taking the remainder.
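A minimal sketch of the two methods, assuming a generator that yields a uniform value in [0, N) (here `secrets.randbelow` stands in for it), with N = 256 and M = 4 so that M divides N:

```python
import secrets

N = 256
M = 4  # M divides N, so no rejection is needed

def choice_by_multiplication(x: int) -> int:
    # Integer part of x * M / N maps [0, N) onto M equal-sized bins.
    return (x * M) // N

def choice_by_remainder(x: int) -> int:
    # The remainder of division by M also partitions [0, N) evenly.
    return x % M

x = secrets.randbelow(N)  # uniform on [0, N)
assert 0 <= choice_by_multiplication(x) < M
assert 0 <= choice_by_remainder(x) < M
```

Both maps send exactly N/M inputs to each of the M outputs, which is why they are unbiased only when M divides N.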

When M does not divide N, unbiased choices require rejecting a part of the generated numbers: either before the arithmetic operation (equivalent to reducing N to a multiple of M), or after it (capping the maximum value of the integer part of the product, or of the division remainder, to match M).
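A sketch of rejection before the arithmetic, with the same stand-in generator and the hypothetical values N = 256, M = 6. The largest multiple of M not exceeding N is N - (N % M); any draw at or above it is rejected, so the remaining range divides evenly:

```python
import secrets

N = 256
M = 6  # does not divide N

def unbiased_choice() -> int:
    limit = N - (N % M)  # 252, the largest multiple of M below N
    while True:
        x = secrets.randbelow(N)  # stand-in for the raw generator
        if x < limit:
            return x % M  # unbiased over [0, M)
        # otherwise reject and draw again
```

The expected number of draws per choice is N / limit, which stays close to 1 unless N is barely larger than M.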

This is enough for handling the case when M < N.

When M is greater than N, you can use a power of N that is greater than M (i.e. you use a tuple of generated numbers to make the choice) and proceed as before.
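A sketch of the tuple construction, with hypothetical values N = 16 and M = 1000: three draws are combined into one base-N number, giving N³ = 4096 outcomes, and the rejection step from before is applied to that combined range:

```python
import secrets

N = 16    # base generator's range
M = 1000  # M > N, but M < N**3 = 4096

def draw_base() -> int:
    return secrets.randbelow(N)  # stand-in for the small generator

def unbiased_large_choice() -> int:
    k = 3                        # smallest power with N**k >= M
    space = N ** k               # 4096 combined outcomes
    limit = space - (space % M)  # 4000, a multiple of M
    while True:
        # Interpret a k-tuple of draws as one base-N number.
        x = 0
        for _ in range(k):
            x = x * N + draw_base()
        if x < limit:
            return x % M
```

Note that this construction is only unbiased if the k draws are independent, which is the autocorrelation concern raised below.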

However, in this case you must trust that your RNG's output sequence is not autocorrelated.

If possible, using a bigger N from the start is preferable, but even when that is impossible, in most cases the unreachable parts of the space of random-number tuples will make no statistical difference.

To be more certain of this, you may want to repeat the experiment with several of the generators with the largest available N, taking care that they really have different structures, so that whichever part of the tuple space is inaccessible can be expected to differ between them.

This is correct, but for the author's example of randomizing turkeys I wouldn't bother. A modern CSPRNG is fast enough that it's usually easier just to generate lots of excess randomness (so that the remainder is nonzero but tiny compared to the quotient and thus negligible) than to reject for exactly zero remainder.

For example, the turkeys could be randomized by generating 256 bits of randomness per turkey, then sorting by that and taking the first half of the list. By a counting argument this must be biased (since the number of assignments isn't usually a power of two), but the bias is negligible.
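A sketch of that scheme, with a hypothetical list of turkey names: each turkey gets a 256-bit random tag, the list is sorted by the tags, and the first half is taken:

```python
import secrets

turkeys = [f"turkey-{i}" for i in range(10)]  # hypothetical subjects

# 256-bit random key per turkey; key collisions are astronomically
# unlikely, so the sort order is effectively a uniform random shuffle.
tagged = sorted(turkeys, key=lambda t: secrets.token_bytes(32))

selected = tagged[: len(turkeys) // 2]  # a near-uniformly random half
```

The residual bias comes from ties and from the number of orderings not being a multiple of the number of possible assignments, both negligible at 256 bits.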

The rejection methods may be faster, and thus beneficial in something like a Monte Carlo simulation that executes many times. Rejection is also often the simplest way to get distributions other than uniform. Otherwise, the additional complexity doesn't seem worthwhile to me: more effort and more risk of a coding mistake for no meaningful gain.
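As an illustration of rejection beyond the bounded-integer case (my example, not from the comment above): drawing a point uniformly inside the unit disk by rejecting draws from the enclosing square, a distribution that is awkward to produce directly from uniform coordinates:

```python
import random

def point_in_unit_disk() -> tuple[float, float]:
    while True:
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # accept ~78.5% of draws (pi/4)
            return (x, y)
```

The same accept/reject pattern generalizes to sampling under an arbitrary density via an envelope distribution.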