Does repeating a biased random shuffle reduce the bias?
I'd like to produce fast random shuffles repeatedly with minimal bias.
It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.
To shuffle an array a of n elements:
    for i from n − 1 downto 1 do
        j ← random integer with 0 ≤ j ≤ i
        exchange a[j] and a[i]
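For concreteness, here is a minimal Python sketch of the pseudocode above. The function name `fisher_yates` and the `randbelow` parameter are my own naming, not from any library; passing the random source in as a parameter is just one way to make it easy to swap a good RNG for a biased one.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fisher_yates(a: List[T], randbelow: Callable[[int], int]) -> None:
    """Shuffle `a` in place.

    `randbelow(m)` is assumed to return an integer j with 0 <= j < m.
    The shuffle is only as unbiased as this function is.
    """
    for i in range(len(a) - 1, 0, -1):
        j = randbelow(i + 1)      # 0 <= j <= i
        a[i], a[j] = a[j], a[i]   # exchange a[j] and a[i]


deck = list(range(25))
rng = random.Random(12345)        # seeded only to make the run repeatable
fisher_yates(deck, rng.randrange)
print(deck)
```

Any callable with the same contract can be passed instead of `rng.randrange`, including a deliberately skewed one for experiments.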
But what if the RNG is biased (but fast)?
Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, then my permutations will be biased, but I believe this assumes that the 25-element array starts from the same state before each application of the shuffle algorithm. One problem, for example, is that if the RNG only has a period of 2^32 ≈ 4 × 10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ≈ 1.6 × 10^25 of them.
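The size gap can be checked directly. This is only the arithmetic from the paragraph above, assuming an RNG whose whole state is a position in a cycle of length 2^32, so that at most 2^32 distinct shuffles are reachable from a fixed starting array.

```python
import math

period = 2**32                    # assumed RNG period
perms = math.factorial(25)        # number of orderings of 25 elements

print(f"{period:.3e}")            # 4.295e+09
print(f"{perms:.3e}")             # 1.551e+25

# Upper bound on the fraction of permutations reachable from one fixed start
print(period / perms)
```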
My general question is, if I leave the shuffled elements shuffled before starting each new application of the Fisher-Yates shuffle, would this reduce the bias and/or allow the algorithm to produce every permutation?
My guess is that it would generally produce better results, but it seems that if the number of elements in the repeatedly shuffled array were related in some way to the underlying RNG (its period, for instance), the permutations could actually repeat more often than expected.
Does anyone know of any research that addresses this?
As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements on the end of the array that got swapped.) Then I start over using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG had a bias. Any thoughts on this?
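A sketch of the partial shuffle described above, as I read it: run only the first k steps of Fisher-Yates and take the last k slots as the sample, without resetting the array between draws. The name `draw_k` is hypothetical.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def draw_k(a: List[T], k: int, randbelow: Callable[[int], int]) -> List[T]:
    """Run the first k steps of Fisher-Yates on `a` in place.

    Returns the last k elements, which form the ordered sample.
    `randbelow(m)` is assumed to return an integer in [0, m).
    """
    n = len(a)
    for i in range(n - 1, n - 1 - k, -1):
        j = randbelow(i + 1)
        a[i], a[j] = a[j], a[i]
    return a[n - k:]


rng = random.Random(2024)
deck = list(range(25))
first = draw_k(deck, 5, rng.randrange)
second = draw_k(deck, 5, rng.randrange)   # deck is NOT reset in between
print(first, second)
```

With a truly uniform `randbelow`, the starting order makes no difference to the distribution of the sample, so reusing the array costs nothing in that case. Whether it helps with a biased source is exactly the open question here.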
I think it would be easier to test the partial shuffle case since there are only 6,375,600 possible permutations of 5 out of 25 elements, so are there any simple tests to use to check for biases?
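One simple check is a chi-square statistic over the counts of each ordered outcome. The sketch below uses a much smaller case (2 out of 5, so 20 outcomes) so that it runs quickly. The `skewed` source is a made-up example of modulo bias, not a model of any particular RNG, and `draw_k` and `chi_square` are hypothetical helper names.

```python
import random
from collections import Counter
from itertools import permutations


def draw_k(a, k, randbelow):
    """First k steps of Fisher-Yates; returns the last k slots as a tuple."""
    n = len(a)
    for i in range(n - 1, n - 1 - k, -1):
        j = randbelow(i + 1)
        a[i], a[j] = a[j], a[i]
    return tuple(a[n - k:])


def chi_square(n, k, randbelow, trials):
    """Chi-square statistic of observed outcome counts against uniform."""
    counts = Counter()
    for _ in range(trials):
        # Each trial starts from the same sorted array.
        counts[draw_k(list(range(n)), k, randbelow)] += 1
    cells = list(permutations(range(n), k))
    expected = trials / len(cells)
    return sum((counts[c] - expected) ** 2 / expected for c in cells)


fair_rng = random.Random(1)
skew_rng = random.Random(1)


def fair(m):
    return fair_rng.randrange(m)


def skewed(m):
    # 3 random bits reduced mod m: biased whenever m does not divide 8
    return skew_rng.getrandbits(3) % m


fair_stat = chi_square(5, 2, fair, 20000)
skew_stat = chi_square(5, 2, skewed, 20000)
print(fair_stat, skew_stat)
```

For an unbiased source the statistic should be in the neighbourhood of the degrees of freedom (19 here), while the skewed source gives a value far larger. For the full 5-of-25 case there are 6,375,600 cells, so by the usual rule of thumb of at least 5 expected counts per cell you would want on the order of 32 million draws or more.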
Comments (5)
This is only true as long as the seed determines every successive selection. As long as your RNG can be expected to deliver a precisely even distribution over the range specified for each next selection, then it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.
As for your side question, you might as well reseed for every draw. However, reseeding the generator is only useful if the new seed contains enough entropy. Time stamps don't contain much entropy, and neither do algorithmic calculations.
I'm not sure what this solution is part of because you have not listed it, but if you are trying to calculate something from a larger domain using random input, there are probably better methods.
A couple of points:
1) Anyone using the Fisher-Yates shuffle should read this and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely if you're going to have to repeat every shuffle 5 times to get the desired entropy, you're better off using a low-bias generator.
3) Do you have a setup where you can test this? If so, start trying things. Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results.
My feeling is that with a biased RNG, repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and on how biased it is).
So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?
Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques for it. A simple technique, due to von Neumann, is: given a bitstream from a biased RNG, take bits in pairs, throw away every (0,0) and (1,1) pair, return a 1 for every (1,0) pair and a 0 for every (0,1) pair. This technique assumes that every bit in the stream has the same probability of being a 0 or a 1 as any other bit, and that the bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one where fewer bits are discarded).
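The von Neumann step is short enough to sketch. This follows the pairing rule exactly as stated above; the 80%-ones input stream is only an illustrative assumption, and the more efficient Elias scheme is not shown.

```python
import random
from typing import Iterable, Iterator


def von_neumann(bits: Iterable[int]) -> Iterator[int]:
    """De-skew independent, identically biased bits.

    Takes bits in pairs; (1,0) -> 1, (0,1) -> 0, equal pairs are discarded.
    """
    it = iter(bits)
    for first, second in zip(it, it):   # consecutive, non-overlapping pairs
        if first != second:
            yield first


rng = random.Random(7)
biased = [1 if rng.random() < 0.8 else 0 for _ in range(100_000)]
out = list(von_neumann(biased))

print(sum(biased) / len(biased))   # close to 0.8
print(sum(out) / len(out))         # close to 0.5
print(len(out))                    # far fewer bits than went in
```

The price is throughput: with an 80/20 bias only about 16% as many bits come out as went in, which matters if the whole point of the RNG was speed.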
But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted using, for example, a technique based on the Fast Fourier Transform.
Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
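A rough sketch of the digest idea, using SHA-256 from Python's standard `hashlib`. The function name `whiten` and the 64-byte block size are arbitrary choices of mine. The output can only be as unpredictable as the input block, so each block is assumed to carry enough entropy for the 32 bytes it is compressed to.

```python
import hashlib
import random


def whiten(raw: bytes, block_size: int = 64) -> bytes:
    """Hash each full `block_size`-byte block of `raw` down to 32 bytes."""
    out = bytearray()
    for start in range(0, len(raw) - block_size + 1, block_size):
        out += hashlib.sha256(raw[start:start + block_size]).digest()
    return bytes(out)


# Illustrative biased source: bytes with only the low 3 bits ever set
rng = random.Random(99)
raw = bytes(rng.getrandbits(3) for _ in range(640))
clean = whiten(raw)
print(len(raw), len(clean))   # 640 320
```

Hashing costs time per block, so this trades away some of the speed advantage of a fast biased RNG.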
For further references on how to de-skew random number generators, I suggest reading the Randomness Recommendations for Security RFC.
My point is that the quality of the output of a randomized algorithm is upper-bounded by the entropy provided by the RNG: if the RNG is extremely biased, the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than the biased random bitstream contains. Worse: it will probably lose some random bits. Even assuming that the algorithm works with a biased RNG, to obtain a good result you'll have to put in a computational effort at least as great as the effort it would take to de-skew the RNG (and probably more, since you'll have to both run the algorithm and "defeat" the bias at the same time).
If your question is just theoretical, then please disregard this answer. If it is practical, then please seriously think about de-skewing your RNG instead of making assumptions about the output of the algorithm.
I can't completely answer your question, but this observation seemed too long for a comment.
What happens if you ensure that the number of random numbers pulled from your RNG for each run of Fisher-Yates has a high least common multiple with the RNG period? That may mean that you "waste" a random integer at the end of the algorithm. When shuffling 25 elements, you need 24 random numbers. If you pull one more random number at the end, making 25 random numbers per shuffle, a repetition is no longer guaranteed until well past the RNG period. Of course, the same 25 numbers could still occur in succession by chance before then. But since 25 and 2^32 have no common factor other than 1, you wouldn't hit a guaranteed repetition until 25*(2^32) draws. That isn't a huge improvement, but you said this RNG is fast. What if the "waste" value was much larger? It may still not be practical to get every permutation, but you could at least increase the number you can reach.
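The arithmetic behind this can be sketched as follows, assuming each shuffle consumes a fixed number of RNG outputs and the RNG output is purely periodic with period 2^32. The helper name is made up for illustration.

```python
from math import gcd


def shuffles_until_repeat(draws_per_shuffle: int, period: int) -> int:
    """Shuffles before a shuffle again starts at the same point of the cycle.

    Equals lcm(draws_per_shuffle, period) / draws_per_shuffle.
    """
    return period // gcd(draws_per_shuffle, period)


period = 2**32
print(shuffles_until_repeat(24, period))        # 536870912  (2^29)
print(shuffles_until_repeat(25, period))        # 4294967296 (2^32)
print(25 * shuffles_until_repeat(25, period))   # total draws: 25 * 2^32
```

So under these assumptions, going from 24 to 25 draws per shuffle raises the guaranteed-repeat horizon from 2^29 shuffles to 2^32 shuffles. Any odd number of draws per shuffle gives the same 2^32, so a larger "waste" value does not push it further for a power-of-two period.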
It depends entirely on the bias. In general I would say "don't count on it".
Biased algorithm that converges to non-biased:
Do nothing half of the time, and do a correct shuffle the other half. This converges towards non-biased exponentially. After n shuffles there is a 1-1/2^n chance that the result is non-biased and a 1/2^n chance that the input sequence has been left untouched.
Biased algorithm that stays biased:
Shuffle all elements except the last one. Permanently biased towards not moving the last element.
More General Example:
Think of a shuffle algorithm as a weighted directed graph of permutations, where the weights out of a node correspond to the probability of transitioning from one permutation to another when shuffled. A biased shuffle algorithm will have non-uniform weights.
Now suppose you filled one node in that graph with water, and water flowed from one node to the next based on the weights. The algorithm will converge to non-biased if the distribution of water converges to uniform no matter the starting node.
So in what cases will the water not spread out uniformly? Well, if you have a cycle of above-average weights, nodes in the cycle will tend to feed each other and stay above the average amount of water. They won't take all of it, since as they get more water the amount coming in decreases and the amount going out increases, but it will be above average.
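The "water" picture can be computed exactly for a tiny deck. The sketch below treats a shuffle algorithm as a probability weight on position permutations (my own encoding, called `mu` here), pushes the distribution through 10 rounds for 3 elements, and reports the largest gap from uniform for the two examples above.

```python
from fractions import Fraction
from itertools import permutations


def step(dist, mu):
    """One shuffle: move probability mass along the weighted edges."""
    out = {}
    for state, p in dist.items():
        for sigma, w in mu.items():
            nxt = tuple(state[s] for s in sigma)
            out[nxt] = out.get(nxt, Fraction(0)) + p * w
    return out


def max_deviation(mu, n_items, rounds):
    """Largest |P(state) - uniform| after `rounds` shuffles from sorted order."""
    dist = {tuple(range(n_items)): Fraction(1)}   # all the water in one node
    for _ in range(rounds):
        dist = step(dist, mu)
    states = list(permutations(range(n_items)))
    uniform = Fraction(1, len(states))
    return max(abs(dist.get(s, Fraction(0)) - uniform) for s in states)


all_perms = list(permutations(range(3)))

# Half the time do nothing, half the time a correct (uniform) shuffle
lazy = {s: Fraction(1, 12) for s in all_perms}
lazy[(0, 1, 2)] += Fraction(1, 2)

# Shuffle everything except the last element
fixed_last = {(0, 1, 2): Fraction(1, 2), (1, 0, 2): Fraction(1, 2)}

print(max_deviation(lazy, 3, 10))         # 5/6144
print(max_deviation(fixed_last, 3, 10))   # 1/3
```

The lazy shuffle's gap shrinks by half each round, while the fixed-last shuffle stays 1/3 away from uniform no matter how many rounds are run. Other weightings, such as one derived from a modulo-biased generator, can be tried by changing `mu`.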