Does repeating a biased random shuffle reduce the bias?

Posted 2024-09-25 15:16:56

I'd like to produce fast random shuffles repeatedly with minimal bias.

It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.

To shuffle an array a of n elements:
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]
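
For reference, a straightforward Python rendering of the pseudocode above might look like this; random.randint is just a stand-in for whichever RNG you actually plug in:

    import random

    def fisher_yates(a, rand_int=random.randint):
        # In-place Fisher-Yates shuffle; rand_int(lo, hi) must return an
        # integer in [lo, hi] inclusive.
        for i in range(len(a) - 1, 0, -1):
            j = rand_int(0, i)  # 0 <= j <= i
            a[j], a[i] = a[i], a[j]
        return a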

But what if the RNG is biased (but fast)?

Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, then my permutations will be biased, but I believe this assumes that the 25-element array starts from the same state before each application of the shuffle algorithm. One problem, for example, is that if the RNG only has a period of 2^32 ~ 10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ~ 10^25 permutations.

My general question is, if I leave the shuffled elements shuffled before starting each new application of the Fisher-Yates shuffle, would this reduce the bias and/or allow the algorithm to produce every permutation?

My guess is it would generally produce better results, but it seems like if the number of elements in the repeatedly shuffled array were related to the period of the underlying RNG, the permutations could actually repeat more often than expected.

Does anyone know of any research that addresses this?

As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements on the end of the array that got swapped.) Then I start over using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG had a bias. Any thoughts on this?
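
A sketch of that partial shuffle, stopping after 5 swaps (the function name and structure here are mine, not from the question):

    import random

    def partial_fisher_yates(a, k, rand_int=random.randint):
        # Run only the first k steps of Fisher-Yates; the last k positions
        # of `a` then hold a random k-permutation of its elements.
        n = len(a)
        for i in range(n - 1, n - 1 - k, -1):
            j = rand_int(0, i)
            a[j], a[i] = a[i], a[j]
        return a[-k:]  # the k elements on the end of the array that got swapped

    deck = list(range(25))
    first_pick = partial_fisher_yates(deck, 5)
    second_pick = partial_fisher_yates(deck, 5)  # reuses the partially shuffled deck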

I think it would be easier to test the partial shuffle case since there are only 6,375,600 possible permutations of 5 out of 25 elements, so are there any simple tests to use to check for biases?
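
One concrete option, sketched below under the assumption that you can afford a sample much larger than the 6,375,600 categories (the usual rule of thumb wants an expected count of at least ~5 per category): count how often each 5-permutation occurs and run a chi-squared goodness-of-fit test against uniform. scipy.stats.chisquare does the test; the rest is illustrative scaffolding:

    import itertools
    from collections import Counter
    from scipy.stats import chisquare

    def chi_square_uniformity(samples, n=25, k=5):
        # `samples` is a list of observed 5-tuples from the shuffle under test.
        # Zero counts for never-seen permutations must be included.
        counts = Counter(samples)
        f_obs = [counts.get(p, 0) for p in itertools.permutations(range(n), k)]
        return chisquare(f_obs)  # default null hypothesis: uniform frequencies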

Comments (5)

痴情换悲伤 2024-10-02 15:16:56

    if the RNG only has a period of 2^32 ~ 10^9 we can not produce every possible permutation of the 25 elements because this is 25! ~ 10^25 permutations

This is only true as long as the seed determines every successive selection. As long as your RNG can be expected to deliver a precisely even distribution over the range specified for each next selection, it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.

As for your side question, you might as well reseed for every draw. However, reseeding the generator is only useful if the new seed contains enough entropy. Time stamps don't contain much entropy, and neither do algorithmic calculations.

I'm not sure what this solution is part of, because you haven't described it, but if you are trying to compute something from a larger domain using random input, there are probably better methods.

┊风居住的梦幻卍 2024-10-02 15:16:56

A couple of points:

1) Anyone using the Fisher-Yates shuffle should read this and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely if you have to repeat every shuffle 5 times to get the desired entropy, you're better off using a low-bias generator.
3) Do you have a setup where you can test this? If so, start trying things. Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results; see the sketch below.
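
In that spirit, a minimal test rig: shuffle a tiny deck many times and print how often each permutation shows up, so bias is visible at a glance. The naive_shuffle below is the classic buggy variant (it swaps with an index drawn from the whole array at every step), included purely as a known-bad reference:

    import random
    from collections import Counter
    from itertools import permutations

    def naive_shuffle(a):
        # Known-biased variant: j ranges over the whole array each step.
        n = len(a)
        for i in range(n):
            j = random.randint(0, n - 1)
            a[j], a[i] = a[i], a[j]

    def histogram(shuffle_fn, n=3, trials=60_000):
        counts = Counter()
        for _ in range(trials):
            deck = list(range(n))
            shuffle_fn(deck)
            counts[tuple(deck)] += 1
        for p in permutations(range(n)):
            print(p, counts[p])

    histogram(naive_shuffle)   # visibly non-uniform counts
    histogram(random.shuffle)  # each permutation near trials / 6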

雨后彩虹 2024-10-02 15:16:56

My feeling is that with a biased RNG, repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and on how biased it is).

So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?

Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques for it. A simple technique, due to von Neumann, is: given a bitstream from a biased RNG, take bits in pairs, throw away every (0,0) and (1,1) pair, and return a 1 for every (1,0) pair and a 0 for every (0,1) pair. This technique assumes that each bit in the stream has the same probability of being a 0 or 1 as any other bit, and that the bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one where fewer bits are discarded).
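
A minimal sketch of von Neumann's pairing trick in Python; biased_bits is a toy stand-in for whatever biased source you actually have:

    import random

    def biased_bits(p_one=0.7):
        # Toy biased source: emits 1 with probability p_one.
        while True:
            yield 1 if random.random() < p_one else 0

    def von_neumann_debias(bits):
        # Pair up bits; drop (0,0) and (1,1), map (1,0) -> 1 and (0,1) -> 0.
        it = iter(bits)
        for a, b in zip(it, it):
            if a != b:
                yield a

    unbiased = von_neumann_debias(biased_bits())
    sample = [next(unbiased) for _ in range(10)]  # unbiased if inputs are i.i.d.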

But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted for example with a technique based on the Fast Fourier Transform.

Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
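
A minimal sketch of that idea using SHA-256 from Python's standard library (the block size and usage here are my own choices, not a vetted construction):

    import hashlib

    def whiten(biased_block: bytes) -> bytes:
        # Condense a block of biased input into 32 output bytes. The input
        # block should contain comfortably more than 256 bits of actual
        # entropy for the output to be close to uniform.
        return hashlib.sha256(biased_block).digest()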

For further references on how to de-skew random number generators, I suggest you read the Randomness Recommendations for Security RFC (RFC 1750).

My point is that the quality of the output of a random-based algorithm is upper bounded by the entropy provided by the RNG: if it is extremely biased, the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than is contained in the biased random bitstream. Worse: it will probably lose some random bits. Even assuming that the algorithm works with a biased RNG, to obtain good results you'll have to put in computational effort at least as great as the effort it would take to de-skew the RNG (and it will probably require more effort, since you have to both run the algorithm and "defeat" the biasing at the same time).

If your question is just theoretical, then please disregard this answer. If it is practical, then please seriously consider de-skewing your RNG instead of making assumptions about the output of the algorithm.

银河中√捞星星 2024-10-02 15:16:56

I can't completely answer your question, but this observation seemed too long for a comment.

What happens if you ensure that the number of random numbers pulled from your RNG for each iteration of Fisher-Yates has a high least common multiple with the RNG period? That may mean you "waste" a random integer at the end of the algorithm. When shuffling 25 elements, you need 24 random numbers. If you pull one more random number at the end, making 25 draws per shuffle, the sequence of shuffles is not guaranteed to repeat until well beyond the RNG period. Of course, you could still randomly get the same 25 numbers in succession before reaching the period. But since 25 shares no common factor other than 1 with 2^32, you wouldn't hit a guaranteed repetition until 25*(2^32) draws. Now, that isn't a huge improvement, but you said this RNG is fast. What if the "waste" value were much larger? It may still not be practical to get every permutation, but you could at least increase the number you can reach.
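
To make the arithmetic concrete, a small sketch assuming a period of exactly 2^32, as in the question; it computes how many shuffles occur before the RNG stream realigns to the same offset:

    from math import gcd

    PERIOD = 2**32  # assumed RNG period

    def shuffles_before_realignment(draws_per_shuffle, period=PERIOD):
        # The stream realigns after period / gcd(draws, period) shuffles.
        return period // gcd(draws_per_shuffle, period)

    print(shuffles_before_realignment(24))  # gcd(24, 2^32) = 8  -> 2^29 shuffles
    print(shuffles_before_realignment(25))  # 25 coprime to 2^32 -> 2^32 shuffles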

躲猫猫 2024-10-02 15:16:56

It depends entirely on the bias. In general I would say "don't count on it".

Biased algorithm that converges to non-biased:

Do nothing half of the time, and perform a correct shuffle the other half. This converges towards non-biased exponentially: after n shuffles there is a 1 - 1/2^n chance that at least one correct shuffle has happened (making the result uniform) and a 1/2^n chance that the array is still the input sequence.

Biased algorithm that stays biased:

Shuffle all elements except the last one. Permanently biased towards not moving the last element.

More General Example:

Think of a shuffle algorithm as a weighted directed graph of permutations, where the weights out of a node correspond to the probability of transitioning from one permutation to another when shuffled. A biased shuffle algorithm will have non-uniform weights.

Now suppose you filled one node in that graph with water, and water flowed from one node to the next based on the weights. The algorithm will converge to non-biased if the distribution of water converges to uniform no matter the starting node.

So in what cases will the water not spread out uniformly? Well, if you have a cycle of above-average weights, nodes in the cycle will tend to feed each other and stay above the average amount of water. They won't take all of it, since as they get more water the amount coming in decreases and the amount going out increases, but it will be above average.
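
One way to check this numerically: build the transition matrix over all permutations of a tiny deck and iterate it to see whether the distribution converges to uniform. As a sketch, the matrix below models the "lazy" shuffle from the first example above (do nothing with probability p_noop, otherwise jump to a uniformly random permutation):

    import itertools
    import numpy as np

    def lazy_shuffle_matrix(n=3, p_noop=0.5):
        perms = list(itertools.permutations(range(n)))
        k = len(perms)
        # From any state: uniform jump with prob 1 - p_noop, stay with p_noop.
        T = np.full((k, k), (1 - p_noop) / k)
        T += np.eye(k) * p_noop
        return T

    T = lazy_shuffle_matrix()
    dist = np.zeros(T.shape[0])
    dist[0] = 1.0  # start all the "water" on one fixed permutation
    for _ in range(50):
        dist = dist @ T
    print(dist)  # every entry approaches 1/6: the water has spread out evenly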
