Shuffling a massive range of numbers using minimal storage

Posted on 2024-11-09 19:49:41 · 802 words · 8 views · 0 comments

I've got a very large range of numbers, (1..1236401668096), that I would like to 'shuffle', i.e. traverse in random order without visiting the same number twice. I will be running a Web service, and each time a request comes in it will increment a counter and pull the next 'shuffled' number from the range. The algorithm has to cope with the server going offline: it must be able to resume the traversal from the persisted value of the counter (much like seeding a pseudo-random number generator, where the seed and the iteration number together always give the same pseudo-random number).

I'm wondering if such an algorithm exists or is feasible. I've seen the Fisher-Yates Shuffle, but the first step is to "Write down the numbers from 1 to N", which would take terabytes of storage for my entire range. Generating a pseudo-random number for each request and rejecting duplicates might work for a while, but as the database/tree fills up, collisions will become more common and could degrade performance (by my calculation, each new number already has a 0.08% chance of colliding after 1 billion hits). Is there a better solution for my scenario, or is this just a pipe dream?
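For reference, the 0.08% figure appears to be the chance that a single fresh random draw hits a number that has already been issued. A quick check, with variable names chosen here purely for illustration:

```ruby
# Chance that one uniformly random draw from the range lands on a
# number that was already handed out (illustrative variable names).
range_size = 1_236_401_668_096
issued = 1_000_000_000

collision_chance = issued.to_f / range_size
puts format("%.4f%%", collision_chance * 100)
```

The chance grows linearly with the number of values issued, so the retry cost only becomes significant once a large fraction of the range has been used.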

The reason for the shuffling is that being able to correctly guess the next number in the sequence could lead to a minor DoS vulnerability in my app, but also because the presentation layer will look much nicer with a wider number distribution (I'd rather not go into details about exactly what the app does). At this point I'm considering either just using a PRNG and dealing with collisions, or shuffling the range one slice at a time (starting with (1..10000000).to_a.shuffle, then (10000001..20000000).to_a.shuffle, and so on as each slice's numbers run out).
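A minimal sketch of the slice idea, assuming a fixed seed is stored next to the counter so that a restart reproduces the same order. The helper name `slice_shuffled_value` and its parameters are made up for this example, and the toy sizes keep it quick to run:

```ruby
# Hypothetical helper: returns the value for a 0-based counter when the
# range 1..total is shuffled one slice at a time. Only the current slice
# is ever materialized in memory.
def slice_shuffled_value(counter, slice_size:, total:, seed:)
  raise RangeError, "counter exceeds range" if counter >= total

  slice_index, offset = counter.divmod(slice_size)
  first = slice_index * slice_size + 1
  last = [first + slice_size - 1, total].min

  # Seeding per slice means the same slice always shuffles the same way,
  # so the persisted counter is enough to resume after a restart.
  rng = Random.new(seed + slice_index)
  (first..last).to_a.shuffle(random: rng)[offset]
end

# Toy sizes: 30 numbers in slices of 10.
values = (0...30).map do |counter|
  slice_shuffled_value(counter, slice_size: 10, total: 30, seed: 42)
end
```

This reshuffles the slice on every call for simplicity; a real service would presumably cache the current slice. It also shows the drawback of the approach: consecutive numbers always fall within the same slice, so the distribution is not very wide.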

Any mathemagicians out there have any better ideas/suggestions?

Comments (3)

青衫负雪 2024-11-16 19:49:42

Concatenate a PRNG or LFSR sequence with /dev/random bits

There are several algorithms that can generate pseudo-random numbers with arbitrarily large and known periods. The two obvious candidates are the LCPRNG (LCG) and the LFSR, but there are more algorithms such as the Mersenne Twister.

The period of these generators can be easily constructed to fit your requirements and then you simply won't have collisions.

You could deal with the predictable behavior of PRNGs and LFSRs by appending 10, 20, or 30 bits of cryptographically hashed entropy from an interface like /dev/random. Because the deterministic part of your number is known to be unique, it makes no difference if the truly random part ever repeats.
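A rough sketch of this idea, using a toy LCG with modulus 16 so the whole cycle can be checked by eye. The helper names and parameters are invented for the example, and `SecureRandom` stands in for reading /dev/random directly. An LCG visits every value in 0...m exactly once per cycle when c is coprime to m, a - 1 is divisible by every prime factor of m, and a - 1 is divisible by 4 whenever m is (the Hull-Dobell conditions):

```ruby
require "securerandom"

# Hypothetical helper: one step of a linear congruential generator.
def lcg_next(state, a:, c:, m:)
  (a * state + c) % m
end

# Hypothetical helper: append `bits` random bits below a unique value.
# The high part stays unique, so the combined IDs cannot collide.
def with_random_suffix(unique, bits: 20)
  (unique << bits) | SecureRandom.random_number(1 << bits)
end

# Toy parameters satisfying the full-period conditions for m = 16.
m = 16
a = 5
c = 3

state = 0
seen = []
m.times do
  state = lcg_next(state, a: a, c: c, m: m)
  seen << state
end

ids = seen.map { |value| with_random_suffix(value) }
```

Persisting `state` (rather than only the counter) lets the service resume where it left off. Note that appending random bits makes the resulting IDs larger than the original range, which may or may not be acceptable for the asker's application.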

伴我老 2024-11-16 19:49:42

Divide and conquer? Break the range down into manageable chunks and shuffle those. You could divide the numbers into groups, e.g. by their value modulo n. Each group can be generated on demand, and its size depends on n. Once a group is exhausted, you move on to the next one.

For example, if you choose an n of 1000, you create 1000 different groups. Pick a random number between 0 and 999 (let's call this x) and shuffle the numbers whose value modulo 1000 equals x. Once you have exhausted that group, choose a new random number between 0 and 999 (excluding x, obviously) to get the next subset to shuffle. Keeping track of which of the 1000 residues have already been used shouldn't be a challenge, so you'd just need a repeatable shuffle algorithm for the numbers in the subset (e.g. Fisher-Yates on their "indices").
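Here is one way this could look, as a sketch on a toy range of 1..50 with n = 7. The helper name `shuffled_group` and the seeding scheme are assumptions made for illustration; residues run from 0 to n - 1:

```ruby
# Hypothetical helper: the members of 1..limit whose value modulo n
# equals r, in a repeatable shuffled order derived from `seed`.
def shuffled_group(r, n:, limit:, seed:)
  first = r.zero? ? n : r          # smallest member of 1..limit in this group
  return [] if first > limit

  count = (limit - first) / n + 1  # number of members in this group
  # Shuffle the indices 0...count, then map each index to its number.
  indices = (0...count).to_a.shuffle(random: Random.new(seed + r))
  indices.map { |i| first + i * n }
end

limit = 50
n = 7
seed = 1234

# Visit the residue classes in a repeatable random order.
group_order = (0...n).to_a.shuffle(random: Random.new(seed))
sequence = group_order.flat_map do |r|
  shuffled_group(r, n: n, limit: limit, seed: seed)
end
```

For the range in the question, n = 1000 still leaves over a billion numbers per group, so a much larger n would probably be needed before an in-memory shuffle of one group's indices is practical.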

灼痛 2024-11-16 19:49:42

I guess the best option is to use a GUID/UUID. They are made for this type of thing, and it shouldn't be hard to find an existing implementation to suit your needs.

While collisions are theoretically possible, they are extremely unlikely. To quote Wikipedia:

The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs
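In Ruby, for instance, a random (version 4) UUID is available from the standard library:

```ruby
require "securerandom"

# Generates a random version 4 UUID as a 36-character string.
id = SecureRandom.uuid
puts id
```

Keep in mind that a UUID is a 128-bit value, so this only helps if the identifiers do not have to stay inside the original 1..1236401668096 range.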
