C++ 的随机唯一子集的最快方法tr1 无序集

发布于 2024-10-08 04:04:51 字数 1592 浏览 10 评论 0原文

这个问题与这个，更准确地说是这个答案。

这里是：我有一个无符号整数的 C++/TR1 unordered_set U （粗略基数 100-50000，粗略值范围 0 到 10^6）。给定基数 N，我想尽快迭代 N 随机，但是 U 的唯一成员。 N 没有典型值，但应该有小 N 工作速度快。

更详细地说，这里的“随机性”概念是两个调用应该产生一些不同的子集——不同的越多，越好，但这并不是太重要。例如，我会很高兴连续（或环绕连续） U 的 N 个成员组成的块，只要块的起始索引是随机的。相同成本下非连续的更好，但主要关心的是速度。 U 变化温和地，但在调用之间不断（在调用之间插入/删除约 0-10 个元素）。

我已经走了多远：

简单的方法：
选择随机索引i，使得(i+N-1) |U|。获取一个指向 U.begin() 的迭代器 it，使用 it++ 将其推进 i 次，然后开始子集上的实际循环。优点：简单。缺点：浪费++'es。
桶方法（这是我从上面的链接“新”导出的）：
如上选择i，找到第i元素所在的桶b，得到一个local_iterator lit 到U.begin(b)，通过lit++前进lit，直到我们到达<的第i个元素code>U，从那时起，继续递增 lit 次 N 次。如果我们到达了桶底我们从下一个桶的开头继续lit。如果我想成功更随机，我可以完全随机地选择i并环绕桶。

我的开放性问题：

对于上面的第 2 点，我是否真的无法以某种方式获得一旦我找到第 i 个元素，迭代器就会进入 U 中？这会饶了我桶边界控制等。对我来说初学者，标准前向迭代器应该知道如何当到达第 i 项时继续遍历 U，但是当我自己找到第 i 项时，除了上面的第 2 点之外，不可能遍历 U。
我还能做什么？你知道有什么更聪明/更随机的事情吗？如果可以的话我不想参与手动控制存储桶大小、哈希函数等，因为这有点超出我的能力范围。

原文

This question is related to
this one, and more precisely to this answer to it.

Here goes: I have a C++/TR1 unordered_set U of unsigned ints (rough cardinality 100-50000, rough value range 0 to 10^6).
Given a cardinality N, I want to as quickly as possible iterate over N random but
unique members of U. There is no typical value for N, but it should
work fast for small N.

In more detail, the notion of "randomness" here is
that two calls should produce somewhat different subsets -- the more different,
the better, but this is not too crucial. I would e.g. be happy with a continuous
(or wrapped-around continuous)
block of N members of U, as long as the start index of the block is random.
Non-continuous at the same cost is better, but the main concern is speed. U changes
mildly, but constantly between calls (ca. 0-10 elements inserted/erased between calls).

How far I've come:

Trivial approach:
Pick random index i such that (i+N-1) < |U|.
Get an iterator it to U.begin(), advance it i times using it++, and then start
the actual loop over the subset. Advantage: easy. Disadvantage: waste of ++'es.
The bucket approach (and this I've "newly" derived from above link):
Pick i as above, find the bucket b in which the i-th element is in, get a local_iterator lit
to U.begin(b), advance lit via lit++ until we hit the i-th element of U, and from then on keep incrementing lit for N times. If we hit the end of the bucket,
we continue with lit from the beginning of the next bucket. If I want to make it
more random I can pick i completely random and wrap around the buckets.

My open questions:

For point 2 above, is it really the case that I cannot somehow get an
iterator into U once I've found the i-th element? This would spare me
the bucket boundary control, etc. For me as quite a
beginner, it seems unperceivable that the standard forward iterator should know how to
continue traversing U when at the i-th item, but when I found the i-th item myself,
it should not be possible to traverse U other than through point 2 above.
What else can I do? Do you know anything even much smarter/more random? If possible, I don't want to get involved in manual
control of bucket sizes, hash functions, and the like, as this is a bit over my head.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

醉生梦死 2024-10-15 04:04:51

根据您想要的运行时保证，有一种著名的 O(n) 算法，用于一次性从数字流中挑选 k 个随机元素。为了理解该算法，让我们首先看看我们只想从集合中选取一个元素的情况，然后我们将其推广到选取 k 个元素。这种方法的优点是，它不需要预先了解输入集的大小，并保证对元素进行可证明的均匀采样，这总是非常好的。

假设我们想从集合中挑选一个元素。为此，我们将遍历集合中的所有元素，并在每个点维护一个我们计划返回的候选元素。当我们迭代元素列表时，我们将以一定的概率更新我们的猜测，直到最后我们选择具有统一概率的单个元素。在每一点，我们将保持以下不变量：

看到 k 个元素后，当前选择前 k 个元素中的任何一个作为候选元素的概率为 1 / k。

如果我们在整个数组中保持这个不变式，那么在看到所有 n 个元素后，它们每个都有 1 / n 的机会成为候选元素。因此，候选元素已经以均匀随机概率被采样。

为了了解该算法是如何工作的，让我们考虑一下它必须做什么来维持不变量。假设我们刚刚看到了第一个元素。为了保持上述不变量，我们必须以概率 1 选择它，因此我们将对候选元素的初始猜测设置为第一个元素。

现在，当我们看到第二个元素时，我们需要保持每个元素以 1/2 的概率被选择的不变量，因为我们已经看到了两个元素。因此，假设我们以 1/2 的概率选择第二个元素。然后我们知道以下内容：

我们选择第二个元素的概率是 1/2。
我们选择第一个元素的概率是我们第一次选择它的概率 (1) 乘以我们没有选择第二个元素的概率 (1/2)。这也为 1/2。

所以此时仍然保持不变量！让我们看看当我们来到第三个元素时会发生什么。此时，我们需要确保每个元素以 1/3 的概率被选中。好吧，假设我们以 1/3 的概率选择最后一个元素。那么我们知道

我们选择第三个元素的概率是 1/3。
我们选择前两个元素中的任何一个的概率是在前两个步骤之后选择它的概率 (1/2) 乘以我们没有选择第三个元素的概率 (2/3)。计算结果为 1/3。

再次强调，不变量成立！

这里的一般模式如下所示：在我们看到 k 个元素之后，每个元素都有 1/k 的机会被选取。当我们看到第 (k + 1) 个元素时，我们以 1 / (k + 1) 的概率选择它。这意味着它被选择的概率为 1 / (k + 1)，并且它之前的所有元素被选择的概率等于我们之前选择它的概率 (1 / k) 并且没有选择 (k + 1)这次第 ) 个元素 (k / (k + 1))，这使得这些元素被选择的概率为 1 / (k + 1)。由于这在每一步都保持了不变性，因此我们得到了一个很棒的算法：

当您看到第一个元素时，选择它作为候选元素。
对于每个连续元素，以概率 1 / k 替换候选元素，其中 k 是迄今为止看到的元素数量。

其运行时间为 O(n)，需要 O(1) 空间，并从数据流中返回一个均匀随机的元素。

现在，如果我们想从集合中挑选 k 个元素，而不仅仅是一个，那么让我们看看如何扩展它以使其工作。这个想法与之前的算法非常相似（实际上最终是更通用的算法的一个特例）。我们不是维护一个候选者，而是维护 k 个不同的候选者，这些候选者存储在一个数组中，编号为 1, 2, ..., k。在每一点，我们都保持这个不变量：

看到m后> k 个元素，前 m 个元素中的任何一个被选择的概率为
千/米。

如果我们扫描整个数组，这意味着当我们完成后，每个元素都有被选择的概率 k / n。由于我们选择了 k 个不同的元素，这意味着我们从数组中均匀随机地采样元素。

该算法与之前类似。首先，以概率 1 从集合中选择前 k 个元素。这意味着当我们看到 k 个元素时，其中任何一个被选择的概率为 1 = k / k 并且不变量成立。现在，归纳假设在 m 次迭代后不变式成立并考虑第 (m + 1) 次迭代。选择 1 到 (m + 1)（含）之间的随机数。如果我们选择 1 到 k（含）之间的数字，则用下一个元素替换该候选元素。否则，不选择下一个元素。这意味着我们根据需要以 k / (m + 1) 的概率选择下一个元素。前 m 个元素中的任何一个被选择的概率就是它们之前被选择的概率 (k / m) 乘以我们没有选择包含该元素的槽的概率 (m / (m + 1))，其中给出按要求被选择的总概率 k / (m + 1)。通过归纳，这证明了该算法完全均匀地、随机地从集合中采样了 k 个元素！

而且，运行时间是 O(n)，它与集合的大小成正比，完全与你要选择的元素数量无关。它还仅使用 O(k) 内存，并且不对所存储元素的类型做出任何假设。

既然你想为C++做这件事，作为一种无耻的自我推销，我有一个这个算法的实现（写成STL算法）可在我的个人网站上找到。放心使用吧！

希望这有帮助！

Depending on what runtime guarantees you want, there's a famous O(n) algorithm for picking k random elements out of a stream of numbers in one pass. To understand the algorithm, let's see it first for the case where we want to pick just one element out of the set, then we'll generalize it to work for picking k elements. The advantage of this approach is that it doesn't require any advance knowledge of the size of the input set and guarantees provably uniform sampling of elements, which is always pretty nice.

Suppose that we want to pick one element out of the set. To do this, we'll make a pass over all of the elements in the set and at each point will maintain a candidate element that we're planning on returning. As we iterate across the list of elements, we'll update our guess with some probability until at the very end we've chosen a single element with uniform probability. At each point, we will maintain the following invariant:

After seeing k elements, the probability that any of the first k elements is currently chosen as the candidate element is 1 / k.

If we maintain this invariant across the entire array, then after seeing all n elements, each of them has a 1 / n chance of being the candidate element. Thus the candidate element has been sampled with uniformly random probability.

To see how the algorithm works, let's think about what it has to do to maintain the invariant. Suppose that we've just seen the very first element. To maintain the above invariant, we have to choose it with probability 1, so we'll set our initial guess of the candidate element to be the first element.

Now, when we come to the second element, we need to hold the invariant that each element is chosen with probability 1/2, since we've seen two elements. So let's suppose that with probability 1/2 we choose the second element. Then we know the following:

The probability that we've picked the second element is 1/2.
The probability that we've picked the first element is the probability that we chose it the first time around (1) times the probability that we didn't just pick the second element (1/2). This comes out to 1/2 as well.

So at this point the invariant is still maintained! Let's see what happens when we come to the third element. At this point, we need to ensure that each element is picked with probability 1/3. Well, suppose that with probability 1/3 we choose the last element. Then we know that

The probability that we've picked the third element is 1/3.
The probability that we've picked either of the first two elements is the probability that it was chosen after the first two steps (1/2) times the probability that we didn't choose the third element (2/3). This works out to 1/3.

So again, the invariant holds!

The general pattern here looks like this: After we've seen k elements, each of the elements has a 1/k chance of being picked. When we see the (k + 1)st element, we choose it with probability 1 / (k + 1). This means that it's chosen with probability 1 / (k + 1), and all of the elements before it are chosen with probability equal to the odds that we picked it before (1 / k) and didn't pick the (k + 1)st element this time (k / (k + 1)), which gives those elements each a probability of 1 / (k + 1) of being chosen. Since this maintains the invariant at each step, we've got ourselves a great algorithm:

Choose the first element as the candidate when you see it.
For each successive element, replace the candidate element with that element with probability 1 / k, where k is the number of elements seen so far.

This runs in O(n) time, requires O(1) space, and gives back a uniformly-random element out of the data stream.

Now, let's see how to scale this up to work if we want to pick k elements out of the set, not just one. The idea is extremely similar to the previous algorithm (which actually ends up being a special case of the more general one). Instead of maintaining one candidate, we maintain k different candidates, stored in an array that we number 1, 2, ..., k. At each point, we maintain this invariant:

After seeing m > k elements, the probability that any of the first m elements is chosen is
k / m.

If we scan across the entire array, this means that when we're done, each element has probability k / n of being chosen. Since we're picking k different elements, this means that we sample the elements out of the array uniformly at random.

The algorithm is similar to before. First, choose the first k elements out of the set with probability 1. This means that when we've seen k elements, the probability that any of them have been picked is 1 = k / k and the invariant holds. Now, assume inductively that the invariant holds after m iterations and consider the (m + 1)st iteration. Choose a random number between 1 and (m + 1), inclusive. If we choose a number between 1 and k (inclusive), replace that candidate element with the next element. Otherwise, do not choose the next element. This means that we pick the next element with probability k / (m + 1) as required. The probability that any of the first m elements are chosen is then the probability that they were chosen before (k / m) times the probability that we didn't choose the slot containing that element (m / (m + 1)), which gives a total probability of being chosen of k / (m + 1) as required. By induction, this proves that the algorithm perfectly uniformly and randomly samples k elements out of the set!

Moreover, the runtime is O(n), which is proportional to the size of the set, which is completely independent of the number of elements you want to choose. It also uses only O(k) memory and makes no assumptions whatsoever about the type of the elements being stored.

Since you're trying to do this for C++, as a shameless self-promotion, I have an implementation of this algorithm (written as an STL algorithm) available here on my personal website. Feel free to use it!

Hope this helps!

回复收藏 0 原文

~没有更多了~