水库取样问题

发布于 2024-08-28 09:48:12 字数 818 浏览 6 评论 0原文

这篇 MSDN 文章证明水库采样算法的正确性如下:


  1. < p>基本情况很简单。对于第 k+1 个 情况下,给定的概率 位置 <= k 的元素 i 在 R 中 是 s/k。

  2. i被替换的概率是第k+1个元素被选择的概率乘以i被选择被替换的概率,即:s/(k+1) * 1/s = 1/(k+1),并且i 未被替换的概率为 k/k+1。

  3. 因此,任何给定元素在 k+1 轮后持续存在的概率为:(在 k 步中选择,并且在 k 步中未删除)= s/k * k/(k+1),即 s/(k+ 1).

  4. 因此,当 k+1 = n 时,任何元素都以概率 s/n 出现。


关于第3步:

  • 提到的k+1轮是什么?

  • 什么是在k步中选择的,并且在k步中没有删除的

  • 为什么我们只计算前 s 步骤后已经存在于 R 中的元素的概率?

This MSDN article proves the correctness of Reservoir Sampling algorithm as follows:


  1. Base case is trivial. For the k+1st
    case, the probability a given
    element i with position <= k is in R
    is s/k.

  2. The probability i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.

  3. So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps) = s/k * k/(k+1), which is s/(k+1).

  4. So, when k+1 = n, any element is present with probability s/n.


about step 3:

  • What are the k+1 rounds mentioned?

  • What is chosen in k steps, and not removed in k steps?

  • Why are we only calculating this probability for elements that were already in R after the first s steps?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

守望孤独 2024-09-04 09:48:12

您熟悉归纳证明吗? k 只是算法的中间步骤,证明不变量自始至终都是正确的,在这种情况下,第 k 个项目将具有选择的概率s/k 代表所有 k

Are you familiar with proof by induction? k is just the intermediate step of the algorithm, proving that the invariant is true all throughout, in this case, that the probability the k-th item will have the probability chosen s/k for all k.

季末如歌 2024-09-04 09:48:12

我们从 k 个项目的流中采样 s 个项目(其中 k 非常大,因此我们逐项处理流)。

处理流中的每个项目称为“回合”。

在一轮中,我们可能会用新项目替换已经存在的元素之一。

“在 k 步骤中选择”意味着在该项目出现在流中的那一轮中,我们选择用它替换一些其他项目(但我们没有忽略它)。
“在 k 步骤中未删除”意味着从那一刻起,我们没有选择用流中的某些新项目替换该项目。

We're sampling s items from a stream of k items (where k is very large, so we process the stream item by item).

Processing each item from the stream is called a 'round'.

During a round, we perhaps replace one of the elements already present with the new item.

'Chosen in k steps' means that during the round where the item appeared in the stream, we chose to replace some other item with it (t.i. we didn't ignore it).
'Not removed in k steps' means that since that moment, we've not chosen to replace this item with some new item from the stream.

挥剑断情 2024-09-04 09:48:12
  • “k+1 轮”表示“在考虑输入序列中的第 (k+1) 个项目之后”
  • s/k 是给定项目在 k 个步骤后位于存储库中的概率(通过归纳法),k/ (k+1) 是在第 (k+1) 步中该项目不会被替换的概率,
  • 我们希望确保每个输入项目以相同的概率保留在存储库中。因此,在我们的计算中,我们只对每个步骤中保留在水库中的物品感兴趣。
  • "k+1 rounds" means "after (k+1)-th item from the input sequence has been considered"
  • s/k is the probability that the given item will be in the reservoir after k steps (by induction), k/(k+1) is the probability that the item won't be replaced on the (k+1)-th step
  • we want to ensure that every input item remains in the reservoir with the same probability. so, in our calculations we are interested only in items that remain in the reservoir on each step.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文