水库取样问题
这篇 MSDN 文章证明水库采样算法的正确性如下:
- < p>基本情况很简单。对于第 k+1 个 情况下,给定的概率 位置 <= k 的元素 i 在 R 中 是 s/k。
i被替换的概率是第k+1个元素被选择的概率乘以i被选择被替换的概率,即:s/(k+1) * 1/s = 1/(k+1),并且i 未被替换的概率为 k/k+1。
因此,任何给定元素在 k+1 轮后持续存在的概率为:(在 k 步中选择,并且在 k 步中未删除)= s/k * k/(k+1),即 s/(k+ 1).
因此,当 k+1 = n 时,任何元素都以概率 s/n 出现。
关于第3步:
提到的
k+1轮
是什么?什么是
在k步中选择的,并且在k步中没有删除的
?为什么我们只计算前
s
步骤后已经存在于R
中的元素的概率?
This MSDN article proves the correctness of Reservoir Sampling algorithm as follows:
Base case is trivial. For the k+1st
case, the probability a given
element i with position <= k is in R
is s/k.The probability i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.
So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps) = s/k * k/(k+1), which is s/(k+1).
So, when k+1 = n, any element is present with probability s/n.
about step 3:
What are the
k+1 rounds
mentioned?What is
chosen in k steps, and not removed in k steps
?Why are we only calculating this probability for elements that were already in
R
after the firsts
steps?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您熟悉归纳证明吗?
k
只是算法的中间步骤,证明不变量自始至终都是正确的,在这种情况下,第 k 个项目将具有选择的概率s/k
代表所有k
。Are you familiar with proof by induction?
k
is just the intermediate step of the algorithm, proving that the invariant is true all throughout, in this case, that the probability thek-th
item will have the probability chosens/k
for allk
.我们从 k 个项目的流中采样 s 个项目(其中 k 非常大,因此我们逐项处理流)。
处理流中的每个项目称为“回合”。
在一轮中,我们可能会用新项目替换已经存在的元素之一。
“在 k 步骤中选择”意味着在该项目出现在流中的那一轮中,我们选择用它替换一些其他项目(但我们没有忽略它)。
“在 k 步骤中未删除”意味着从那一刻起,我们没有选择用流中的某些新项目替换该项目。
We're sampling s items from a stream of k items (where k is very large, so we process the stream item by item).
Processing each item from the stream is called a 'round'.
During a round, we perhaps replace one of the elements already present with the new item.
'Chosen in k steps' means that during the round where the item appeared in the stream, we chose to replace some other item with it (t.i. we didn't ignore it).
'Not removed in k steps' means that since that moment, we've not chosen to replace this item with some new item from the stream.