Draw a huge sample (1e09) from a multinomial distribution with sample
I would like to sample from a multinomial distribution. I would do this by using sample and specifying some probabilities.
E.g.: I have 3 categories, and I want to sample 100 times.
> my_prob = c(0.2, 0.3, 0.5)
> x = sample(0:2, 100, replace = TRUE, prob = my_prob)
> head(x)
[1] 2 0 2 1 1 2
My actual setting differs in only one respect: I want to sample a lot of numbers (e.g. 1e09), and I am really only interested in the frequency of each category.
For the example above, that would be:
> table(x)
x
0 1 2
27 29 44
Does anybody have an idea how to compute this as efficiently as possible?
thanks,
steffi
Comments (2)
You need rmultinom.
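A minimal sketch of how rmultinom could be used here, assuming the probabilities from the question. Since only the per-category frequencies are wanted, a single multinomial draw of size 1e9 returns them directly, without ever building a vector of 1e9 values. The names n and counts are mine, for illustration only.

```r
my_prob <- c(0.2, 0.3, 0.5)
n <- 1e9                      # total number of draws (fits in an R integer)

set.seed(1)                   # only for reproducibility of the sketch

# One experiment of size n: rmultinom returns a 3 x 1 integer matrix
# whose entries are the counts per category.
counts <- rmultinom(1, size = n, prob = my_prob)[, 1]
names(counts) <- 0:2          # label the categories as in the question

counts                        # frequencies of categories 0, 1, 2
counts / n                    # relative frequencies, close to my_prob
```

Note that size has to stay below .Machine$integer.max (about 2.1e9), so much larger totals would need to be split into several draws and added up.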
If the problem is that you can't fit a vector of length 1e9 into RAM, then you can repeatedly calculate the table for a smaller number of samples and add up the totals.
Like Max said, you might prefer rmultinom over sample. Take the rowSums of his experiments variable.
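A rough sketch of both ideas, under some assumptions. The total is kept at 1e6 so it runs quickly (the question would use 1e9), and the chunk size is arbitrary. The code behind Max's experiments variable is not shown here, so I am assuming it is a matrix returned by rmultinom with one column per experiment. The names total, chunk, counts and counts2 are mine.

```r
my_prob <- c(0.2, 0.3, 0.5)
total <- 1e6                  # demo size; the question asks for 1e9
chunk <- 1e5                  # draws per chunk, chosen to fit in RAM

set.seed(1)                   # only for reproducibility of the sketch

# Variant 1: sample in chunks and add up the per-chunk tables.
# Doubles are used for the running totals to avoid integer overflow.
counts <- numeric(length(my_prob))
remaining <- total
while (remaining > 0) {
  m <- min(chunk, remaining)
  x <- sample(0:2, m, replace = TRUE, prob = my_prob)
  # tabulate counts positive integers, so shift categories 0..2 to 1..3
  counts <- counts + tabulate(x + 1L, nbins = length(my_prob))
  remaining <- remaining - m
}
names(counts) <- 0:2

# Variant 2: several multinomial experiments, then sum across them.
# Each column of experiments holds the counts of one experiment.
experiments <- rmultinom(total / chunk, size = chunk, prob = my_prob)
counts2 <- rowSums(experiments)
names(counts2) <- 0:2

counts
counts2
```

The second variant should be much cheaper, since it never materialises the individual draws; the first is mainly useful if you are committed to sample for some other reason.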