How to do weighted random sampling of categories in Python
Given a list of tuples, where each tuple consists of a probability and an item, I'd like to sample an item according to its probability. For example, given the list [(.3, 'a'), (.4, 'b'), (.3, 'c')], I'd like to sample 'b' 40% of the time.
What's the canonical way of doing this in Python?
I've looked at the random module, which doesn't seem to have an appropriate function, and at numpy.random, which, although it has a multinomial function, doesn't seem to return the results in a nice form for this problem. I'm basically looking for something like mnrnd in MATLAB.
Many thanks.
Thanks for all the answers so quickly. To clarify, I'm not looking for explanations of how to write a sampling scheme, but rather to be pointed to an easy way to sample from a multinomial distribution given a set of objects and weights, or to be told that no such function exists in a standard library and so one should write one's own.
Comments (9)
This might do what you want:
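For reference, here is a minimal sketch using only the standard library, assuming Python 3.6 or later, where random.choices accepts a weights argument. It is an illustration rather than the code originally posted with this answer; the helper name sample_weighted and its rng parameter are made up for the example.

```python
import random

def sample_weighted(pairs, k=1, rng=random):
    """Draw k items (with replacement) from a list of (probability, item) tuples.

    rng is a hypothetical hook: the random module itself or a random.Random
    instance, so that results can be made reproducible.
    """
    weights, items = zip(*pairs)
    # random.choices treats weights as relative, so they need not sum to 1.
    return rng.choices(items, weights=weights, k=k)

pairs = [(.3, 'a'), (.4, 'b'), (.3, 'c')]
print(sample_weighted(pairs, k=5))
```

Each call returns a list of k items, so a single draw is sample_weighted(pairs)[0].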
Since nobody used the numpy.random.choice function, here's one that will generate what you need in a single, compact line:
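A sketch of what such a call could look like, assuming NumPy is installed and that the probabilities sum to 1 (numpy.random.choice checks this). The variable names pairs and draws are only for illustration.

```python
import numpy as np

pairs = [(.3, 'a'), (.4, 'b'), (.3, 'c')]

# Items and probabilities are passed as two parallel sequences;
# size is the number of draws (with replacement by default).
draws = np.random.choice([item for _, item in pairs], size=10,
                         p=[prob for prob, _ in pairs])
print(draws)
```

The result is a NumPy array of items; draws.tolist() gives a plain Python list.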
How exactly would you like to receive the results?
There are hacks you can do if, for example, your probabilities fit nicely into percentages, etc.
For example, if you're fine with percentages, the following will work (at the cost of a high memory overhead):
But the "real" way to do it with arbitrary float probabilities is to sample from the cumulative distribution, after constructing it. This is equivalent to subdividing the unit interval [0,1] into 3 line segments labelled 'a', 'b', and 'c', then picking a random point on the unit interval and seeing which line segment it falls in.
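The line-segment picture can be sketched as a simple linear scan over the (probability, item) pairs. This is a minimal illustration under stated assumptions, not necessarily the code this answer originally contained: the name weighted_choice and the optional point argument (used only to make the behaviour easy to check) are hypothetical.

```python
import random

def weighted_choice(pairs, point=None):
    """Pick one item from a list of (probability, item) tuples.

    point is a hypothetical testing hook: a number in [0, total) that
    replaces the random draw.
    """
    total = sum(prob for prob, _ in pairs)
    r = random.random() * total if point is None else point
    upto = 0.0
    for prob, item in pairs:
        upto += prob          # right end of this item's segment
        if r < upto:          # r landed inside the segment
            return item
    # Floating-point rounding can leave r just past the last segment;
    # fall back to the last item that has a nonzero probability.
    for prob, item in reversed(pairs):
        if prob > 0:
            return item
    raise ValueError("all probabilities are zero")

print(weighted_choice([(.3, 'a'), (.4, 'b'), (.3, 'c')]))
```

Because the comparison is strict, an item whose probability is 0 has an empty segment and is never returned by the scan.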
One has to be careful of methods which return values even if their probability is 0. Fortunately this method does not, but just in case, one could insert the line if prob==0: continue.
For the record, here's the hackish way to do it:
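A rough sketch of such a lookup-table hack, assuming the probabilities are multiples of 1/resolution (percentages when resolution is 100). The function name build_table and the resolution parameter are hypothetical, and this is not necessarily the code the answer originally showed.

```python
import random

def build_table(pairs, resolution=100):
    """Expand (probability, item) tuples into a flat list of items.

    Each item is repeated round(prob * resolution) times, so memory use
    grows with resolution, and probabilities finer than 1/resolution are
    rounded away.
    """
    table = []
    for prob, item in pairs:
        table.extend([item] * round(prob * resolution))
    return table

table = build_table([(.3, 'a'), (.4, 'b'), (.3, 'c')])
# Once the table exists, every draw is a single uniform pick.
print(random.choice(table))
```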
However if you don't have resolution issues... this is actually probably the fastest way possible. =)
How about creating 3 "a", 4 "b" and 3 "c" in a list and then just randomly selecting one? With enough iterations you will get the desired probability.
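To see the frequencies converge, one could count the outcomes of many draws. A small sketch, where the pool list, the number of draws n and the freqs dictionary are names chosen only for this example:

```python
import random
from collections import Counter

pool = ['a'] * 3 + ['b'] * 4 + ['c'] * 3   # 10 entries: 3/10, 4/10, 3/10

n = 100_000
counts = Counter(random.choice(pool) for _ in range(n))
freqs = {item: count / n for item, count in counts.items()}
print(freqs)   # observed frequencies, which should be near .3, .4 and .3
```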
I reckon the multinomial function is still a fairly easy way to get samples of a distribution in random order. This is just one way,
where inputs is as specified,
[(.2, 'a'), (.4, 'b'), (.3, 'c')]
and size is the number of samples you need.
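One possible shape for a multinomial-based sampler, given as a sketch under assumptions rather than a reconstruction of this answer's code: it uses NumPy's Generator API, takes inputs as (probability, item) tuples whose probabilities sum to 1 (the question's list is used here), and the name sample_multinomial and the rng parameter are hypothetical.

```python
import numpy as np

def sample_multinomial(inputs, size, rng=None):
    """Return `size` items in random order, drawn according to `inputs`."""
    rng = np.random.default_rng() if rng is None else rng
    probs, items = zip(*inputs)
    # One multinomial draw gives the number of occurrences of each item.
    counts = rng.multinomial(size, probs)
    # Expand the counts into a flat list, then shuffle to randomize the order.
    samples = [item for item, n in zip(items, counts) for _ in range(n)]
    rng.shuffle(samples)
    return samples

print(sample_multinomial([(.3, 'a'), (.4, 'b'), (.3, 'c')], 10))
```

The multinomial draw only decides how many times each item appears; the shuffle is what puts the samples in random order.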
I'm not sure if this is the pythonic way of doing what you ask, but you could use
random.sample(['a','a','a','b','b','b','b','c','c','c'],k)
where k is the number of samples you want. Note that random.sample draws without replacement, so k cannot exceed the length of the list and the draws are not independent.
For a more robust method, divide the unit interval into sections based on the cumulative probability and draw from the uniform distribution on (0,1) using random.random(). In this case the subintervals would be (0,.3), (.3,.7) and (.7,1). You choose the element based on which subinterval the draw falls into.
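The subinterval lookup can be done with a binary search over the cumulative probabilities. A minimal sketch, assuming the probabilities sum to 1; the names make_sampler and sample, and the optional u argument used to fix the draw, are hypothetical.

```python
import bisect
import itertools
import random

def make_sampler(pairs):
    """Build a sampling function from a list of (probability, item) tuples."""
    probs, items = zip(*pairs)
    # Right endpoints of the subintervals, e.g. roughly [.3, .7, 1.0].
    cumulative = list(itertools.accumulate(probs))

    def sample(u=None):
        # u is a hypothetical hook to pass a fixed point in [0, 1).
        u = random.random() if u is None else u
        idx = bisect.bisect_right(cumulative, u)
        # Guard against rounding leaving the last endpoint slightly below 1.
        return items[min(idx, len(items) - 1)]

    return sample

sample = make_sampler([(.3, 'a'), (.4, 'b'), (.3, 'c')])
print(sample())
```

Building the cumulative list once and reusing the returned function keeps each draw at O(log n) in the number of items.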
Just inspired by sholte's very straightforward (and correct) answer: I'll just demonstrate how easy it is to extend it to handle arbitrary items, like:
Update: Based on the feedback of phant0m, it turns out that an even more straightforward solution can be implemented based on multinomial, like:
IMHO here we have a nice summary of empirical cdf and multinomial based sampling yielding similar results. So, in summary, pick the one that best suits your purposes.
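As an illustration of the empirical-cdf variant for arbitrary items, here is a vectorized sketch assuming NumPy. It is not the code from this answer; the name sample_items and the rng parameter are hypothetical.

```python
import numpy as np

def sample_items(pairs, size, rng=None):
    """Draw `size` items from (probability, item) tuples via the cdf."""
    rng = np.random.default_rng() if rng is None else rng
    probs, items = zip(*pairs)
    cdf = np.cumsum(probs)
    cdf /= cdf[-1]   # normalize so the last endpoint is exactly 1.0
    # For each uniform draw in [0, 1), find the segment it falls into.
    idx = np.searchsorted(cdf, rng.random(size), side="right")
    return [items[i] for i in idx]

print(sample_items([(.3, 'a'), (.4, 'b'), (.3, 'c')], 10))
```

Since the items are only indexed, never compared or converted, they can be any Python objects.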
This may be of marginal benefit but I did it this way:
This is similar to @eat's answer.