How to do weighted random sampling of categories in Python

Posted on 2024-11-16 16:08:11

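Given a list of tuples where each tuple consists of a probability and an item, I'd like to sample an item according to its probability. For example, given the list [ (.3, 'a'), (.4, 'b'), (.3, 'c')] I'd like to sample 'b' 40% of the time.

What's the canonical way of doing this in Python?

I've looked at the random module, which doesn't seem to have an appropriate function, and at numpy.random, which has a multinomial function but doesn't seem to return the results in a nice form for this problem. I'm basically looking for something like mnrnd in MATLAB.

Many thanks.

Thanks for all the quick answers. To clarify, I'm not looking for explanations of how to write a sampling scheme. I'd rather be pointed to an easy way to sample from a multinomial distribution given a set of objects and weights, or be told that no such function exists in a standard library and that one should write one's own.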

亢潮 2024-11-23 16:08:11

This might do what you want:

import numpy
numpy.array([.3,.4,.3]).cumsum().searchsorted(numpy.random.sample(5))
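The one-liner above returns integer indices rather than the items themselves. Here is a minimal sketch of mapping those indices back to labels, assuming the question's list of (probability, item) pairs. The names `draw`, `pairs` and the `seed` parameter are illustrative and not part of the answer.

```python
import numpy as np

def draw(pairs, size, seed=None):
    """Draw `size` items from (probability, item) pairs via inverse-CDF lookup."""
    probs, items = zip(*pairs)
    cdf = np.cumsum(probs)
    cdf[-1] = 1.0  # assumption: probs sum to 1; this guards against rounding drift
    rng = np.random.default_rng(seed)
    u = rng.random(size)                          # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right")   # first i with cdf[i] > u
    return np.asarray(items)[idx]

samples = draw([(.3, 'a'), (.4, 'b'), (.3, 'c')], size=10_000, seed=0)
```

Using side="right" means an item with probability 0 can never be selected, because its segment of the unit interval is empty.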

难得心□动 2024-11-23 16:08:11

Since nobody used the numpy.random.choice function, here's one that will generate what you need in a single, compact line:

import numpy
numpy.random.choice(['a','b','c'], size = 20, p = [0.3,0.4,0.3])
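As a sketch of how this could be applied to the question's input format, the pairs can be unzipped into probabilities and items first. This version uses NumPy's newer Generator interface (`numpy.random.default_rng`), which is a substitution for illustration; the variable names are assumptions.

```python
import numpy as np

pairs = [(.3, 'a'), (.4, 'b'), (.3, 'c')]
probs, items = zip(*pairs)       # split into (.3, .4, .3) and ('a', 'b', 'c')

rng = np.random.default_rng()    # Generator API; pass a seed for reproducibility
samples = rng.choice(items, size=20, p=probs)
```

Note that `p` must sum to 1, otherwise `choice` raises a ValueError.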

半﹌身腐败 2024-11-23 16:08:11

import numpy

n = 1000
pairs = [(.3, 'a'), (.3, 'b'), (.4, 'c')]
# zip() returns an iterator in Python 3 and cannot be indexed, so unpack it
probs, items = zip(*pairs)
counts = numpy.random.multinomial(n, probs)
result = list(zip(counts, items))
# e.g. [(299, 'a'), (299, 'b'), (402, 'c')]
[item * int(count) for count, item in result]
# e.g. ['aaa...', 'bbb...', 'ccc...'], each letter repeated `count` times

How exactly would you like to receive the results?

_失温 2024-11-23 16:08:11

There are hacks you can do if, for example, your probabilities fit nicely into percentages.

For example, if you're fine with percentages, the list-based approach at the end of this answer will work (at the cost of a high memory overhead).

But the "real" way to do it with arbitrary float probabilities is to sample from the cumulative distribution after constructing it. This is equivalent to subdividing the unit interval [0,1] into 3 line segments labelled 'a', 'b', and 'c', then picking a random point on the unit interval and seeing which line segment it falls in.

#!/usr/bin/python3
import random
from collections import Counter  # only needed for the docstring example

def randomCategory(probDict):
    """
        >>> dist = {'a':.1, 'b':.2, 'c':.3, 'd':.4}

        >>> [randomCategory(dist) for _ in range(5)]  # output varies per run
        ['c', 'c', 'a', 'd', 'c']

        >>> Counter(randomCategory(dist) for _ in range(10**5))  # approximate counts
        Counter({'d': 40127, 'c': 29975, 'b': 19873, 'a': 10025})
    """
    r = random.random() # range: [0,1)
    total = 0           # range: [0,1]
    for value,prob in probDict.items():
        total += prob
        if total>r:
            return value
    raise Exception('distribution not normalized: {probs}'.format(probs=probDict))

One has to be careful of methods which return values even if their probability is 0. Fortunately this method does not, but just in case, one could insert if prob==0: continue.

For the record, here's the hackish way to do it:

import random

def makeSampler(probDict):
    """
        >>> sampler = makeSampler({'a':0.3, 'b':0.4, 'c':0.3})
        >>> sampler()  # output varies per run
        'a'
        >>> sampler()
        'c'
    """
    # round() yields the integer repeat count that list multiplication requires
    oneHundredElements = sum(([val]*round(prob*100) for val,prob in probDict.items()), [])
    def sampler():
        return random.choice(oneHundredElements)
    return sampler

However, if you don't have resolution issues, this is probably the fastest way possible. =)
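When many draws are needed from the same distribution, the cumulative sums can be built once and each draw resolved with a binary search. The following is a minimal sketch of that idea using only the standard library. The name `make_cdf_sampler` is hypothetical, and scaling by the total is an added assumption so that unnormalized weights are tolerated.

```python
import bisect
import itertools
import random

def make_cdf_sampler(prob_dict):
    """Return a function that samples keys of prob_dict by binary search on the CDF."""
    # Skip zero-probability entries so they can never be returned.
    items = [key for key, prob in prob_dict.items() if prob > 0]
    cdf = list(itertools.accumulate(prob_dict[key] for key in items))
    total = cdf[-1]  # scaling by the total tolerates unnormalized weights

    def sample():
        r = random.random() * total           # uniform in [0, total)
        i = bisect.bisect_right(cdf, r)       # first i with cdf[i] > r
        return items[min(i, len(items) - 1)]  # clamp guards against rounding at the top edge
    return sample

sample = make_cdf_sampler({'a': .1, 'b': .2, 'c': .3, 'd': .4})
draws = [sample() for _ in range(10_000)]
```

Each call to `sample()` costs O(log n) instead of the O(n) linear scan, which only matters when the number of categories is large.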

丑丑阿 2024-11-23 16:08:11

How about creating a list with 3 'a's, 4 'b's and 3 'c's, and then picking one at random? With enough iterations you will get the desired probabilities.
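A minimal sketch of this suggestion, assuming the weights can be expressed as small integers. The names `weights`, `pool` and `draws` are illustrative.

```python
import random
from collections import Counter

# Integer weights: 3 parts 'a', 4 parts 'b', 3 parts 'c'.
weights = {'a': 3, 'b': 4, 'c': 3}
pool = [item for item, weight in weights.items() for _ in range(weight)]

# Each draw picks uniformly from the pool, so 'b' comes up about 40% of the time.
draws = [random.choice(pool) for _ in range(10_000)]
counts = Counter(draws)
```

The pool grows with the resolution of the probabilities, so this works best when the weights reduce to small whole numbers.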

国粹 2024-11-23 16:08:11

I reckon the multinomial function is still a fairly easy way to get samples of a distribution in random order. This is just one way:

import numpy

def getSamples(input, size):
    probabilities, items = zip(*input)
    sampleCounts = numpy.random.multinomial(size, probabilities)
    samples = numpy.array(tuple(countsToSamples(sampleCounts, items)))
    numpy.random.shuffle(samples)
    return samples

def countsToSamples(counts, items):
    # zip and range replace the Python 2-only izip and xrange
    for value, repeats in zip(items, counts):
        for _i in range(repeats):
            yield value

Here input is as specified in the question, [(.3, 'a'), (.4, 'b'), (.3, 'c')], and size is the number of samples you need.
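The count-expansion step can also be expressed with `numpy.repeat`, which avoids the helper generator. This is a sketch of the same idea rather than the answer's own code; the name `get_samples` and the `seed` parameter are assumptions.

```python
import numpy as np

def get_samples(pairs, size, seed=None):
    """Draw per-item counts from a multinomial, expand them, then shuffle."""
    probs, items = zip(*pairs)
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(size, probs)            # how often each item occurs
    samples = np.repeat(np.asarray(items), counts)   # e.g. ['a', 'a', ..., 'c']
    rng.shuffle(samples)                             # randomize the order in place
    return samples

shuffled = get_samples([(.3, 'a'), (.4, 'b'), (.3, 'c')], size=100, seed=1)
```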

打小就很酷 2024-11-23 16:08:11

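I'm not sure if this is the pythonic way of doing what you ask, but you could use
random.sample(['a','a','a','b','b','b','b','c','c','c'], k)
where k is the number of samples you want. Note that random.sample draws without replacement, so k cannot exceed the length of the list (10 here) and the draws are not independent.

For a more robust method, split the unit interval into sections based on the cumulative probability and draw from the uniform distribution on (0,1) using random.random(). In this case the subintervals would be (0,.3), (.3,.7), and (.7,1). You choose the element based on which subinterval the draw falls into.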
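For drawing k independent samples, a short sketch using the standard library's `random.choices`, assuming Python 3.6 or later. It accepts the weights directly and samples with replacement, so k is not limited by the population size. The variable names are illustrative.

```python
import random

pairs = [(.3, 'a'), (.4, 'b'), (.3, 'c')]
probs, items = zip(*pairs)

k = 20
# random.choices samples with replacement; the weights need not sum to 1.
samples = random.choices(items, weights=probs, k=k)
```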

断念 2024-11-23 16:08:11

Inspired by sholte's very straightforward (and correct) answer, I'll just demonstrate how easy it is to extend it to handle arbitrary items, like:

In []: from numpy import array, arange, histogram
In []: from numpy.random import sample, multinomial
In []: s= array([.3, .4, .3]).cumsum().searchsorted(sample(54))
In []: c, _= histogram(s, bins= arange(4))
In []: [item* c[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccccccc']

Update:
Based on the feedback of phant0m, it turns out that an even more straightforward solution can be implemented based on multinomial, like:

In []: s= multinomial(54, [.3, .4, .3])
In []: [item* s[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccc']

IMHO this gives a nice summary: sampling via the empirical CDF and sampling via multinomial yield similar results. So, in summary, pick whichever best suits your purposes.

雨后彩虹 2024-11-23 16:08:11

This may be of marginal benefit, but I did it this way:

import numpy as np
import scipy.stats as sps
N=1000
M3 = sps.multinomial.rvs(1, p = [0.3,0.4,0.3], size=N, random_state=None)
M3a = [ np.where(r==1)[0][0] for r in M3 ] # convert 1-hot encoding to integers

This is similar to @eat's answer.
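The one-hot conversion can also be done in a single vectorized step with `argmax`. The sketch below substitutes NumPy's own multinomial for the SciPy call so that it depends on NumPy alone; that substitution and the names `one_hot`, `indices` and `labels` are assumptions for illustration.

```python
import numpy as np

N = 1000
rng = np.random.default_rng()
# One trial per row: each row is a 1-hot vector of length 3.
one_hot = rng.multinomial(1, [0.3, 0.4, 0.3], size=N)
indices = one_hot.argmax(axis=1)             # position of the 1 in each row
labels = np.array(['a', 'b', 'c'])[indices]  # map indices to the items
```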
