python/numpy - conditional sampling of variables, where the distribution of subsequent values depends on previously drawn values

Posted 2025-02-12 08:50:50


I am trying to generate a random sample of multiple variables which are loosely related to each other. Meaning that "allowed" values of some variables depend on the value which is set for another variable.

For simplicity let's imagine that I have just two variables - A and B - and let's say that both of them have a uniform or Gaussian distribution (we don't really care which exact distribution they follow and can accept both). For the discussion, let's assume both have a uniform distribution.

Let's say that variable A can take any value between 0 and 100. We can easily sample from this distribution, say, 1000 data points.

Now, we also want to generate values for variable B, which can take any value between, say, 50 and 150. The catch here is that there is a constraint on the resulting sample - the sum of the values A and B must be between 60 and 160.

The final catch is that each time we run the sampling process the precise boundaries change (for example, in one case A can be between 0 and 100 as above, the next day it needs to be between -10 and 75, etc.). Basically, from day to day the precise boundaries of the sampling evolve.

Right now we do it in a very inefficient way - generate a completely random grid of A and B values independently, then eliminate all of the A and B combinations which don't satisfy the constraints we specify, and then use the survivors in subsequent steps. For example, such a grid could look like:

[image: example grid of independently sampled A and B values]
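In code terms, the current approach is roughly the following (a minimal sketch; `n_grid` and the hard-coded bounds are illustrative, not exact values from our setup):

import numpy as np

rng = np.random.default_rng()

n_grid = 1_000_000  # oversized so that some points survive the filtering
a = rng.uniform(0, 100, size=n_grid)   # independent draws for A
b = rng.uniform(50, 150, size=n_grid)  # independent draws for B
mask = (a + b > 60) & (a + b < 160)    # keep only combinations meeting the constraint
survivors = np.column_stack((a[mask], b[mask]))  # row count varies run to run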

However, as you can guess, this is super inefficient. In reality we have a lot of variables (30+) and a large set of constraints to apply. Completely random generation of the grid leads to instances where, after applying all constraints, we end up with no points satisfying them unless we use a large enough sample size - and to ensure we always have at least some points, we need to generate a grid with millions of points. Beyond that, each time we re-run the sampling procedure we get a different resulting dataset - sometimes all points get eliminated, sometimes we get 10 points, and sometimes 1000.

So my question is - is there a way to do this more efficiently, in a "statistically correct" way, ideally one which lets us specify how many sample points satisfying all constraints we want to end up with? Any guidance or pointers to code examples would be much appreciated.


Comments (1)

稀香 2025-02-19 08:50:50


I'm not sure there is an entirely different approach to what you are doing (which is a kind of rejection sampling). But you could definitely do it more efficiently than you describe, e.g. by not generating lots of combinations up front and rejecting them all at once afterwards, but instead generating and filtering in batches until enough points are accepted.

Maybe this could help:

Define boundaries of your variables, and a function that evaluates the constraints that you put on them. Here I am using the values from your example. More variables and constraints can be added easily.

import numpy as np
import scipy.stats

minima = [0, 50]
maxima = [100, 150]


def constraints(a, b):
    # inputs are arrays of random numbers, one per variable
    # returns a boolean mask for indexing
    return ((a + b) > 60) & ((a + b) < 160)
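For instance, `constraints(np.array([10]), np.array([60]))` returns `array([ True])` because the sum 70 lies strictly between 60 and 160, while `constraints(np.array([10]), np.array([50]))` returns `array([False])`.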

Then you could generate batches of random numbers and evaluate in a vectorised way whether the constraints are fulfilled. Depending on the dimensionality and complexity of your constraints this might reject plenty of values, but at least you don't store them all in advance and you can define the desired number of samples.

def sample_numbers(constraints, num_samples, minima, maxima, batch=10000):
    # buffer is oversized by one batch so a fully accepted batch still fits
    samples = np.zeros(shape=(num_samples + batch, len(minima)), dtype='int64')
    n_accept = 0
    while n_accept < num_samples:
        # sample from discrete uniform distributions (high is exclusive)
        a = scipy.stats.randint.rvs(low=minima[0], high=maxima[0], size=batch)
        b = scipy.stats.randint.rvs(low=minima[1], high=maxima[1], size=batch)
        # vectorised check where the constraints are fulfilled
        evaluate_constraints = constraints(a, b)
        # total number of accepted combinations after this batch
        n_accept_update = n_accept + np.count_nonzero(evaluate_constraints)
        # transfer accepted combinations into the buffer
        samples[n_accept:n_accept_update] = np.stack((a[evaluate_constraints], b[evaluate_constraints])).T
        n_accept = n_accept_update
    return samples[:num_samples]

sampled_numbers = sample_numbers(constraints=constraints, num_samples=100000, minima=minima, maxima=maxima, batch=1000)
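One caveat: `scipy.stats.randint` draws integers (with an exclusive upper bound), while the question describes continuous uniform variables. A continuous variant of the same batch-wise rejection loop is a small change - the sketch below uses `numpy.random.Generator.uniform` and is my adaptation, not part of the answer above:

import numpy as np

def sample_numbers_continuous(constraints, num_samples, minima, maxima, batch=10000, seed=None):
    rng = np.random.default_rng(seed)
    samples = np.zeros((num_samples + batch, len(minima)))
    n_accept = 0
    while n_accept < num_samples:
        # one continuous-uniform column per variable; low/high broadcast per column
        draws = rng.uniform(low=minima, high=maxima, size=(batch, len(minima)))
        accepted = draws[constraints(draws[:, 0], draws[:, 1])]
        samples[n_accept:n_accept + len(accepted)] = accepted
        n_accept += len(accepted)
    return samples[:num_samples]

Either way the loop terminates with exactly num_samples accepted points; if the constraints carve out a very small region it simply takes more batches, so monitoring the per-batch acceptance rate is a cheap way to spot a badly conditioned constraint set.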
