高效的项目分箱算法（itertools/numpy）

发布于 2024-11-25 05:28:19 字数 1020 浏览 1 评论 0原文

我认为这是一个常见的组合问题，但我似乎找不到它的名称或任何有关它的材料。我正在 Python 和 numpy 中执行此操作，但如果有一个快速矩阵方法，我可能可以翻译。

基本上，给定 n 个物品，我需要生成将它们放入 m 个箱子中的所有方法。例如，将 4 个项目分装到 3 个容器中将得到类似 [(4, 0, 0), (3, 1, 0), (3, 0, 1), (2, 2, 0), (2,1,1),...]。这是一个总量固定的产品。

使用 itertools 实现这一点很简单。

import itertools

def fixed_total_product(bins, num_items):
""" Return iterator of all item binning possibilities. """
return itertools.ifilter(lambda combo: sum(combo) == num_items,
                         itertools.product(xrange(num_items + 1), repeat=bins))

不幸的是，我认为在循环中进行后续计算效率很低。稍后将其作为 2D numpy 数组使用会更快，但我无法找到一种有效的方法来构建数组。我可以迭代 ifilter 结果，构建可能性列表，并使用它来构造数组，但这似乎是一个巨大的浪费。

我猜测最好的方法是“以 numpy 方式”构建所有内容，但我不确定如何做到这一点。 stackoverflow 上有一个快速的产品实现：使用numpy构建两个数组的所有组合的数组。我猜你只能修改它以输出具有正确总和的产品。数组的大小应为 ((m-1) + n) 选择 n，因为有 m-1 个 bin 边界。

有什么想法吗？基准非常受欢迎，但不是必需的。

原文

I think this is a common combinatorics problem, but I can't seem to find a name for it or any material about it. I am doing this in Python and numpy, but if there is a fast matrix method for this, I can probably translate.

Basically, given n items, I need to generate all ways to put them in m bins. As an example, binning 4 items into 3 bins would give something like [(4, 0, 0), (3, 1, 0), (3, 0, 1), (2, 2, 0), (2, 1, 1), ...]. This is a product with a fixed total.

Implementing this with itertools is straightforward.

import itertools

def fixed_total_product(bins, num_items):
""" Return iterator of all item binning possibilities. """
return itertools.ifilter(lambda combo: sum(combo) == num_items,
                         itertools.product(xrange(num_items + 1), repeat=bins))

Unfortunately, I think doing subsequent calculations in loops will be inefficient. Working with this as a 2D numpy array would be faster later on, but I can't figure out an efficient way to build an array with this. I could iterate over the ifilter result, building a list of the possibilities, and use this to construct the array, but that seems like a huge waste.

I'm guessing the best way to do this is to build everything "the numpy way", but I'm not sure how to do this. There is a speedy product implementation on stackoverflow: Using numpy to build an array of all combinations of two arrays. I'm guessing you can modify this only to output products with the right sum. The size of the array should be ((m-1) + n) choose n, since there are m-1 bin boundaries.

Any ideas? Benchmarks much appreciated, but not required.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

瑕疵 2024-12-02 05:28:19

基于递归 C(n, k) = C(n - 1, k) + C(n - 1, k - 1)，使用 numpy 运算进行记忆。

import numpy as np

def binnings(n, k, cache={}):
    if n == 0:
        return np.zeros((1, k))
    if k == 0:
        return np.empty((0, 0))
    args = (n, k)
    if args in cache:
        return cache[args]
    a = binnings(n - 1, k, cache)
    a1 = a + (np.arange(k) == 0)
    b = binnings(n, k - 1, cache)
    b1 = np.hstack((np.zeros((b.shape[0], 1)), b))
    result = np.vstack((a1, b1))
    cache[args] = result
    return result

if __name__ == '__main__':
    import timeit
    print timeit.timeit('binnings(20, 5, {})', setup='from __main__ import binnings', number=1)

(20, 5) 的时间（以秒为单位）：

0.0129251480103

Based on the recursion C(n, k) = C(n - 1, k) + C(n - 1, k - 1), memoized, using numpy operations.

import numpy as np

def binnings(n, k, cache={}):
    if n == 0:
        return np.zeros((1, k))
    if k == 0:
        return np.empty((0, 0))
    args = (n, k)
    if args in cache:
        return cache[args]
    a = binnings(n - 1, k, cache)
    a1 = a + (np.arange(k) == 0)
    b = binnings(n, k - 1, cache)
    b1 = np.hstack((np.zeros((b.shape[0], 1)), b))
    result = np.vstack((a1, b1))
    cache[args] = result
    return result

if __name__ == '__main__':
    import timeit
    print timeit.timeit('binnings(20, 5, {})', setup='from __main__ import binnings', number=1)

Time in seconds for (20, 5):

0.0129251480103

回复收藏 0 原文

我是有多爱你 2024-12-02 05:28:19

在 numpy 中使用一些不同的技巧可能有一种更快的方法。 numpy.indices 是您想要开始的地方。一旦将它与 rollaxis 结合起来，它本质上就相当于 itertools.product。 Sven Marnach 在这个问题中的回答就是一个很好的例子（他的最后一个问题中有一个小错误）但是，您想要使用的示例应该是 numpy.indices((len(some_list) + 1), * some_length...)

但是，对于这样的东西，使用 itertools 可能

比显式将内容转换为列表更具可读性，特别是如果您给出迭代器中的项目数，这是主要优点。是使用相当少的内存，这在大型迭代器的情况下非常有用，

下面是一些使用 numpy.indices 技巧和将迭代器转换为 numpy 数组的各种方法的示例：

import itertools
import numpy as np
import scipy.special


def fixed_total_product(bins, num_items):
    return itertools.ifilter(lambda combo: sum(combo) == num_items,
            itertools.product(xrange(num_items + 1), repeat=bins))

def fixed_total_product_fromiter(bins, num_items):
    size = scipy.special.binom(bins - 1 + num_items, num_items)
    combinations = fixed_total_product(bins, num_items)
    indicies = (idx for row in combinations for idx in row)
    arr = np.fromiter(indicies, count=bins * int(size), dtype=np.int)
    return arr.reshape(-1, bins)

def fixed_total_product_fromlist(bins, num_items):
    return np.array(list(fixed_total_product(bins, num_items)), dtype=np.int)

def fixed_total_product_numpy(bins, num_items):
    arr = np.rollaxis(np.indices((num_items + 1,) * bins), 0, bins + 1)
    arr = arr.reshape(-1, bins)
    arr = np.arange(num_items + 1)[arr]
    sums = arr.sum(axis=1)
    return arr[sums == num_items]

m, n = 5, 20

if __name__ == '__main__':
    import timeit
    list_time = timeit.timeit('fixed_total_product_fromlist(m, n)',
            setup='from __main__ import fixed_total_product_fromlist, m, n',
            number=1)
    iter_time = timeit.timeit('fixed_total_product_fromiter(m, n)',
            setup='from __main__ import fixed_total_product_fromiter, m, n',
            number=1)
    numpy_time = timeit.timeit('fixed_total_product_numpy(m, n)',
            setup='from __main__ import fixed_total_product_numpy, m, n',
            number=1)

    print 'All combinations of {0} items drawn from a set of {1} items...'.format(m,n)
    print '  Converting to a list and then an array: {0} sec'.format(list_time)
    print '  Using fromiter: {0} sec'.format(iter_time)
    print '  Using numpy.indices: {0} sec'.format(numpy_time)

至于时机：

All combinations of 5 items drawn from a set of 20 items...
  Converting to a list and then an array: 2.75901389122 sec
  Using fromiter: 2.10619592667 sec
  Using numpy.indices: 1.44955015182 sec

您会注意到它们都不是特别快，

您使用的是强力算法（生成所有可能的组合然后过滤它们），我只是将其复制到基于 numpy 的示例中。

可能有一种更有效的方法来做到这一点！然而，我无论如何都不是计算机科学或数学专家，所以我不知道是否有一个众所周知的算法可以在不首先生成所有可能组合的情况下执行此操作......

无论如何，祝你好运！

There probably is a faster way using a few different tricks in numpy. numpy.indices is where you want to start. It's essentially the equivalent of itertools.product, once you combine it with rollaxis. Sven Marnach's answer in this question is an excellent example of this (There's a minor error in his last example, however, which is what you want to use. It should be numpy.indices((len(some_list) + 1), * some_length...)

However, for something like this, it's likely to be more readable using itertools.

numpy.fromiter is a bit faster than explicitly converting things to a list, especially if you give it a count of the number of items in the iterator. The main advantage is that is uses considerably less memory, which can be very helpful in the case of large iterators.

Here are some examples using both the numpy.indices trick and various ways of converting the iterator to a numpy array:

import itertools
import numpy as np
import scipy.special


def fixed_total_product(bins, num_items):
    return itertools.ifilter(lambda combo: sum(combo) == num_items,
            itertools.product(xrange(num_items + 1), repeat=bins))

def fixed_total_product_fromiter(bins, num_items):
    size = scipy.special.binom(bins - 1 + num_items, num_items)
    combinations = fixed_total_product(bins, num_items)
    indicies = (idx for row in combinations for idx in row)
    arr = np.fromiter(indicies, count=bins * int(size), dtype=np.int)
    return arr.reshape(-1, bins)

def fixed_total_product_fromlist(bins, num_items):
    return np.array(list(fixed_total_product(bins, num_items)), dtype=np.int)

def fixed_total_product_numpy(bins, num_items):
    arr = np.rollaxis(np.indices((num_items + 1,) * bins), 0, bins + 1)
    arr = arr.reshape(-1, bins)
    arr = np.arange(num_items + 1)[arr]
    sums = arr.sum(axis=1)
    return arr[sums == num_items]

m, n = 5, 20

if __name__ == '__main__':
    import timeit
    list_time = timeit.timeit('fixed_total_product_fromlist(m, n)',
            setup='from __main__ import fixed_total_product_fromlist, m, n',
            number=1)
    iter_time = timeit.timeit('fixed_total_product_fromiter(m, n)',
            setup='from __main__ import fixed_total_product_fromiter, m, n',
            number=1)
    numpy_time = timeit.timeit('fixed_total_product_numpy(m, n)',
            setup='from __main__ import fixed_total_product_numpy, m, n',
            number=1)

    print 'All combinations of {0} items drawn from a set of {1} items...'.format(m,n)
    print '  Converting to a list and then an array: {0} sec'.format(list_time)
    print '  Using fromiter: {0} sec'.format(iter_time)
    print '  Using numpy.indices: {0} sec'.format(numpy_time)

As for timing:

All combinations of 5 items drawn from a set of 20 items...
  Converting to a list and then an array: 2.75901389122 sec
  Using fromiter: 2.10619592667 sec
  Using numpy.indices: 1.44955015182 sec

You'll notice that none of them are particularly fast.

You're using a brute-force algorithm (generate all possible combinations and then filter them), and I'm just copying it in the numpy-based example here.

There is probably a much more efficient way to do this! However, I'm not a CS or math person by any means, so I don't know if there's a well-known algorithm to do this without generating all possible combinations first...

Good luck, at any rate!

回复收藏 0 原文

以为你会在 2024-12-02 05:28:19

我知道我在这里恢复了一个相当旧的线程，但我希望一个好的解决方案仍然会受到赞赏（即使不再被OP所赞赏）。

问题本身类似于在代码中表示多元多项式。想想例如（对应于 3 个容器中有 4 个项目的示例）您通过展开 (x+y+z)^4 得到的多项式，类似于 x^4*y^ 0*z^0 + 4*x^3*y^1*z^0 + ... 以及 x、y 和 z 的指数是数组的元素：

[4,0,0],
[3,1,0],
[3,0,1],
[2,2,0],
[2,1,1],
[2,0,2],
[1,3,0],
[

如果仔细观察，您会发现看到例如右边的元素3 是 (y+z)^1 的指数，2 右侧是 (y+ z)^2 等等。更一般地说，在 p 的右侧有 (y+z)^(4-p) 的指数。这表明了一些递归结构：假设 e(m,p) 表示 m 变量的指数 (=bins) 和 p 总指数 (=items)，那么你得到通过选择从 0 到 p 的第一个指数 q 以及其余 m-1 的指数来获取所有这些> 变量由下式给出e(m-1,pq)。

在 python 和 numpy 中，您可以这样表述：

def multiindex_exact(m, p):
    if m == 0:
        return np.zeros((1 if p==0 else 0, 0), np.int8)
    else:
        I = np.zeros((0, m), np.int8)
        for q in reversed(range(0, p + 1)):
            J = multiindex_exact(m - 1, p - q)
            Jn = np.full((J.shape[0], 1), q)
            I = np.vstack((I, np.hstack((Jn, J))))
        return I

当然，您可以通过预先计算数组大小并直接填充值来提高效率。

如果您需要将最多的p个项目分配到m个容器中，但容器的总和始终为><=p，您可以使用以下类似的代码。

def multiindex_lower(m, p):
    if m == 0:
        return np.zeros((1, 0), np.int8)
    else:
        I = np.zeros((0, m), np.int8)
        for q in range(0, p + 1):
            J = multiindex_lower(m - 1, q)
            Jn = q - J.sum(1).reshape((J.shape[0], 1))
            I = np.vstack((I, np.hstack((J, Jn))))
        return I

I know that I'm reviving a pretty old thread here, but I hope a good solution will be still appreciated (even if not by the OP anymore).

The problem itself is akin to representing multivariate polynomials in code. Think e.g. (corresponding to the example with 4 items in 3 bins) of the polynomial you get from expanding (x+y+z)^4 which is something like x^4*y^0*z^0 + 4*x^3*y^1*z^0 + ... and the exponents to x, y, and z are the elements of the array:

[4,0,0],
[3,1,0],
[3,0,1],
[2,2,0],
[2,1,1],
[2,0,2],
[1,3,0],
[

If you look closely, you see that e.g. the elements to the right of the 3 are the exponents of (y+z)^1, to the right of 2 are the exponents of (y+z)^2 and so on. More general, to the right of p you have the exponents of (y+z)^(4-p). This indicates some recursive structure: Let's say e(m,p) indicates the exponents for m variables (=bins) and p total exponent (=items), then you get all of them by choosing the first exponent q from 0 to p, and the exponents of the remaining m-1 variables are given by e(m-1,p-q).

In python with numpy you can formulate that like this:

def multiindex_exact(m, p):
    if m == 0:
        return np.zeros((1 if p==0 else 0, 0), np.int8)
    else:
        I = np.zeros((0, m), np.int8)
        for q in reversed(range(0, p + 1)):
            J = multiindex_exact(m - 1, p - q)
            Jn = np.full((J.shape[0], 1), q)
            I = np.vstack((I, np.hstack((Jn, J))))
        return I

Of course, you can make that more efficient by precalculating array sizes, and filling in the values directly.

If you need to distribute not exactly, but a maximum of p items into m bins, so that the sum over the bins is always <=p, you can use the following similar code.

def multiindex_lower(m, p):
    if m == 0:
        return np.zeros((1, 0), np.int8)
    else:
        I = np.zeros((0, m), np.int8)
        for q in range(0, p + 1):
            J = multiindex_lower(m - 1, q)
            Jn = q - J.sum(1).reshape((J.shape[0], 1))
            I = np.vstack((I, np.hstack((J, Jn))))
        return I

回复收藏 0 原文

~没有更多了~