Python: group by array a and sum over array b - Performance
Given two unordered arrays a and b of the same length:
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
I'd like to group by the elements in a:
aResult = [7,3,5]
summing over the elements in b (Example used to summarize a probability density function):
bResult = [0.2 + 0.1 + 0.2, 0.1, 0.3 + 0.1] = [0.5, 0.1, 0.4]
Alternatively, generate random a and b in Python:
import numpy as np
a = np.random.randint(1,10,10000)
b = np.array([1./len(a)]*len(a))
I have two approaches, which are surely far from the lower performance bound.
Approach 1 (at least nice and short): Time: 0.769315958023
def approach_1(a, b):
    aResult = np.unique(a)
    bResult = [sum(b[a == i]) for i in aResult]
    return aResult, bResult
Approach 2 (itertools.groupby, horribly slow) Time: 4.65299129486
from itertools import groupby

def approach_2(a, b):
    tmp = [(a[i], b[i]) for i in range(len(a))]
    tmp2 = np.array(tmp, dtype=[('a', float), ('b', float)])
    tmp2 = np.sort(tmp2, order='a')
    bResult = []
    aResult = []
    for key, group in groupby(tmp2, lambda x: x[0]):
        aResult.append(key)
        bResult.append(sum(i[1] for i in group))
    return aResult, bResult
Update: Approach 3, by Pablo. Time: 1.0265750885
from collections import defaultdict

def approach_Pablo(a, b):
    pdf = defaultdict(int)
    for x, y in zip(a, b):
        pdf[x] += y
    return pdf
Update: Approach 4, by Unutbu. Time: 0.184849023819 [WINNER SO FAR, but only for integer a]
def unique_Unutbu(a, b):
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult
Maybe someone finds a smarter solution to this problem than me :)
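For context, the timings quoted above presumably came from a harness along these lines (the repeat count and the use of time.time are my assumptions, not from the original post):

```python
import time

import numpy as np

a = np.random.randint(1, 10, 10000)
b = np.full(len(a), 1.0 / len(a))

def unique_Unutbu(a, b):
    # Approach 4 from above: accumulate weights per integer key.
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult

start = time.time()
for _ in range(100):
    aResult, bResult = unique_Unutbu(a, b)
print('total time: %.6f s' % (time.time() - start))
```

Since b was built as a uniform probability mass, the grouped sums should always total 1.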
Answers (3)
Here's an approach similar to @unutbu's one:

It allows a non-integer dtype for the a array. It allows large values in the a array. It returns the elements of a in sorted order; if desired, it is easy to restore the original order using the return_index argument of the np.unique() function. Its performance worsens as the number of unique elements in a rises; it is 4 times slower than @unutbu's version on the data from your question.

I've also made a performance comparison with three additional methods. The leaders are: for integer arrays, a hash-based implementation in Cython; for double arrays (for input size 10000), a sort-based implementation, also in Cython.
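This answer's code did not survive extraction; the sort-based idea it describes can be sketched as follows (the np.add.reduceat formulation is my reconstruction, not necessarily the answer's exact implementation):

```python
import numpy as np

def sum_by_group(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # Sort both arrays by the keys in a (stable sort keeps ties in input order).
    order = np.argsort(a, kind='mergesort')
    a_sorted = a[order]
    b_sorted = b[order]
    # return_index gives the start position of each run of equal keys.
    aResult, first = np.unique(a_sorted, return_index=True)
    # Sum each run of b values between consecutive run starts.
    bResult = np.add.reduceat(b_sorted, first)
    return aResult, bResult

a = [7, 3, 5, 7, 5, 7]
b = [0.2, 0.1, 0.3, 0.1, 0.1, 0.2]
aResult, bResult = sum_by_group(a, b)
# aResult: [3 5 7], bResult: [0.1 0.4 0.5]
```

Unlike np.bincount, this works for float (and other sortable) keys and never allocates more than O(n) memory, at the cost of an O(n log n) sort.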
If a is composed of ints < 2**31-1 (that is, if a has values that can fit in dtype int32), then you could use np.bincount with weights, as in unique_Unutbu in the question.

Note that np.unique(a) returns [3 5 7], so the result appears in a different (sorted) order than in your example.

One potential problem with using np.bincount is that it returns an array whose length is equal to the maximum value in a. If a contains even one element with a value near 2**31-1, then bincount would have to allocate an array of size 8*(2**31-1) bytes (or 16 GiB).

So np.bincount might be the fastest solution for arrays a which have big length but not big values. For arrays a which have small length (and big or small values), using a collections.defaultdict would probably be faster.

Edit: See J.F. Sebastian's solution for a way around the integer-values-only restriction and the big-values problem.
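A minimal run of the bincount idea on the question's example data (my own check; the answer's original snippet was lost in extraction):

```python
import numpy as np

a = np.array([7, 3, 5, 7, 5, 7])
b = np.array([0.2, 0.1, 0.3, 0.1, 0.1, 0.2])

# bincount allocates one slot per integer value 0..max(a);
# weights=b accumulates the b values instead of plain counts.
x = np.bincount(a, weights=b)
aResult = np.unique(a)   # sorted unique keys: [3 5 7]
bResult = x[aResult]     # their summed weights: [0.1 0.4 0.5]
```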
How about this approach:

You only iterate over each element once and use a dictionary for fast lookup. If you really want two separate arrays as the final result, you can build them from the dictionary afterwards.
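The dictionary code survives in the question as approach_Pablo; building the two arrays at the end might look like this (the final conversion lines are my sketch, since the answer's own snippet was lost):

```python
from collections import defaultdict

import numpy as np

a = [7, 3, 5, 7, 5, 7]
b = [0.2, 0.1, 0.3, 0.1, 0.1, 0.2]

# Accumulate sums of b keyed by the matching element of a.
# defaultdict(int) works here because 0 + float promotes to float.
pdf = defaultdict(int)
for x, y in zip(a, b):
    pdf[x] += y

# If two separate arrays are wanted as the final result:
aResult = np.array(list(pdf.keys()))
bResult = np.array(list(pdf.values()))
```

Note that dict insertion order is preserved (Python 3.7+), so the keys come out in first-seen order ([7, 3, 5] here) rather than sorted order.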