Python: group by array a and sum over array b - Performance
Given two unordered arrays a and b of the same length:
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
I'd like to group by the elements in a:
aResult = [7,3,5]
summing over the elements in b (Example used to summarize a probability density function):
bResult = [0.2 + 0.1 + 0.2, 0.1, 0.3 + 0.1] = [0.5, 0.1, 0.4]
Alternatively, generate random a and b in Python:
import numpy as np
a = np.random.randint(1,10,10000)
b = np.array([1./len(a)]*len(a))
I have two approaches, which are surely far from the lower performance bound.
Approach 1 (at least nice and short): Time: 0.769315958023
def approach_1(a, b):
    aResult = np.unique(a)
    bResult = [sum(b[a == i]) for i in aResult]
    return aResult, bResult
Approach 2 (itertools.groupby, horribly slow) Time: 4.65299129486
from itertools import groupby

def approach_2(a, b):
    tmp = [(a[i], b[i]) for i in range(len(a))]
    tmp2 = np.array(tmp, dtype=[('a', float), ('b', float)])
    tmp2 = np.sort(tmp2, order='a')
    bResult = []
    aResult = []
    for key, group in groupby(tmp2, lambda x: x[0]):
        aResult.append(key)
        bResult.append(sum(i[1] for i in group))
    return aResult, bResult
Update: Approach 3, by Pablo. Time: 1.0265750885
from collections import defaultdict

def approach_Pablo(a, b):
    pdf = defaultdict(int)
    for x, y in zip(a, b):
        pdf[x] += y
    return pdf
Update: Approach 4, by Unutbu. Time: 0.184849023819 [WINNER SO FAR, but only for integer a]
def unique_Unutbu(a, b):
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult
Maybe someone finds a smarter solution to this problem than me :)
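For context, the timings quoted above presumably came from a harness along these lines (the repeat count and the use of time.time are my assumptions, not from the original post):

```python
import time

import numpy as np

a = np.random.randint(1, 10, 10000)
b = np.full(len(a), 1.0 / len(a))

def unique_Unutbu(a, b):
    # Approach 4 from above: accumulate weights per integer key.
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult

start = time.time()
for _ in range(100):
    aResult, bResult = unique_Unutbu(a, b)
print('total time: %.6f s' % (time.time() - start))
```

Since b was built as a uniform probability mass, the grouped sums should always total 1.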
Answers (3)
Here's an approach similar to @unutbu's one:

It allows a non-integer dtype for the a array. It allows large values in the a array. It returns the elements of a in sorted order; if desired, it is easy to restore the original order using the return_index argument of the np.unique() function. Its performance worsens as the number of unique elements in a rises; it is 4 times slower than @unutbu's version on the data from your question.

I've also made a performance comparison with three additional methods. The leaders are: for integer arrays, a hash-based implementation in Cython; for double arrays (for input size 10000), a sort-based implementation, also in Cython.
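This answer's code did not survive extraction; the sort-based idea it describes can be sketched as follows (the np.add.reduceat formulation is my reconstruction, not necessarily the answer's exact implementation):

```python
import numpy as np

def sum_by_group(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # Sort both arrays by the keys in a (stable sort keeps ties in input order).
    order = np.argsort(a, kind='mergesort')
    a_sorted = a[order]
    b_sorted = b[order]
    # return_index gives the start position of each run of equal keys.
    aResult, first = np.unique(a_sorted, return_index=True)
    # Sum each run of b values between consecutive run starts.
    bResult = np.add.reduceat(b_sorted, first)
    return aResult, bResult

a = [7, 3, 5, 7, 5, 7]
b = [0.2, 0.1, 0.3, 0.1, 0.1, 0.2]
aResult, bResult = sum_by_group(a, b)
# aResult: [3 5 7], bResult: [0.1 0.4 0.5]
```

Unlike np.bincount, this works for float (and other sortable) keys and never allocates more than O(n) memory, at the cost of an O(n log n) sort.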
If a is composed of ints < 2**31-1 (that is, if a has values that can fit in dtype int32), then you could use np.bincount with weights, as in unique_Unutbu in the question.

Note that np.unique(a) returns [3 5 7], so the result appears in a different (sorted) order than in your example.

One potential problem with using np.bincount is that it returns an array whose length is equal to the maximum value in a. If a contains even one element with a value near 2**31-1, then bincount would have to allocate an array of size 8*(2**31-1) bytes (or 16 GiB).

So np.bincount might be the fastest solution for arrays a which have big length but not big values. For arrays a which have small length (and big or small values), using a collections.defaultdict would probably be faster.

Edit: See J.F. Sebastian's solution for a way around the integer-values-only restriction and the big-values problem.
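A minimal run of the bincount idea on the question's example data (my own check; the answer's original snippet was lost in extraction):

```python
import numpy as np

a = np.array([7, 3, 5, 7, 5, 7])
b = np.array([0.2, 0.1, 0.3, 0.1, 0.1, 0.2])

# bincount allocates one slot per integer value 0..max(a);
# weights=b accumulates the b values instead of plain counts.
x = np.bincount(a, weights=b)
aResult = np.unique(a)   # sorted unique keys: [3 5 7]
bResult = x[aResult]     # their summed weights: [0.1 0.4 0.5]
```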
How about this approach:

You only iterate over each element once and use a dictionary for fast lookup. If you really want two separate arrays as the final result, you can build them from the dictionary afterwards.
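The dictionary code survives in the question as approach_Pablo; building the two arrays at the end might look like this (the final conversion lines are my sketch, since the answer's own snippet was lost):

```python
from collections import defaultdict

import numpy as np

a = [7, 3, 5, 7, 5, 7]
b = [0.2, 0.1, 0.3, 0.1, 0.1, 0.2]

# Accumulate sums of b keyed by the matching element of a.
# defaultdict(int) works here because 0 + float promotes to float.
pdf = defaultdict(int)
for x, y in zip(a, b):
    pdf[x] += y

# If two separate arrays are wanted as the final result:
aResult = np.array(list(pdf.keys()))
bResult = np.array(list(pdf.values()))
```

Note that dict insertion order is preserved (Python 3.7+), so the keys come out in first-seen order ([7, 3, 5] here) rather than sorted order.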