Grouping by the max or min value in a numpy array
I have two equal-length 1D numpy arrays, `id` and `data`, where `id` is a sequence of repeating, ordered integers that define sub-windows on `data`. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate `data` by grouping on `id` and taking either the max or the min.

In SQL, this would be a typical aggregation query like `SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id`.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.
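The answer's code block is not shown above; a sketch of what this approach likely looked like, using the question's example arrays (ids sorted, as stated):

```python
import numpy as np

# Example arrays from the question; ids are sorted, as stated.
ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# Mark the start of each run of equal ids, then reduce every
# contiguous sub-window in a single vectorized ufunc call.
starts = np.flatnonzero(np.concatenate(([True], ids[1:] != ids[:-1])))
group_max = np.maximum.reduceat(data, starts)
group_min = np.minimum.reduceat(data, starts)
```

The boundary computation mirrors what `numpy.unique` does internally: compare each element with its predecessor and keep the positions where the value changes.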
In pure Python:
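The original code block is missing; a plausible pure-Python sketch with `itertools.groupby`, assuming consecutive equal ids as in the question:

```python
from itertools import groupby

ids = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]

# groupby collapses consecutive runs of equal ids; take the max of
# the data values in each run.
result = [max(v for _, v in run)
          for _, run in groupby(zip(ids, data), key=lambda p: p[0])]
```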
A variation:
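The variation's code is also missing; one likely shape is a dict-based running max, which, unlike `groupby`, does not require the ids to be sorted:

```python
from collections import defaultdict

ids = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]

# Keep a running max per id in a dict, then read it out in id order.
best = defaultdict(lambda: float('-inf'))
for k, v in zip(ids, data):
    best[k] = max(best[k], v)
result = [best[k] for k in sorted(best)]
```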
Based on @Bago's answer:
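The code here is lost as well; presumably it combined `np.unique` with `reduceat`, since for sorted ids `return_index` yields exactly the run boundaries that `reduceat` needs:

```python
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# For sorted ids, np.unique returns each group's first position,
# which is exactly the boundary array maximum.reduceat expects.
keys, starts = np.unique(ids, return_index=True)
group_max = np.maximum.reduceat(data, starts)
```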
If
pandas
is installed:仅使用 numpy 且不使用循环:
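Assuming pandas is available, a sketch of the groupby equivalent (the code block did not survive in the page):

```python
import numpy as np
import pandas as pd

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# groupby('id') mirrors the SQL GROUP BY; the result is indexed by id,
# already in sorted order.
df = pd.DataFrame({'id': ids, 'data': data})
group_max = df.groupby('id')['data'].max()
group_min = df.groupby('id')['data'].min()
```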
With only numpy and without loops:
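This answer's code is also missing; one loop-free, numpy-only way that fits the description is to sort by `(id, data)` and keep the last entry of each id run:

```python
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# Sort by (id, data): within each id block the data is ascending,
# so the last row of each block is that group's max.
order = np.lexsort((data, ids))
sid, sdata = ids[order], data[order]
last = np.concatenate((sid[1:] != sid[:-1], [True]))
group_max = sdata[last]
```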
I'm fairly new to Python and NumPy, but it seems like you can use the `.at` method of ufuncs rather than `reduceat`. For example:
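A sketch of the `.at` approach with the question's arrays (the original example code is not shown):

```python
import numpy as np

data_id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# np.maximum.at performs an unbuffered in-place reduction: repeated
# indices keep updating ans[i] with the running max for group i.
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data)
group_max = ans[1:]  # ids start at 1 here, so slot 0 stays -inf
```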
Of course, this only makes sense if your `data_id` values are suitable for use as indices (i.e. non-negative integers and not huge... presumably if they are large/sparse you could initialize `ans` using `np.unique(data_id)` or something). I should point out that `data_id` doesn't actually need to be sorted.
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface; plus it has a lot more functionality as well:
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster, though, considering the max/min isn't explicitly computed. The accepted solution's ability to ignore nans is neat; but one may also simply assign nan values a dummy key.
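The answer's code is missing; a sketch of the described idea (the helper name is illustrative): argsort the keys with a plain sort, then `reduceat` over the run boundaries with whatever ufunc you like:

```python
import numpy as np

def group_reduce(keys, values, ufunc):
    """Reduce values per group with any ufunc; keys need only be sortable."""
    order = np.argsort(keys, kind='mergesort')  # plain sort, no lexsort
    k, v = keys[order], values[order]
    starts = np.flatnonzero(np.concatenate(([True], k[1:] != k[:-1])))
    return k[starts], ufunc.reduceat(v, starts)

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
keys, group_max = group_reduce(ids, data, np.maximum)
```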
I think this accomplishes what you're looking for:
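The one-liner itself is missing from the page; a reconstruction consistent with the explanation that follows:

```python
ids = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]

# For each distinct id k (sorted), take the max of the data entries
# at the positions val where ids[val] == k.
result = [max(xval for val, xval in enumerate(data) if ids[val] == k)
          for k in sorted(set(ids))]
```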
For the outer list comprehension, from right to left, `set(id)` groups the `id`s, `sorted()` sorts them, `for k ...` iterates over them, and `max` takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: `enumerate(data)` returns both the index and value from `data`, and `if id[val] == k` picks out the `data` members corresponding to `id` `k`.

This iterates over the full `data` list for each `id`. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if `o` is an array of indices into `r`, then `r[o] = x` will fill `r` with the latest value `x` for each value of `o`, such that `r[[0, 0]] = [1, 2]` will result in `r[0] = 2`. It requires that your groups are integers from 0 to the number of groups - 1, as for `numpy.bincount`, and that there is a value for every group:
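A sketch of this trick with the question's data, shifted to 0-based group labels as the approach requires:

```python
import numpy as np

# Group labels shifted to 0..n_groups-1, as this approach requires.
ids = np.array([0, 0, 0, 1, 1, 1, 2, 2])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# Assign data in ascending order through the group indices: later
# (larger) assignments win, leaving the max of each group in place.
order = np.argsort(data)
group_max = np.empty(ids.max() + 1, data.dtype)
group_max[ids[order]] = data[order]

# Reverse the order so the smallest value of each group is written last.
group_min = np.empty(ids.max() + 1, data.dtype)
group_min[ids[order[::-1]]] = data[order[::-1]]
```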