Grouping by the maximum or minimum value in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:

id  data
1     2
1     7
1     3
2     8
2     9
2    10
3     1
3   -10

I would like to aggregate data by grouping on id and taking either the max or the min.

In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.

Is there a way I can avoid Python loops and do this in a vectorized manner?
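
For reference, here is the example as numpy arrays, together with the grouped results the answers below aim to produce (a minimal setup sketch; the variable names follow the question, and note that id shadows the Python builtin):

import numpy as np

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# desired output, ordered by id:
#   max per id -> [7, 10, 1]
#   min per id -> [2, 8, -10]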

琉璃繁缕 2024-12-29 15:23:39

I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a Python loop.

import numpy as np

def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks the first element of each group;
    # after the sort, that element is the group minimum
    index = np.empty(len(groups), dtype=bool)
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar: mark the *last* element of each group instead
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), dtype=bool)
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
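
For example, applied to the arrays from the question (a usage sketch, assuming id and data are defined as numpy arrays as above):

print(group_max(id, data))  # [ 7 10  1]
print(group_min(id, data))  # [  2   8 -10]
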
泪意 2024-12-29 15:23:39

In pure Python:

from itertools import groupby
from operator import itemgetter as ig

print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]

A variation:

print([data[id == i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]

Based on @Bago's answer:

import numpy as np

# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]

# get max(): np.diff is non-zero exactly at group boundaries, and the
# appended True marks the last group; each marked element is a group max
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10  1]

If pandas is installed:

from pandas import DataFrame

df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1     7
# 2    10
# 3     1
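
As a follow-up, both reductions can be computed in one pass with agg (a short sketch using the standard pandas groupby API):

print(df.groupby('id')['data'].agg(['min', 'max']))
#     min  max
# id
# 1     2    7
# 2     8   10
# 3   -10    1
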
怼怹恏 2024-12-29 15:23:39

With only numpy and without loops:

import numpy as np
import pandas as pd  # only needed for the comparison below

id = np.asarray([1, 1, 1, 2, 2, 2, 3, 3])
data = np.asarray([2, 7, 3, 8, 9, 10, 1, -10])

# max: sort by id, find the first position of each id, then reduce
# each slice between consecutive positions with np.maximum
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)

# min: same slicing, reduced with np.minimum
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)

# compare results with pandas groupby
np_group = pd.DataFrame({'min': g_min, 'max': g_max}, index=_id)
pd_group = pd.DataFrame({'id': id, 'data': data}).groupby('id').agg(['min', 'max'])

(pd_group.values == np_group.values).all()  # True
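
To see why this works: np.unique with return_index=True yields the first position of each id in the sorted array, and reduceat applies the ufunc over each slice between consecutive positions. For the question's data (the values follow directly from the arrays above):

print(_pos)   # [0 3 6] -> slices data[0:3], data[3:6], data[6:]
print(g_max)  # [ 7 10  1]
print(g_min)  # [  2   8 -10]
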
风启觞 2024-12-29 15:23:39

I'm fairly new to Python and NumPy, but it seems like you can use the .at method of ufuncs rather than reduceat:

import numpy as np

data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))

# initialise with -inf rather than np.empty: maximum.at only overwrites
# a slot when it sees a larger value, so stale memory from np.empty
# would corrupt the result (zeros also work here since the data is
# non-negative, but -inf is safe in general)
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data_val)

For example:

data_val = array([ 0.65753453,  0.84279716,  0.88189818,  0.18987882,  0.49800668,
    0.29656994,  0.39542769,  0.43155428,  0.77982853,  0.44955868,
    0.22080219,  0.4807312 ,  0.9288989 ,  0.10956681,  0.73215416,
    0.33184318,  0.10936647])
ans = array([ 0.88189818,  0.49800668,  0.77982853,  0.9288989 ,  0.10956681,
    0.73215416])

Of course, this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge; presumably if they are large or sparse, you could initialize ans using np.unique(data_id) or something similar).

I should point out that data_id doesn't actually need to be sorted.
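
Applied to the question's data (an illustrative sketch; the ids start at 1, so slot 0 of the output simply keeps its initial value):

import numpy as np

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

g_max = np.full(id.max() + 1, -np.inf)
np.maximum.at(g_max, id, data)
print(g_max[1:])  # [ 7. 10.  1.]

g_min = np.full(id.max() + 1, np.inf)
np.minimum.at(g_min, id, data)
print(g_min[1:])  # [  2.   8. -10.]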

寄居者 2024-12-29 15:23:39

I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface. Plus, it has a lot more functionality as well:

import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)

And so on.
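
For instance, other reductions follow the same pattern (an illustrative sketch, assuming the package's GroupBy interface; each call returns the unique keys alongside the reduced values):

import numpy_indexed as npi

group_id, group_min_data = npi.group_by(id).min(data)
group_id, group_mean_data = npi.group_by(id).mean(data)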

苄①跕圉湢 2024-12-29 15:23:39

A slightly faster and more general answer than the accepted one; like the answer by joeln, it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster, though, considering the max/min isn't explicitly computed. The accepted solution's ability to ignore NaNs is neat, but one may also simply assign NaN values a dummy key.

import numpy as np

def group(key, value, operator=np.add):
    """
    Group the values by key.
    Any ufunc operator can be supplied to perform the reduction
    (np.maximum, np.minimum, np.subtract, and so on).
    Returns the unique keys, their corresponding per-key reduction
    over the operator, and the key counts.
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to reduce over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    # number of counts per key is the difference of our slice points;
    # cap off with the total number of keys for the last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count


names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

unique_keys, reduced_values, key_count = group(names, values)
print('per group mean')
print(reduced_values / key_count)
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print('per group min')
print(reduced_values)
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print('per group max')
print(reduced_values)
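
Applied to the question's arrays, a quick check (the expected values follow from the example data):

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

unique_keys, reduced, counts = group(id, data, np.maximum)
print(unique_keys)  # [1 2 3]
print(reduced)      # [ 7 10  1]
print(counts)       # [3 3 2]
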
篱下浅笙歌 2024-12-29 15:23:39

I think this accomplishes what you're looking for:

[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]

For the outer list comprehension, from right to left, set(id) collects the unique ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. Moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.

This iterates over the full data list for each id. With some preprocessing into sublists it might be possible to speed it up, but it won't be a one-liner then; one such pass is sketched below.
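
One possible preprocessing pass (a hypothetical sketch, not from the answer): bucket the values by id in a single loop, then take the max of each bucket.

from collections import defaultdict

buckets = defaultdict(list)
for k, v in zip(id, data):
    buckets[k].append(v)  # one pass over the data
print([max(buckets[k]) for k in sorted(buckets)])  # [7, 10, 1]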

深海夜未眠 2024-12-29 15:23:39

The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the last value of x written for each index in o, so that r[[0, 0]] = [1, 2] leaves r[0] = 2. It requires that your groups are integers from 0 to the number of groups - 1, as for numpy.bincount, and that there is a value for every group:

import numpy as np

def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    # sort data descending, so the *smallest* value in each group
    # is written last and therefore wins
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    # sort data ascending, so the *largest* value in each group wins
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
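
A usage sketch on the question's data (hypothetical; np.unique with return_inverse=True maps the 1-based ids to the 0..n_groups-1 labels these functions require):

import numpy as np

id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

unique_ids, groups = np.unique(id, return_inverse=True)
print(group_max(groups, data))  # [ 7. 10.  1.]
print(group_min(groups, data))  # [  2.   8. -10.]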