Pythonic way to aggregate an array (numpy or not)

Posted on 2024-08-13 03:19:09


I would like to make a nice function to aggregate data in an array (it's a numpy record array, but that does not change anything).

You have an array of data that you want to aggregate along one axis: for example an array of dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)], and you want the mean income per job.

I wrote this function; in the example it would be called as aggregate(data, 'job', 'income', mean):


def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key.keys():
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key.keys()]
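
For example, with a few made-up records (invented here just to show the call) it would be used like this:

import numpy as np

# toy data, made up for illustration only
data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3)],
    dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)])

print(aggregate(data, 'job', 'income', np.mean))
# -> a list of (job, mean income) pairs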

The problem is that I don't find it very nice; I would like to have it in one line. Do you have any ideas?

Thanks for your answers. Louis

PS: I would like to keep the func in the call so that you can also ask for median, minimum...


Answers (6)

ι不睡觉的鱼゛ 2024-08-20 03:19:09


Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:

import matplotlib.mlab
import numpy as np

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

result=matplotlib.mlab.rec_groupby(data, ('job',), (('income',np.mean,'avg_income'),))

yields

('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)

matplotlib.mlab.rec_groupby returns a recarray:

print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]

You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.
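
Since the result is a record array, its fields can also be read directly by name; a minimal sketch, continuing from the result above:

# iterate over the grouped results field by field
for job, avg in zip(result['job'], result['avg_income']):
    print(job, avg)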

一曲爱恨情仇 2024-08-20 03:19:09


Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key, but you can do even better with defaultdict. Here's a version that uses defaultdict to get rid of the existence check:

import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k,v in zip(data[key], data[value]):
        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]
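
If a literal one-liner is really wanted, here is one possible sketch (not part of this answer, just an illustration using the standard library's itertools.groupby, which needs the pairs sorted by key first):

from itertools import groupby
from operator import itemgetter

def aggregate(data, key, value, func):
    # sort so that groupby sees each key as one contiguous run
    pairs = sorted(zip(data[key], data[value]), key=itemgetter(0))
    return [(k, func([v for _, v in g])) for k, g in groupby(pairs, key=itemgetter(0))]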
挽梦忆笙歌 2024-08-20 03:19:09


The best flexibility and readability come from using pandas:

import numpy as np
import pandas

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

df = pandas.DataFrame(data)
result = df.groupby('job').mean()

This yields:

         income
job
Digger      4.0
Planter     2.5
Waterer     3.0

The pandas DataFrame is a great class to work with, but you can also get your results back in whatever form you need:

result.to_records()
result.to_dict()
result.to_csv()

And so on...
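
Since the original question wants to keep func configurable, note that groupby also accepts arbitrary aggregation functions; a small sketch, reusing the df defined above:

# a single alternative statistic, or several at once
medians = df.groupby('job')['income'].median()
summary = df.groupby('job')['income'].agg(['mean', 'median', 'min'])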

凉城已无爱 2024-08-20 03:19:09


Best performance is achieved using ndimage.mean from scipy. This will be twice as fast as the accepted answer for this small dataset, and about 3.5 times faster for larger inputs:

import numpy as np
from scipy import ndimage

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

unique = np.unique(data['job'])
result=np.dstack([unique, ndimage.mean(data['income'], data['job'], unique)])

This yields:

array([[['Digger', '4.0'],
        ['Planter', '2.5'],
        ['Waterer', '3.0']]],
      dtype='|S32')

EDIT: with bincount (faster!)

This is about 5x faster than the accepted answer for the small example input; if you repeat the data 100,000 times it is around 8.5x faster:

unique, uniqueInd, uniqueCount = np.unique(data['job'], return_inverse=True, return_counts=True)
# sum the incomes per group with bincount's weights, then divide by the group sizes
means = np.bincount(uniqueInd, data['income']) / uniqueCount
result = np.dstack([unique, means])
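
If an arbitrary reducer (median, minimum, ...) is needed instead of the mean, scipy also provides ndimage.labeled_comprehension; a sketch under that assumption, reusing the data array above and mapping jobs to integer group codes first:

import numpy as np
from scipy import ndimage

# map each job name to an integer group code 0..n-1
labels, codes = np.unique(data['job'], return_inverse=True)
# apply an arbitrary reducer (here np.median) to every group
medians = ndimage.labeled_comprehension(
    data['income'], codes, np.arange(len(labels)), np.median, float, np.nan)
result = list(zip(labels, medians))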
有木有妳兜一样 2024-08-20 03:19:09


Update 2022:

There is a package which emulates the functionality of Matlab's accumarray quite well. You can install it via pip install numpy_groupies or find it here:

https://github.com/ml31415/numpy-groupies
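
A small sketch of how it might look for this example, assuming numpy_groupies' aggregate(group_idx, values, func=...) interface and the same data array as in the other answers:

import numpy as np
import numpy_groupies as npg

# map job names to integer group indices, then reduce incomes per group
labels, codes = np.unique(data['job'], return_inverse=True)
means = npg.aggregate(codes, data['income'], func='mean')
result = list(zip(labels, means))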

青衫负雪 2024-08-20 03:19:09


http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method

should help to make it a little prettier, more Pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.
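
In that spirit, the existence check in the original loop can be collapsed with dict.setdefault; a minimal sketch of just the loop:

data_per_key = {}
for k, v in zip(data[key], data[value]):
    # setdefault returns the existing list for k, or inserts and returns a new one
    data_per_key.setdefault(k, []).append(v)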
