Pythonic way to aggregate an array (numpy or not)

Posted on 2024-08-13 03:19:09


I would like to make a nice function to aggregate data in an array (it's a numpy record array, but that does not change anything).

You have an array of data that you want to aggregate along one axis: for example an array of dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)], and you want the mean income per job.

I wrote this function; in the example it would be called as aggregate(data, 'job', 'income', mean):


def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key.keys():
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key.keys()]
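
For example, with a few made-up records (invented here just to show the call) it would be used like this:

import numpy as np

# toy data, made up for illustration only
data = np.array(
    [('Aaron', 'Digger', 1), ('Bill', 'Planter', 2), ('Carl', 'Waterer', 3)],
    dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)])

print(aggregate(data, 'job', 'income', np.mean))
# -> a list of (job, mean income) pairs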

The problem is that I don't find it very nice; I would like to have it in one line. Do you have any ideas?

Thanks for your answers. Louis

PS: I would like to keep the func in the call so that you can also ask for median, minimum...


Answers (6)

ι不睡觉的鱼゛ 2024-08-20 03:19:09


Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:

import matplotlib.mlab
import numpy as np

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

result=matplotlib.mlab.rec_groupby(data, ('job',), (('income',np.mean,'avg_income'),))

yields

('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)

matplotlib.mlab.rec_groupby returns a recarray:

print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]

You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.
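
Since the result is a record array, its fields can also be read directly by name; a minimal sketch, continuing from the result above:

# iterate over the grouped results field by field
for job, avg in zip(result['job'], result['avg_income']):
    print(job, avg)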

一曲爱恨情仇 2024-08-20 03:19:09


Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key, but you can do even better with defaultdict. Here's a version that uses defaultdict to get rid of the existence check:

import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k,v in zip(data[key], data[value]):
        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]
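
If a literal one-liner is really wanted, here is one possible sketch (not part of this answer, just an illustration using the standard library's itertools.groupby, which needs the pairs sorted by key first):

from itertools import groupby
from operator import itemgetter

def aggregate(data, key, value, func):
    # sort so that groupby sees each key as one contiguous run
    pairs = sorted(zip(data[key], data[value]), key=itemgetter(0))
    return [(k, func([v for _, v in g])) for k, g in groupby(pairs, key=itemgetter(0))]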
挽梦忆笙歌 2024-08-20 03:19:09


The best flexibility and readability come from using pandas:

import numpy as np
import pandas

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

df = pandas.DataFrame(data)
result = df.groupby('job').mean()

This yields:

         income
job
Digger      4.0
Planter     2.5
Waterer     3.0

The pandas DataFrame is a great class to work with, but you can also get your results back in whatever form you need:

result.to_records()
result.to_dict()
result.to_csv()

And so on...
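
Since the original question wants to keep func configurable, note that groupby also accepts arbitrary aggregation functions; a small sketch, reusing the df defined above:

# a single alternative statistic, or several at once
medians = df.groupby('job')['income'].median()
summary = df.groupby('job')['income'].agg(['mean', 'median', 'min'])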

凉城已无爱 2024-08-20 03:19:09


Best performance is achieved using ndimage.mean from scipy. This will be twice as fast as the accepted answer for this small dataset, and about 3.5 times faster for larger inputs:

import numpy as np
from scipy import ndimage

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

unique = np.unique(data['job'])
result=np.dstack([unique, ndimage.mean(data['income'], data['job'], unique)])

This yields:

array([[['Digger', '4.0'],
        ['Planter', '2.5'],
        ['Waterer', '3.0']]],
      dtype='|S32')

EDIT: with bincount (faster!)

This is about 5x faster than the accepted answer for the small example input; if you repeat the data 100,000 times it is around 8.5x faster:

unique, uniqueInd, uniqueCount = np.unique(data['job'], return_inverse=True, return_counts=True)
# sum the incomes per group with bincount's weights, then divide by the group sizes
means = np.bincount(uniqueInd, data['income']) / uniqueCount
result = np.dstack([unique, means])
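
If an arbitrary reducer (median, minimum, ...) is needed instead of the mean, scipy also provides ndimage.labeled_comprehension; a sketch under that assumption, reusing the data array above and mapping jobs to integer group codes first:

import numpy as np
from scipy import ndimage

# map each job name to an integer group code 0..n-1
labels, codes = np.unique(data['job'], return_inverse=True)
# apply an arbitrary reducer (here np.median) to every group
medians = ndimage.labeled_comprehension(
    data['income'], codes, np.arange(len(labels)), np.median, float, np.nan)
result = list(zip(labels, medians))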
有木有妳兜一样 2024-08-20 03:19:09


Update 2022:

There is a package which emulates the functionality of Matlab's accumarray quite well. You can install it via pip install numpy_groupies or find it here:

https://github.com/ml31415/numpy-groupies
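
A small sketch of how it might look for this example, assuming numpy_groupies' aggregate(group_idx, values, func=...) interface and the same data array as in the other answers:

import numpy as np
import numpy_groupies as npg

# map job names to integer group indices, then reduce incomes per group
labels, codes = np.unique(data['job'], return_inverse=True)
means = npg.aggregate(codes, data['income'], func='mean')
result = list(zip(labels, means))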

青衫负雪 2024-08-20 03:19:09


http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method

should help to make it a little prettier, more Pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.
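
In that spirit, the existence check in the original loop can be collapsed with dict.setdefault; a minimal sketch of just the loop:

data_per_key = {}
for k, v in zip(data[key], data[value]):
    # setdefault returns the existing list for k, or inserts and returns a new one
    data_per_key.setdefault(k, []).append(v)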
