Pythonic way to aggregate arrays (numpy or not)
I would like to write a nice function to aggregate data in an array (it's a numpy record array, but that does not change anything).
Say you have an array of data that you want to aggregate along one axis, for example an array with dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)], and you want the mean income per job.
I wrote this function, which in the example would be called as aggregate(data, 'job', 'income', mean):
    def aggregate(data, key, value, func):
        data_per_key = {}
        for k, v in zip(data[key], data[value]):
            if k not in data_per_key.keys():
                data_per_key[k] = []
            data_per_key[k].append(v)
        return [(k, func(data_per_key[k])) for k in data_per_key.keys()]
The problem is that I don't find it very nice, and I would like to have it in one line. Do you have any ideas?
Thanks for your answers,
Louis
PS: I would like to keep func in the call so that you can also ask for the median, minimum, ...
Perhaps the function you are seeking is matplotlib.mlab.rec_groupby, which returns a recarray.
You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.
Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key, but you can do even better with defaultdict, which gets rid of the existence check entirely.
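The answerer's own listing is not shown here, so the following is only a minimal sketch of what a defaultdict-based aggregate might look like. The sample data is hypothetical, and a plain dict of columns stands in for the record array, on the assumption that both are indexed by field name in the same way.

```python
from collections import defaultdict
from statistics import mean


def aggregate(data, key, value, func):
    # Missing keys start out as an empty list, so no existence check is needed.
    data_per_key = defaultdict(list)
    for k, v in zip(data[key], data[value]):
        data_per_key[k].append(v)
    return [(k, func(values)) for k, values in data_per_key.items()]


# Hypothetical sample data (not from the original question).
data = {
    "name": ["Alice", "Bob", "Carol"],
    "job": ["dev", "dev", "ops"],
    "income": [50, 70, 40],
}

result = aggregate(data, "job", "income", mean)
print(result)
```

Because func stays a parameter, the same call works with max, min, or statistics.median.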
The best flexibility and readability comes from using pandas.
A pandas DataFrame is a great class to work with, but you can get your results back in whatever form you need, and so on...
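As a rough sketch of the pandas approach (the sample records and the aggregate wrapper are assumptions for illustration, not the answerer's original code), a group-by with a pluggable aggregation could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample records using the field names from the question.
data = np.array(
    [("Alice", "dev", 50), ("Bob", "dev", 70), ("Carol", "ops", 40)],
    dtype=[("name", "U8"), ("job", "U8"), ("income", np.uint32)],
)


def aggregate(data, key, value, func):
    # Build a DataFrame from the structured array, then group and aggregate.
    df = pd.DataFrame.from_records(data)
    return df.groupby(key)[value].agg(func)


means = aggregate(data, "job", "income", "mean")  # a Series indexed by job

# Convert the result to whatever shape is convenient.
as_dict = means.to_dict()
as_pairs = list(means.items())
print(as_dict)
```

Here func can be a name such as "mean", "median", or "min", which keeps the aggregation selectable at call time as the question asked.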
The best performance is achieved using ndimage.mean from scipy. This is about twice as fast as the accepted answer for this small dataset, and about 3.5 times faster for larger inputs.
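A sketch of how ndimage.mean could be applied here, assuming the job names are first mapped to integer labels with np.unique (the sample data is hypothetical, and no timings are claimed for this snippet):

```python
import numpy as np
from scipy import ndimage

# Hypothetical sample records using the field names from the question.
data = np.array(
    [("Alice", "dev", 50), ("Bob", "dev", 70), ("Carol", "ops", 40)],
    dtype=[("name", "U8"), ("job", "U8"), ("income", np.uint32)],
)

# jobs holds the sorted unique job names; labels maps each row to its job index.
jobs, labels = np.unique(data["job"], return_inverse=True)

# Mean income for each label 0..len(jobs)-1.
means = ndimage.mean(data["income"], labels=labels, index=np.arange(len(jobs)))

result = list(zip(jobs.tolist(), means.tolist()))
print(result)
```

This fixes the aggregation to the mean; scipy.ndimage also offers median, minimum, and maximum functions that take the same labels and index arguments.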
EDIT: with bincount (faster!)
This is about 5x faster than the accepted answer for the small example input; if you repeat the data 100000 times, it is around 8.5x faster.
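One way the bincount idea might be written, as a sketch under the assumption that only the mean is needed (sample data is hypothetical): the per-job sum divided by the per-job count.

```python
import numpy as np

# Hypothetical sample records using the field names from the question.
data = np.array(
    [("Alice", "dev", 50), ("Bob", "dev", 70), ("Carol", "ops", 40)],
    dtype=[("name", "U8"), ("job", "U8"), ("income", np.uint32)],
)

# jobs holds the sorted unique job names; labels maps each row to its job index.
jobs, labels = np.unique(data["job"], return_inverse=True)

sums = np.bincount(labels, weights=data["income"])  # total income per job
counts = np.bincount(labels)                        # number of rows per job
means = sums / counts

result = dict(zip(jobs.tolist(), means.tolist()))
print(result)
```

The trade-off is that bincount only covers aggregations expressible as sums and counts, so an arbitrary func such as the median cannot be plugged in this way.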
Update 2022:
There is a package that emulates the functionality of MATLAB's accumarray quite well. You can install it via
    pip install numpy_groupies
or find it here: https://github.com/ml31415/numpy-groupies
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method
should help make it a little prettier, more Pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.
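One possible reading of this hint, sketched with dict.setdefault (a close relative of dict.get that also stores the default). This is an assumption about what the answerer had in mind, and the sample data is hypothetical, with a dict of columns standing in for the record array:

```python
from statistics import median


def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        # setdefault returns the existing list, or stores and returns a new one.
        data_per_key.setdefault(k, []).append(v)
    return [(k, func(values)) for k, values in data_per_key.items()]


# Hypothetical sample data (not from the original question).
data = {
    "job": ["dev", "dev", "ops"],
    "income": [50, 70, 40],
}

result = aggregate(data, "job", "income", median)
print(result)
```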