Using pandas, how do I subsample a large DataFrame by group in an efficient manner?
I am trying to subsample rows of a DataFrame according to a grouping. Here is an example. Say I define the following data:
from pandas import *
df = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
                            "c","a","a","a","b","b","b","b"],
                'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
                'value' : ["apple","pear","orange","apple",
                           "banana","durian","lemon","lime",
                           "raspberry","durian","peach","nectarine",
                           "banana","lemon","guava","blackberry","grape"]})
If I group by group1 and group2, then the number of rows in each group is as follows:
In [190]: df.groupby(['group1','group2'])['value'].agg({'count':len})
Out[190]:
               count
group1 group2
a      1           2
       2           1
       3           2
       4           1
b      1           2
       2           2
       3           1
       4           1
c      3           1
       4           1
       5           2
       6           1
(If there is an even more concise way to compute that, please tell.)
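For what it's worth, a more concise spelling, sketched here on the assumption of a reasonably recent pandas (this snippet is not part of the original post): the size method on the groupby object gives the same counts as a Series.

df.groupby(['group1', 'group2']).size()  # the same per-group counts, as a Series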
I now want to construct a DataFrame that has one randomly selected row from each group. My proposal is to do it like so:
In [215]: from random import choice
In [216]: grouped = df.groupby(['group1','group2'])
In [217]: subsampled = grouped.apply(lambda x: df.reindex(index=[choice(range(len(x)))]))
In [218]: subsampled.index = range(len(subsampled))
In [219]: subsampled
Out[219]:
   group1  group2  value
0       b       2   pear
1       a       1  apple
2       b       2   pear
3       a       1  apple
4       a       1  apple
5       a       1  apple
6       a       1  apple
7       a       1  apple
8       a       1  apple
9       a       1  apple
10      a       1  apple
11      a       1  apple
which works. However, my real data has about 2.5 million rows and 12 columns. If I do this the dirty way by building my own data structures, I can complete this operation in a matter of seconds. However, my implementation above does not finish within 30 minutes (and does not appear to be memory-limited). As a side note, when I tried implementing this in R, I first tried plyr, which also did not finish in a reasonable amount of time; however, a solution using data.table finished very rapidly.
How do I get this to work rapidly with pandas? I want to love this package, so please help!
Answers (1)
I tested with apply; it seems that when there are many subgroups, it's very slow. The groups attribute of grouped is a dict, so you can choose an index directly from it:
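A minimal sketch of the approach described (the answer's original snippet is not preserved in this transcription), assuming the df and grouped of the question and the Python 2 / pre-0.18.1 pandas the answer was written against, where grouped.groups is a plain dict mapping each group key to a list of row labels:

from random import choice

# grouped.groups looks like {('a', 1): [0, 10], ('a', 2): [11], ...}
# draw one row label per group, then fetch all sampled rows in one indexing call
subsampled = df.reindex([choice(idxs) for idxs in grouped.groups.itervalues()])

Because this works only with the precomputed label lists, it skips the per-group DataFrame construction that makes apply slow.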
EDIT: As of pandas version 0.18.1, itervalues no longer works on groupby objects - you can just use .values:
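The corresponding sketch for Python 3 and pandas >= 0.18.1, where the values of grouped.groups are Index objects (again an illustration rather than the answer's verbatim code):

from random import choice

grouped = df.groupby(['group1', 'group2'])

# .values() on the groups dict; random.choice also works on an Index
subsampled = df.loc[[choice(idxs) for idxs in grouped.groups.values()]]

On pandas 1.1 and later there is also a built-in that does this directly: df.groupby(['group1', 'group2']).sample(n=1).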