Using pandas, how do I subsample a large DataFrame by group in an efficient manner?

I am trying to subsample rows of a DataFrame according to a grouping. Here is an example. Say I define the following data:

from pandas import *
df = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
                            "c","a","a","a","b","b","b","b"],
                'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
                'value'  : ["apple","pear","orange","apple",
                            "banana","durian","lemon","lime",
                            "raspberry","durian","peach","nectarine",
                            "banana","lemon","guava","blackberry","grape"]})

If I group by group1 and group2, then the number of rows in each group is as follows:

In [190]: df.groupby(['group1','group2'])['value'].agg({'count':len})
Out[190]: 
      count
a  1  2    
   2  1    
   3  2    
   4  1    
b  1  2    
   2  2    
   3  1    
   4  1    
c  3  1    
   4  1    
   5  2    
   6  1    

(If there is an even more concise way to compute that, please tell.)
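(As an aside, the same per-group counts can be computed more concisely with GroupBy.size(); a minimal sketch, assuming the df defined above:

counts = df.groupby(['group1', 'group2']).size()  # Series indexed by (group1, group2)
print(counts)

This gives one row count per group without the dict-style .agg call.)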

I now want to construct a DataFrame that has one randomly selected row from each group. My proposal is to do it like so:

In [215]: from random import choice
In [216]: grouped = df.groupby(['group1','group2'])
In [217]: subsampled = grouped.apply(lambda x: df.reindex(index=[choice(range(len(x)))]))
In [218]: subsampled.index = range(len(subsampled))
In [219]: subsampled
Out[219]: 
    group1  group2  value
0   b       2       pear 
1   a       1       apple
2   b       2       pear 
3   a       1       apple
4   a       1       apple
5   a       1       apple
6   a       1       apple
7   a       1       apple
8   a       1       apple
9   a       1       apple
10  a       1       apple
11  a       1       apple

which works. However, my real data has about 2.5 million rows and 12 columns. If I do this the dirty way by building my own data structures, I can complete this operation in a matter of seconds. However, my implementation above does not finish within 30 minutes (and does not appear to be memory-limited). As a side note, when I tried implementing this in R, I first tried plyr, which also did not finish in a reasonable amount of time; however, a solution using data.table finished very rapidly.

How do I get this to work rapidly with pandas? I want to love this package, so please help!


Answer by 假装不在乎 (2024-12-13 14:31:30)

I tested with apply, and it seems that when there are many sub-groups it is very slow. The groups attribute of grouped is a dict, so you can choose an index directly from it:

subsampled = df.ix[(choice(x) for x in grouped.groups.itervalues())]

EDIT: As of pandas version 0.18.1, itervalues no longer works on groupby objects - you can just use .values:

subsampled = df.ix[(choice(x) for x in grouped.groups.values())]
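For readers on current pandas: the .ix indexer used above was removed in pandas 1.0, and Python 3 dicts have no itervalues, so the same idea can be written with df.loc, or more directly with GroupBy.sample (available since pandas 1.1). A minimal sketch, assuming the df from the question (picks is an illustrative name, not from the original):

import numpy as np

# One random row per (group1, group2) group, using the built-in groupby API:
subsampled = df.groupby(['group1', 'group2']).sample(n=1, random_state=0)

# Equivalent manual version in the spirit of the answer above:
grouped = df.groupby(['group1', 'group2'])
picks = [np.random.choice(idx) for idx in grouped.groups.values()]  # one index label per group
subsampled_manual = df.loc[picks]

Both variants select rows by label rather than applying a Python function per group, which avoids the per-group overhead of apply.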