Faster pandas groupby over two high-cardinality columns
I have a portion of code that does a groupby over two columns, both of which contain thousands of unique values. The dataset consists of several million rows.
gp = df.groupby(['col1', 'col2']).size()
The resulting gp is approximately 5-6 million rows, one for each combination of values of col1 and col2.
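
For concreteness, a minimal synthetic setup matching this description might look as follows (the dtypes, cardinality, and row count here are illustrative assumptions; the question does not pin them down):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 5_000_000   # "several million rows"
n_unique = 3_000     # "thousands of unique values" per column

df = pd.DataFrame({
    'col1': rng.integers(0, n_unique, n_rows),
    'col2': rng.integers(0, n_unique, n_rows),
})

# The operation in question
gp = df.groupby(['col1', 'col2']).size()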
This works perfectly well, but it is very slow (several minutes). Is there any way to achieve the same result faster?
Note that I can't simply spread the computation over multiple cores, as all cores are already busy (similar groupby operations run in parallel on all cores for the remaining columns).
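
One commonly suggested alternative for counting high-cardinality pairs, sketched below under the assumption that neither column contains NaN, is to factorize both key columns into integer codes and count the combined codes with NumPy's bincount, bypassing most of the groupby machinery:

import numpy as np
import pandas as pd

# Map each key column to dense integer codes in [0, number of uniques)
codes1, uniques1 = pd.factorize(df['col1'])
codes2, uniques2 = pd.factorize(df['col2'])

# Encode each (col1, col2) pair as a single non-negative integer
combined = codes1.astype(np.int64) * len(uniques2) + codes2

# Count occurrences of each pair code; this allocates an array of size
# len(uniques1) * len(uniques2), which can get large when both
# cardinalities are in the thousands
counts = np.bincount(combined)

# Keep only the pairs that actually occur and rebuild a labeled result
present = np.nonzero(counts)[0]
gp = pd.Series(
    counts[present],
    index=pd.MultiIndex.from_arrays(
        [uniques1[present // len(uniques2)], uniques2[present % len(uniques2)]],
        names=['col1', 'col2'],
    ),
)

A simpler first step is df.groupby(['col1', 'col2'], sort=False).size(), since sort defaults to True and sorting millions of group keys has a cost; whether either variant actually helps here would need benchmarking on the real data.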