Faster pandas groupby over two high-cardinality columns
I have a portion of code that does a groupby over two columns, both of which contain thousands of unique values. The dataset consists of several million rows.
gp = df.groupby(['col1', 'col2']).size()
The resulting gp is approximately 5-6 million rows, one for each combination of values of col1 and col2.
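
For concreteness, a minimal synthetic setup matching this description might look as follows (the dtypes, cardinality, and row count here are illustrative assumptions; the question does not pin them down):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 5_000_000   # "several million rows"
n_unique = 3_000     # "thousands of unique values" per column

df = pd.DataFrame({
    'col1': rng.integers(0, n_unique, n_rows),
    'col2': rng.integers(0, n_unique, n_rows),
})

# The operation in question
gp = df.groupby(['col1', 'col2']).size()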
This works perfectly well, but it is very slow (several minutes). Is there any way to achieve the same result faster?
Note that I can't simply spread the computation over multiple cores, as all cores are already busy (similar groupby operations run in parallel on all cores for the remaining columns).
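
One commonly suggested alternative for counting high-cardinality pairs, sketched below under the assumption that neither column contains NaN, is to factorize both key columns into integer codes and count the combined codes with NumPy's bincount, bypassing most of the groupby machinery:

import numpy as np
import pandas as pd

# Map each key column to dense integer codes in [0, number of uniques)
codes1, uniques1 = pd.factorize(df['col1'])
codes2, uniques2 = pd.factorize(df['col2'])

# Encode each (col1, col2) pair as a single non-negative integer
combined = codes1.astype(np.int64) * len(uniques2) + codes2

# Count occurrences of each pair code; this allocates an array of size
# len(uniques1) * len(uniques2), which can get large when both
# cardinalities are in the thousands
counts = np.bincount(combined)

# Keep only the pairs that actually occur and rebuild a labeled result
present = np.nonzero(counts)[0]
gp = pd.Series(
    counts[present],
    index=pd.MultiIndex.from_arrays(
        [uniques1[present // len(uniques2)], uniques2[present % len(uniques2)]],
        names=['col1', 'col2'],
    ),
)

A simpler first step is df.groupby(['col1', 'col2'], sort=False).size(), since sort defaults to True and sorting millions of group keys has a cost; whether either variant actually helps here would need benchmarking on the real data.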