Dask DataFrame groupby-agg with known divisions

Posted 2025-02-02 13:46:05

When running groupby-agg on a Dask dataframe, the resulting Dask dataframe does not have known divisions. If the groupby is run on a single column, is it possible to get a result whose index has known divisions?

from dask.datasets import timeseries
df = timeseries()
a = df.groupby('name').agg('sum')
print(a.known_divisions)  # False

To get the indexed dataframe, it's possible to do:

b = a.reset_index().set_index('name')
print(b.known_divisions)  # True

However, the .set_index operation shuffles the data, and it seems that this shuffle could be avoided if the divisions were established at the time of the groupby-agg.

Is there any other set of operations that gives the aggregated dataframe such that a.known_divisions == True?

Update: the specific use case I have in mind is when the unique values of the name column are known in advance (and there could be a very large number of them). For example, suppose there are a million names. It would be great to get the groupby-agg result laid out so that all names starting with 'A' are in one partition, all names starting with 'B' are in the second partition, and so on.
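
To make the desired layout concrete, here is a minimal pure-Python sketch of how a sorted list of division boundaries would map each name to a partition. It does not call Dask at all. The helper partition_for, the boundary list, and the sample names are hypothetical, and it assumes the convention that partition i covers divisions[i] <= key < divisions[i + 1], with the last partition also including the final boundary.

```python
from bisect import bisect_right


def partition_for(key, divisions):
    """Return the index of the partition that would hold `key`.

    Assumed convention: partition i covers
    divisions[i] <= key < divisions[i + 1], and the last partition
    also includes the final boundary value.
    """
    n_partitions = len(divisions) - 1
    idx = bisect_right(divisions, key) - 1
    # Clamp keys that fall on or outside the outer boundaries.
    return min(max(idx, 0), n_partitions - 1)


# Hypothetical boundaries: one partition per leading letter A, B, C.
divisions = ["A", "B", "C", "D"]
names = ["Alice", "Bob", "Charlie", "Adam", "Brenda"]

# Group the sample names by the partition they would land in.
partitions = {}
for name in names:
    partitions.setdefault(partition_for(name, divisions), []).append(name)

print(partitions)
```

With boundaries like these, every name starting with 'A' lands in partition 0, every name starting with 'B' in partition 1, and so on, which is the layout I would like the aggregated result to have.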
