Dask DataFrame groupby-agg with known divisions
When running groupby-agg on a Dask DataFrame, the resulting Dask DataFrame has unknown divisions. If the groupby is run on a single column, is it possible to get an indexed Dask DataFrame with known divisions on that column?
from dask.datasets import timeseries
df = timeseries()
a = df.groupby('name').agg('sum')
print(a.known_divisions) # False
To get a dataframe with known divisions, it's possible to do:
b = a.reset_index().set_index('name')
print(b.known_divisions) # True
However, the .set_index operation shuffles the data, and it might be possible to avoid that shuffle by handling it at the time of the groupby-agg.
Is there any other set of operations that will eventually give the aggregated dataframe such that a.known_divisions == True?
Update: the specific use case I have in mind is when the unique values of the name column are known in advance (and there could be very many of them). For example, suppose there are a million names. It would be great to get the groupby-agg result laid out so that all names starting with 'A' are in one partition, those starting with 'B' are in the second partition, and so on.
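To make the desired layout concrete, here is a minimal standard-library sketch of how a sorted list of division boundaries maps each name to a partition. It is not Dask's implementation; the helper `partition_for` and the sample `divisions` and `names` are hypothetical and exist only for illustration. It assumes the usual convention that partition i covers the half-open range from `divisions[i]` to `divisions[i + 1]`, with the last partition also including its upper boundary.

```python
from bisect import bisect_right


def partition_for(key, divisions):
    """Return the index of the partition whose range contains `key`.

    `divisions` holds npartitions + 1 sorted boundaries. Partition i covers
    [divisions[i], divisions[i + 1]); the last partition also includes its
    upper boundary. (Hypothetical helper, not part of Dask.)
    """
    if not divisions[0] <= key <= divisions[-1]:
        raise ValueError(f"{key!r} is outside the divisions")
    npartitions = len(divisions) - 1
    # bisect_right counts the boundaries <= key; clamp so that a key equal
    # to the final boundary lands in the last partition.
    return min(bisect_right(divisions, key) - 1, npartitions - 1)


# Hypothetical boundaries: three partitions for names starting with A, B, C.
divisions = ["A", "B", "C", "D"]
names = ["Alice", "Adam", "Bob", "Charlie", "Bella"]

assignment = {name: partition_for(name, divisions) for name in names}
for name, part in sorted(assignment.items()):
    print(f"{name} -> partition {part}")
```

With boundaries like these known up front, every row for a given name has exactly one target partition. Dask's set_index accepts a divisions argument for passing such boundaries explicitly; whether that is enough to avoid the shuffle in question is not something this sketch settles.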