I have a Pandas DataFrame / Polars DataFrame / PyArrow table with a string key column. You can assume the strings are random. I want to partition that dataframe into N smaller dataframes based on this key column.
With an integer column, I can just use df1 = df[df.key % N == 1], df2 = df[df.key % N == 2], etc.
My best guess at how to do that with a string column is to apply a hash function (e.g. summing the ASCII values of the string) to convert it to an integer column, then use the modulus.
Please let me know the most efficient way to do this in Pandas, Polars or PyArrow, ideally with pure columnar operations within the API. Doing a df.apply is likely too slow for my use case.
Comments (2)
I have a small addition to @cbilots' answer. Polars has a hash expression, so computing a partition id would be trivial. If you combine that with partition_by, you can create the partitions at blazing speed. The grouping and the materialization of the partitions will be done in parallel.
I would try using hash_rows to see how it performs on your dataset and computing platform. (Note that in the calculation, I'm effectively selecting only the key field and running hash_rows on that.) I just ran this on a dataset with almost 49 million records on a 32-core system, and it completed within seconds. (The 'key' field in my dataset was last names of people.)
I should also note that there's a partition_by method that may be of help in the partitioning.