How to hash a string column in Polars or PyArrow

Posted on 2025-01-27 13:56:34

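I have a pandas DataFrame / Polars DataFrame / PyArrow table with a string key column. You can assume the strings are random. I would like to partition this DataFrame into n smaller subsets based on this key column.

With an integer column, I can just use df1 = df[df.key % n == 1], df2 = df[df.key % n == 2], and so on.

My best guess at how to do this with a string column is to apply a hash function (e.g. summing the ASCII values of the string) to convert it to an integer column, and then use the modulus.

Please let me know the most efficient way to do this in pandas, Polars, or PyArrow, ideally with pure columnar operations within the API. Doing this row by row on the DataFrame is probably too slow for my use case.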

梅倚清风 2025-02-03 13:56:34

I have a small addition to @cbilot's answer. Polars has a hash expression, so computing a partition id is trivial.

If you combine that with partition_by, you can create the partitions very quickly with:

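import polars as pl

df = pl.DataFrame({
    "strings": ["A", "A", "B", "A"],
    "payload": [1, 2, 3, 4]
})

# Number of partitions
N = 2
# Hash the string column, take it modulo N, then split on the result
(df.with_columns(
     (pl.col("strings").hash() % N).alias("partition_id")
).partition_by("partition_id"))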

[shape: (3, 3)
 ┌─────────┬─────────┬──────────────┐
 │ strings ┆ payload ┆ partition_id │
 │ ---     ┆ ---     ┆ ---          │
 │ str     ┆ i64     ┆ u64          │
 ╞═════════╪═════════╪══════════════╡
 │ A       ┆ 1       ┆ 0            │
 │ A       ┆ 2       ┆ 0            │
 │ A       ┆ 4       ┆ 0            │
 └─────────┴─────────┴──────────────┘,
 shape: (1, 3)
 ┌─────────┬─────────┬──────────────┐
 │ strings ┆ payload ┆ partition_id │
 │ ---     ┆ ---     ┆ ---          │
 │ str     ┆ i64     ┆ u64          │
 ╞═════════╪═════════╪══════════════╡
 │ B       ┆ 3       ┆ 1            │
 └─────────┴─────────┴──────────────┘]

The grouping and the materialization of the partitions will be done in parallel.

你怎么这么可爱啊 2025-02-03 13:56:34

I would try using hash_rows to see how it performs on your dataset and computing platform. (Note that in the calculation, I'm effectively selecting only the key field and running the hash_rows on that)

import polars as pl

N = 50
# df is your existing DataFrame with a string column named 'key'
df = df.with_columns(
    pl.lit(df.select('key').hash_rows() % N).alias('hash')
)

I just ran this on a dataset with almost 49 million records on a 32-core system, and it completed within seconds. (The 'key' field in my dataset was last names of people.)

I should also note, there's a partition_by method that may be of help in the partitioning.
