rank python-polars

How to replace pandas df.rank(axis=1) with polars

Posted on 2025-01-22 04:29:16

Alpha factors sometimes need a cross-sectional rank (a rank across each row), like this:

import pandas as pd

df = pd.DataFrame(some_data)  # some_data: placeholder for your own numeric data

df.rank(axis=1, pct=True)

How can I implement this efficiently with polars?

爱格式化 2025-01-29 04:29:16

A polars DataFrame has the following properties:

columns consist of homogeneous data (i.e. every column has a single type).
rows consist of heterogeneous data (i.e. data types within a row may differ).

For this reason polars does not want the axis=1 API from pandas. It does not make much sense to compute the rank between numeric, string, boolean or even more complex nested types like structs and lists.

Pandas solves this by giving you a numeric_only keyword argument.
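
For comparison, here is a minimal sketch of the pandas side. It assumes pandas is installed and reuses the grades data from later in this answer; the names grades_pd and ranked_pd are only illustrative.

```python
import pandas as pd

grades_pd = pd.DataFrame({
    "student": ["bas", "laura", "tim", "jenny"],
    "arithmetic": [10, 5, 6, 8],
    "biology": [4, 6, 2, 7],
    "geography": [8, 4, 9, 7],
})

# numeric_only=True leaves the non-numeric "student" column out of the ranking,
# so only the three grade columns are ranked across each row.
ranked_pd = grades_pd.rank(axis=1, numeric_only=True, pct=True)
print(ranked_pd)
```

The result contains only the numeric columns, ranked in ascending order within each row.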

Polars is more opinionated and nudges you toward using the expression API.

Expression

Polars expressions work on columns, which are guaranteed to consist of homogeneous data. Rows in a DataFrame carry no such guarantee. Luckily, there is a data type that guarantees the values in a row are homogeneous: the pl.List data type.

Let's say we have the following data:

import polars as pl

grades = pl.DataFrame({
    "student": ["bas", "laura", "tim", "jenny"],
    "arithmetic": [10, 5, 6, 8],
    "biology": [4, 6, 2, 7],
    "geography": [8, 4, 9, 7]
})
print(grades)

shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ ---     ┆ ---        ┆ ---     ┆ ---       │
│ str     ┆ i64        ┆ i64     ┆ i64       │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         │
│ laura   ┆ 5          ┆ 6       ┆ 4         │
│ tim     ┆ 6          ┆ 2       ┆ 9         │
│ jenny   ┆ 8          ┆ 7       ┆ 7         │
└─────────┴────────────┴─────────┴───────────┘

If we want to compute the rank of all the columns except for student, we can collect those into a list data type:

grades.select(
    pl.concat_list(pl.exclude("student")).alias("all_grades")
)

This would give:

shape: (4, 1)
┌────────────┐
│ all_grades │
│ ---        │
│ list[i64]  │
╞════════════╡
│ [10, 4, 8] │
│ [5, 6, 4]  │
│ [6, 2, 9]  │
│ [8, 7, 7]  │
└────────────┘

Running polars expressions on list elements

We can run any polars expression on the elements of a list with the list.eval expression! These expressions run entirely on polars' query engine and can run in parallel, so they are fast.

Polars doesn't provide a keyword argument to compute the percentages of the ranks. But because expressions are so versatile, we can create our own percentage rank expression.

Note that we use pl.element(), which refers to the element being evaluated in a list.eval expression.

# the percentage rank expression
rank_pct = pl.element().rank(descending=True) / pl.element().count()

grades.with_columns(
    pl.concat_list(pl.exclude("student"))
      .list.eval(rank_pct).alias("grades_rank")
)

This outputs:

shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank                    │
│ ---     ┆ ---        ┆ ---     ┆ ---       ┆ ---                            │
│ str     ┆ i64        ┆ i64     ┆ i64       ┆ list[f64]                      │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         ┆ [0.333333, 1.0, 0.666667]      │
│ laura   ┆ 5          ┆ 6       ┆ 4         ┆ [0.666667, 0.333333, 1.0]      │
│ tim     ┆ 6          ┆ 2       ┆ 9         ┆ [0.666667, 1.0, 0.333333]      │
│ jenny   ┆ 8          ┆ 7       ┆ 7         ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘

Note that this approach works for any expression or operation you want to apply row-wise.
