PySpark: group by multiple columns and compute a group number

Posted 2025-01-11 04:22:04

I have a dataframe like:

id Name Rank Course
1  S1   21   Physics
2  S2   22   Chemistry
3  S3   24   Math
4  S2   22   English
5  S2   22   Social
6  S1   21   Geography

I want to group this dataset over Name and Rank and compute a group number. In pandas, I can easily do:

df['ngrp'] = df.groupby(['Name', 'Rank']).ngroup()

After computing the above, I get the following output:

id Name Rank Course     ngrp
1  S1   21   Physics    0
6  S1   21   Geography  0
2  S2   22   Chemistry  1
4  S2   22   English    1
5  S2   22   Social     1
3  S3   24   Math       2
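
For reference, the pandas snippet above can be reproduced end to end with the sample data (a minimal sketch; column values copied from the table):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Name": ["S1", "S2", "S3", "S2", "S2", "S1"],
    "Rank": [21, 22, 24, 22, 22, 21],
    "Course": ["Physics", "Chemistry", "Math", "English", "Social", "Geography"],
})

# ngroup() numbers each (Name, Rank) group 0, 1, 2, ... in sorted key order
df["ngrp"] = df.groupby(["Name", "Rank"]).ngroup()
```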

Is there a method in PySpark that will achieve the same output? I tried the following, but it doesn't seem to work:

from pyspark.sql import Window
w = Window.partitionBy(['Name', 'Rank'])
df.select(['Name', 'Rank'], ['Course'], f.count(['Name', 'Rank']).over(w).alias('ngroup')).show()
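
A side note on why the attempt above cannot give a group number even once the `select` syntax is fixed: `count(...).over(w)` with a partition window yields the *size* of each group, not a group index. The distinction, illustrated in pandas terms on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["S1", "S2", "S3", "S2", "S2", "S1"],
    "Rank": [21, 22, 24, 22, 22, 21],
})

grouped = df.groupby(["Name", "Rank"])
df["group_size"] = grouped["Name"].transform("size")  # analogue of count(...).over(w)
df["ngrp"] = grouped.ngroup()                         # what the question asks for
```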


Comments (1)

清君侧 2025-01-18 04:22:04

You can opt for DENSE_RANK -

Data Preparation

from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

# `sql` is the active SparkSession
sql = SparkSession.builder.getOrCreate()

df = pd.read_csv(StringIO("""
id,Name,Rank,Course
1,S1,21,Physics
2,S2,22,Chemistry
3,S3,24,Math
4,S2,22,English
5,S2,22,Social
6,S1,21,Geography
"""), delimiter=',')

sparkDF = sql.createDataFrame(df)

sparkDF.show()
+---+----+----+---------+
| id|Name|Rank|   Course|
+---+----+----+---------+
|  1|  S1|  21|  Physics|
|  2|  S2|  22|Chemistry|
|  3|  S3|  24|     Math|
|  4|  S2|  22|  English|
|  5|  S2|  22|   Social|
|  6|  S1|  21|Geography|
+---+----+----+---------+

Dense Rank

window = Window.orderBy(['Name','Rank'])

sparkDF = sparkDF.withColumn('ngroup',F.dense_rank().over(window) - 1)

sparkDF.orderBy(['Name','ngroup']).show()

+---+----+----+---------+------+
| id|Name|Rank|   Course|ngroup|
+---+----+----+---------+------+
|  6|  S1|  21|Geography|     0|
|  1|  S1|  21|  Physics|     0|
|  4|  S2|  22|  English|     1|
|  2|  S2|  22|Chemistry|     1|
|  5|  S2|  22|   Social|     1|
|  3|  S3|  24|     Math|     2|
+---+----+----+---------+------+

Dense Rank - SparkSQL

# Register the DataFrame as a temp view so it can be queried as TB1
sparkDF.createOrReplaceTempView('TB1')

sql.sql("""
SELECT
    ID,
    NAME,
    RANK,
    COURSE,
    DENSE_RANK() OVER(ORDER BY NAME,RANK) - 1 as NGROUP
FROM TB1
""").show()

+---+----+----+---------+------+
| ID|NAME|RANK|   COURSE|NGROUP|
+---+----+----+---------+------+
|  1|  S1|  21|  Physics|     0|
|  6|  S1|  21|Geography|     0|
|  2|  S2|  22|Chemistry|     1|
|  4|  S2|  22|  English|     1|
|  5|  S2|  22|   Social|     1|
|  3|  S3|  24|     Math|     2|
+---+----+----+---------+------+
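
Conceptually, `DENSE_RANK() OVER (ORDER BY Name, Rank) - 1` assigns each distinct (Name, Rank) pair a consecutive index in sorted key order, which is the same numbering pandas' `ngroup()` produces. A plain-Python sketch of that mapping (sample rows copied from the question):

```python
rows = [
    (1, "S1", 21, "Physics"),
    (2, "S2", 22, "Chemistry"),
    (3, "S3", 24, "Math"),
    (4, "S2", 22, "English"),
    (5, "S2", 22, "Social"),
    (6, "S1", 21, "Geography"),
]

# Distinct (Name, Rank) keys in sorted order get consecutive indices 0, 1, 2, ...
keys = sorted({(name, rank) for _, name, rank, _ in rows})
ngroup = {key: i for i, key in enumerate(keys)}

labeled = [(rid, name, rank, course, ngroup[(name, rank)])
           for rid, name, rank, course in rows]
```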

