PySpark: group by multiple columns and compute a group number
I have a dataframe like:
id Name Rank Course
1 S1 21 Physics
2 S2 22 Chemistry
3 S3 24 Math
4 S2 22 English
5 S2 22 Social
6 S1 21 Geography
I want to group this dataset by Name and Rank and assign a group number to each group. In pandas, I can do this easily:
df['ngrp'] = df.groupby(['Name', 'Rank']).ngroup()
After computing the above, I get the following output:
id Name Rank Course ngrp
1 S1 21 Physics 0
6 S1 21 Geography 0
2 S2 22 Chemistry 1
4 S2 22 English 1
5 S2 22 Social 1
3 S3 24 Math 2
Is there a method in Pyspark that will achieve the same output? I tried the following, but it doesn't seem to work:
from pyspark.sql import Window
w = Window.partitionBy(['Name', 'Rank'])
df.select(['Name', 'Rank'], ['Course'], f.count(['Name', 'Rank']).over(w).alias('ngroup')).show()
1 Answer
You can opt for DENSE_RANK -
Data Preparation
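A minimal sketch of the data preparation step, assuming an active SparkSession named spark and the sample rows from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows taken from the question
data = [
    (1, "S1", 21, "Physics"),
    (2, "S2", 22, "Chemistry"),
    (3, "S3", 24, "Math"),
    (4, "S2", 22, "English"),
    (5, "S2", 22, "Social"),
    (6, "S1", 21, "Geography"),
]

df = spark.createDataFrame(data, ["id", "Name", "Rank", "Course"])
df.show()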
Dense Rank
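A sketch of the DataFrame API version: dense_rank() over a window ordered by the grouping columns assigns consecutive numbers per distinct (Name, Rank) pair, and subtracting 1 makes it start at 0 like pandas ngroup(). The window name w and the output column name ngrp are illustrative.

import pyspark.sql.functions as F
from pyspark.sql import Window

# No partitionBy: all rows share one window ordered by the grouping columns,
# so dense_rank() yields one number per distinct (Name, Rank) combination.
w = Window.orderBy("Name", "Rank")

df_grouped = df.withColumn("ngrp", F.dense_rank().over(w) - 1)
df_grouped.orderBy("ngrp", "id").show()

Note that an unpartitioned ordered window moves all rows into a single partition, so Spark logs a performance warning; for a global group id over a modest dataset this is expected.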
Dense Rank - SparkSQL
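The same logic expressed in Spark SQL, assuming the DataFrame is registered as a temporary view (the view name student is just an example):

df.createOrReplaceTempView("student")

spark.sql("""
    SELECT id, Name, Rank, Course,
           DENSE_RANK() OVER (ORDER BY Name, Rank) - 1 AS ngrp
    FROM student
    ORDER BY ngrp, id
""").show()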