How do I use groupBy in PySpark on a column with a string data type?
I want to group by multiple columns, but one of them holds string values.
Here is a sample dataset; the DataFrame I am actually working with has several int columns and one string column.
Given DataFrame:
# |Year| Movie|
# +----+----------------+
# |2020| Inception|
# |2018| The Godfather|
# |2018| The Dark Knight|
# |2015| 12 Angry Men|
# |2020|Schindler's List|
# |2015| Pulp Fiction|
# |2018| Fight Club|
Required DataFrame:
# |Year|Movie |
# +----+--------------------------------------------+
# |2020|[Inception, Schindler's List] |
# |2018|[The Godfather, The Dark Knight, Fight Club]|
# |2015|[12 Angry Men, Pulp Fiction] |
Comments (1)
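You can group by Year and use collect_set to gather all the movies for each year into a list. Note that collect_set drops duplicate values and does not guarantee element order; collect_list keeps duplicates.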