How to group by a column with string data type in PySpark?

I want to group by multiple columns, but one of them has string-type values.
I'm posting a sample dataset here; the DataFrame I am using has multiple int columns and one string column.

Given DataFrame:

    # +----+----------------+
    # |Year|           Movie|
    # +----+----------------+
    # |2020|       Inception|
    # |2018|   The Godfather|
    # |2018| The Dark Knight|
    # |2015|    12 Angry Men|
    # |2020|Schindler's List|
    # |2015|    Pulp Fiction|
    # |2018|      Fight Club|
    # +----+----------------+

Required DataFrame:

    # +----+--------------------------------------------+
    # |Year|Movie                                       |
    # +----+--------------------------------------------+
    # |2020|[Inception, Schindler's List]               |
    # |2018|[The Godfather, The Dark Knight, Fight Club]|
    # |2015|[12 Angry Men, Pulp Fiction]                |
    # +----+--------------------------------------------+


Answer by 满意归宿 (2025-02-18):

You can group by year and use collect_set to gather all the items into lists:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# sqlContext is the legacy entry point; a SparkSession is the idiomatic one.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('2020', 'Inception'),
    ('2018', 'The Godfather'),
    ('2018', 'The Dark Knight'),
    ('2015', '12 Angry Men'),
    ('2020', 'Schindlers List'),
    ('2015', 'Pulp Fiction'),
    ('2018', 'Fight Club'),
], ["Year", "Movie"]) \
    .withColumn('Year', F.col('Year').cast('integer'))  # Movie is already a string
    
# +----+---------------+
# |Year|          Movie|
# +----+---------------+
# |2020|      Inception|
# |2018|  The Godfather|
# |2018|The Dark Knight|
# |2015|   12 Angry Men|
# |2020|Schindlers List|
# |2015|   Pulp Fiction|
# |2018|     Fight Club|
# +----+---------------+

df \
    .groupby("Year") \
    .agg(F.collect_set("Movie").alias("Movie")) \
    .show(truncate=False)

# +----+--------------------------------------------+
# |Year|Movie                                       |
# +----+--------------------------------------------+
# |2018|[The Godfather, Fight Club, The Dark Knight]|
# |2015|[Pulp Fiction, 12 Angry Men]                |
# |2020|[Schindlers List, Inception]                |
# +----+--------------------------------------------+
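One caveat: collect_set deduplicates values and guarantees no ordering, neither for the rows after the groupBy nor for the elements inside each array. If duplicates matter, or you want a deterministic order within each array, a minimal sketch using collect_list and array_sort (both standard pyspark.sql.functions, assuming the df built above) would be:

# collect_list keeps duplicate titles; array_sort makes the per-year
# order deterministic. groupby also accepts several columns at once,
# e.g. df.groupby("Year", "Studio") -- "Studio" is a hypothetical column.
df \
    .groupby("Year") \
    .agg(F.array_sort(F.collect_list("Movie")).alias("Movie")) \
    .show(truncate=False)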
