How to group by a column with string data type in PySpark?

I want to group by multiple columns, but one of them has string-type values.
I'm posting a sample dataset here; the DataFrame I am using has multiple int columns and one string column.

Given DataFrame:

    # +----+----------------+
    # |Year|           Movie|
    # +----+----------------+
    # |2020|       Inception|
    # |2018|   The Godfather|
    # |2018| The Dark Knight|
    # |2015|    12 Angry Men|
    # |2020|Schindler's List|
    # |2015|    Pulp Fiction|
    # |2018|      Fight Club|
    # +----+----------------+

Required DataFrame:

    # +----+--------------------------------------------+
    # |Year|Movie                                       |
    # +----+--------------------------------------------+
    # |2020|[Inception, Schindler's List]               |
    # |2018|[The Godfather, The Dark Knight, Fight Club]|
    # |2015|[12 Angry Men, Pulp Fiction]                |
    # +----+--------------------------------------------+


Answer by 满意归宿 (2025-02-18):

You can group by year and use collect_set to gather all the items into lists:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# sqlContext is the legacy entry point; a SparkSession is the idiomatic one.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('2020', 'Inception'),
    ('2018', 'The Godfather'),
    ('2018', 'The Dark Knight'),
    ('2015', '12 Angry Men'),
    ('2020', 'Schindlers List'),
    ('2015', 'Pulp Fiction'),
    ('2018', 'Fight Club'),
], ["Year", "Movie"]) \
    .withColumn('Year', F.col('Year').cast('integer'))  # Movie is already a string
    
# +----+---------------+
# |Year|          Movie|
# +----+---------------+
# |2020|      Inception|
# |2018|  The Godfather|
# |2018|The Dark Knight|
# |2015|   12 Angry Men|
# |2020|Schindlers List|
# |2015|   Pulp Fiction|
# |2018|     Fight Club|
# +----+---------------+

df \
    .groupby("Year") \
    .agg(F.collect_set("Movie").alias("Movie")) \
    .show(truncate=False)

# +----+--------------------------------------------+
# |Year|Movie                                       |
# +----+--------------------------------------------+
# |2018|[The Godfather, Fight Club, The Dark Knight]|
# |2015|[Pulp Fiction, 12 Angry Men]                |
# |2020|[Schindlers List, Inception]                |
# +----+--------------------------------------------+
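One caveat: collect_set deduplicates values and guarantees no ordering, neither for the rows after the groupBy nor for the elements inside each array. If duplicates matter, or you want a deterministic order within each array, a minimal sketch using collect_list and array_sort (both standard pyspark.sql.functions, assuming the df built above) would be:

# collect_list keeps duplicate titles; array_sort makes the per-year
# order deterministic. groupby also accepts several columns at once,
# e.g. df.groupby("Year", "Studio") -- "Studio" is a hypothetical column.
df \
    .groupby("Year") \
    .agg(F.array_sort(F.collect_list("Movie")).alias("Movie")) \
    .show(truncate=False)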
