如何在 pyspark 中对包含数据帧的数组的行进行相交
我有一个数据框,
df = spark.createDataFrame(
[(2022, 1, 3, '01', ['apple', 'banana', 'orange'],
[['apple', 'edible', 'fruit', 'green'], ['largest', 'herbaceous', 'flowering', 'plant', 'Vitamin B', 'fruit'],
['source', 'Vitamin C', 'fruit']], [['fruit', 2], ['Vitamin', 2]]),
(2022, 1, 3, '02', ['apple', 'banana', 'avocado'],
[['apple', 'edible', 'fruit', 'green'], ['largest', 'herbaceous', 'flowering', 'plant', 'Vitamin B', 'fruit'],
['medium', 'dark', 'green', 'fruit']], [['fruit', 3], ['green', 2]]),
(2022, 2, 4, '03', ['pomelo', 'fig'],
[['citrus', 'fruit', 'sweet'], ['soft', 'sweet']], [['sweet', 2]]), ],
['year', 'month', 'day', 'id', "list_of_fruits",
'collected_tokens', 'most_common_word']
)
+----+-----+---+---+------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------+
|year|month|day|id |list_of_fruits |collected_tokens |most_common_word |
+----+-----+---+---+------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------+
|2022|1 |3 |01 |[apple, banana, orange] |[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit], [source, Vitamin C, fruit]] |[[fruit, 2], [Vitamin, 2]]|
|2022|1 |3 |02 |[apple, banana, avocado]|[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit], [medium, dark, green, fruit]]|[[fruit, 3], [green, 2]] |
|2022|2 |4 |03 |[pomelo, fig] |[[citrus, fruit, sweet], [soft, sweet]] |[[sweet, 2]] |
+----+-----+---+---+------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------
我想按年、日和月进行分组,并与包含列表、列表列表和带有键和最小值的列表(分别是最后三列)的行相交。最后,我想要这个结果
+----+-----+---+---+---------------------------+------------------------------------------------------------------------------------------+-----------------------------+
|year|month|day|id |intersection_list_of_fruits|intersection_collected_tokens |intersection_most_common_word|
+----+-----+---+---+---------------------------+------------------------------------------------------------------------------------------+-----------------------------+
|2022|1 |3 |01 |[apple, banana] |[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit]]|[[fruit, 2]] |
|2022|1 |3 |02 |[apple, banana] |[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit]]|[[fruit, 2]] |
|2022|2 |4 |03 |[pomelo, fig] |[[citrus, fruit, sweet], [soft, sweet]] |[[sweet, 2]] |
+----+-----+---+---+---------------------------+------------------------------------------------------------------------------------------+-----------------------------+
所以在intersection_list_of_fruits
列中缺少[orange],[avocado]
,在intersection_collected_tokens
列中缺少<代码>[来源、维生素 C、水果]、[中等、深色、绿色、水果] 且在 intersection_most_common_word
列中缺失[维生素,2],[绿色,2]
。
我了解 array_intersect
,但我需要查看按行交集,并且还需要使用聚合函数,因为 groupby - 将具有相同日期和时间的 id 分组使它们相交。 (我认为这可以使用spark的applyInPandas函数来完成)
I have a dataframe
df = spark.createDataFrame(
[(2022, 1, 3, '01', ['apple', 'banana', 'orange'],
[['apple', 'edible', 'fruit', 'green'], ['largest', 'herbaceous', 'flowering', 'plant', 'Vitamin B', 'fruit'],
['source', 'Vitamin C', 'fruit']], [['fruit', 2], ['Vitamin', 2]]),
(2022, 1, 3, '02', ['apple', 'banana', 'avocado'],
[['apple', 'edible', 'fruit', 'green'], ['largest', 'herbaceous', 'flowering', 'plant', 'Vitamin B', 'fruit'],
['medium', 'dark', 'green', 'fruit']], [['fruit', 3], ['green', 2]]),
(2022, 2, 4, '03', ['pomelo', 'fig'],
[['citrus', 'fruit', 'sweet'], ['soft', 'sweet']], [['sweet', 2]]), ],
['year', 'month', 'day', 'id', "list_of_fruits",
'collected_tokens', 'most_common_word']
)
+----+-----+---+---+------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------+
|year|month|day|id |list_of_fruits |collected_tokens |most_common_word |
+----+-----+---+---+------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------+
|2022|1 |3 |01 |[apple, banana, orange] |[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit], [source, Vitamin C, fruit]] |[[fruit, 2], [Vitamin, 2]]|
|2022|1 |3 |02 |[apple, banana, avocado]|[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit], [medium, dark, green, fruit]]|[[fruit, 3], [green, 2]] |
|2022|2 |4 |03 |[pomelo, fig] |[[citrus, fruit, sweet], [soft, sweet]] |[[sweet, 2]] |
+----+-----+---+---+------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------
I want to groupby by year, day, and month and intersect rows containing a list, a list of lists and a list with a key and min value (the last three columns respectively). In the end, I would like this result
+----+-----+---+---+---------------------------+------------------------------------------------------------------------------------------+-----------------------------+
|year|month|day|id |intersection_list_of_fruits|intersection_collected_tokens |intersection_most_common_word|
+----+-----+---+---+---------------------------+------------------------------------------------------------------------------------------+-----------------------------+
|2022|1 |3 |01 |[apple, banana] |[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit]]|[[fruit, 2]] |
|2022|1 |3 |02 |[apple, banana] |[[apple, edible, fruit, green], [largest, herbaceous, flowering, plant, Vitamin B, fruit]]|[[fruit, 2]] |
|2022|2 |4 |03 |[pomelo, fig] |[[citrus, fruit, sweet], [soft, sweet]] |[[sweet, 2]] |
+----+-----+---+---+---------------------------+------------------------------------------------------------------------------------------+-----------------------------+
So in the column intersection_list_of_fruits
missing [orange],[avocado]
, in the column intersection_collected_tokens
missing [source, Vitamin C, fruit], [medium, dark, green, fruit]
and in the column intersection_most_common_word
missing [Vitamin, 2], [green, 2]
.
I know about array_intersect
, but I need to look at the intersection by row, and also need to use an aggregation function due to groupby - to group ids with the same date and intersect them. (I think this can be done using spark's applyInPandas function)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您可以使用
aggregate
和array_intersect
以及collect_set
来计算list_of_fruits
和collected_tokens<的交集/code> 获取
intersection_list_of_fruits
和intersection_collected_tokens
。但是,由于
intersection_most_common_word
需要考虑单词的数量。为此,most_common_word
中的收集数组并找到最小计数You can use
aggregate
andarray_intersect
, along withcollect_set
to compute the intersection onlist_of_fruits
andcollected_tokens
to obtainintersection_list_of_fruits
andintersection_collected_tokens
.However, since
intersection_most_common_word
needs to account for the count of the words. To do this,most_common_word
and find the minimum count