Sort an RDD by two values and get the top 10 per group
Suppose I have the following RDD in pyspark, where each row is a list:
[foo, apple]
[foo, orange]
[foo, apple]
[foo, apple]
[foo, grape]
[foo, grape]
[foo, plum]
[bar, orange]
[bar, orange]
[bar, orange]
[bar, grape]
[bar, apple]
[bar, apple]
[bar, plum]
[scrog, apple]
[scrog, apple]
[scrog, orange]
[scrog, orange]
[scrog, grape]
[scrog, plum]
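(For concreteness, a minimal sketch of how this sample might be built as an RDD, assuming an existing SparkContext named sc, could look like the following.)

# Assumes an existing SparkContext `sc`; the rows mirror the list above.
data = [
    ["foo", "apple"], ["foo", "orange"], ["foo", "apple"], ["foo", "apple"],
    ["foo", "grape"], ["foo", "grape"], ["foo", "plum"],
    ["bar", "orange"], ["bar", "orange"], ["bar", "orange"], ["bar", "grape"],
    ["bar", "apple"], ["bar", "apple"], ["bar", "plum"],
    ["scrog", "apple"], ["scrog", "apple"], ["scrog", "orange"],
    ["scrog", "orange"], ["scrog", "grape"], ["scrog", "plum"],
]
rdd = sc.parallelize(data)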
I would like to show the top 3 fruits (index 1) for each group (index 0), ordered by the count of each fruit. Suppose, for the sake of simplicity, that I don't care much about ties (e.g. scrog has count 1 for both grape and plum; I don't care which is chosen).
My goal is output like:
foo, apple, 3
foo, grape, 2
foo, orange, 1
bar, orange, 3
bar, apple, 2
bar, plum, 1 # <------- NOTE: could also be "grape" of count 1
scrog, orange, 2 # <---------- NOTE: "scrog" has many ties, which is okay
scrog, apple, 2
scrog, grape, 1
I can think of a likely inefficient approach (sketched after this list):
- get the unique groups and .collect() them as a list
- filter the full rdd by group, count and sort the fruits
- use something like zipWithIndex() to get the top 3 counts
- save as a new RDD with format (<group>, <fruit>, <count>)
- union all the RDDs at the end
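A rough sketch of that collect-filter-union approach, assuming the rdd built above (all helper names here are illustrative), might look like:

# Inefficient sketch: one set of Spark operations per group, then a union at the end.
groups = rdd.map(lambda row: row[0]).distinct().collect()

per_group = []
for g in groups:
    counted = (rdd.filter(lambda row, g=g: row[0] == g)
                  .map(lambda row: (row[1], 1))
                  .reduceByKey(lambda a, b: a + b)
                  .sortBy(lambda kv: kv[1], ascending=False))
    top3 = (counted.zipWithIndex()
                   .filter(lambda pair: pair[1] < 3)
                   .map(lambda pair, g=g: (g, pair[0][0], pair[0][1])))
    per_group.append(top3)

result = sc.union(per_group)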
But I'm interested not only in more Spark-specific approaches, but also in ones that might skip expensive actions like collect() and zipWithIndex().
As a bonus -- but not required -- if I did want to apply sorting/filtering to address ties, where might that best be accomplished?
Any advice much appreciated!
UPDATE: in my context, unable to use dataframes; must use RDDs only.
Comments (2)
map and reduceByKey operations in pyspark
Sum the counts with .reduceByKey, group the groups with .groupByKey, select the top 3 of each group with .map and heapq.nlargest.
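A minimal sketch of that pipeline (assuming the [group, fruit] rows are in an RDD named rdd, as in the question) might look like:

from heapq import nlargest

# Count each (group, fruit) pair, regroup by group, then keep the 3 largest counts.
top3 = (rdd.map(lambda row: ((row[0], row[1]), 1))
           .reduceByKey(lambda a, b: a + b)                # ((group, fruit), count)
           .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))  # (group, (fruit, count))
           .groupByKey()                                   # (group, iterable of (fruit, count))
           .map(lambda kv: (kv[0], nlargest(3, kv[1], key=lambda fc: fc[1]))))

for group, fruits in top3.collect():
    for fruit, count in fruits:
        print(group, fruit, count)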
Standard python
For comparison, if you have a simple python list instead of an rdd, the easiest way to do grouping in python is with dictionaries:
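A sketch of that dictionary-based grouping (assuming the rows are in a plain Python list named data, as constructed earlier) might be:

from heapq import nlargest

# Build a nested dict of counts: {group: {fruit: count}}.
counts = {}
for group, fruit in data:
    by_fruit = counts.setdefault(group, {})
    by_fruit[fruit] = by_fruit.get(fruit, 0) + 1

# For each group, keep the 3 fruits with the highest counts.
for group, by_fruit in counts.items():
    for fruit, count in nlargest(3, by_fruit.items(), key=lambda fc: fc[1]):
        print(group, fruit, count)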