Count of unique combinations of values in selected columns
I have a PySpark DataFrame that looks like this:
----------------------------
id   A   B    C
id1  on  on   on
id1  on  off  on
id1  on  on   on
id1  on  on   on
id1  on  on   off
----------------------------
I am looking for a way to find all unique combinations for selected columns and show their count.
The expected output:
----------------------------
id   A   B    C    count
id1  on  on   on   3
id1  on  off  on   1
id1  on  on   off  1
----------------------------
I see that there is a way to do a similar operation in Pandas, but I need PySpark.
UPD:
Also, please note that a unique combination of columns A and B is not the same as a combination of A, B, C. I want all possible combinations of every column. Is there a way to achieve this without grouping by and counting one combination, then another combination, and so on? There are more than 10 columns.
cube can do it, but it returns ALL combinations, including those where some columns were not taken into account, so you will have to filter afterwards. This would count occurrences just in the specified columns:
Solution:
Apart from the cube function, there is also the rollup function.
cube: takes a list of columns and applies the aggregate expressions to all possible combinations of the grouping columns.
rollup: similar to cube, but computes hierarchical subtotals from left to right.
In SQL, GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically, filling the grouping columns from left to right.
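To make the difference concrete, here is a plain-Python sketch (no Spark involved) that lists which column subsets each operator would group by. The helper names cube_sets and rollup_sets are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def cube_sets(cols):
    """Every subset of cols, from the full list down to the empty set (grand total)."""
    return [list(c) for r in range(len(cols), -1, -1) for c in combinations(cols, r)]


def rollup_sets(cols):
    """Only the left-to-right prefixes of cols, ending with the empty set."""
    return [list(cols[:r]) for r in range(len(cols), -1, -1)]


cols = ["A", "B", "C"]
print(cube_sets(cols))    # 8 grouping sets, including ['B', 'C'] and ['A', 'C']
print(rollup_sets(cols))  # 4 grouping sets: ['A', 'B', 'C'], ['A', 'B'], ['A'], []
```

For n columns this means 2^n grouping sets for cube and n + 1 for rollup, so with 10 columns cube produces 1024 grouping sets while rollup produces 11. Since rollup skips subsets such as B, C, it does not cover every combination asked for in the question's UPD; cube does.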