PySpark: get the count of True values across boolean columns
Please let me know if this is a duplicate. I've been searching all over to try to figure out how to do this without writing a user-defined function. I have a bunch of boolean columns, each a different quality assurance flag, in a PySpark data frame. All I need to do is create a new column holding the number of these columns that are True for each row, that is, the count of QA checks the row is failing. However, I cannot, for the life of me, figure out an efficient way of doing this. Any ideas, references or links are greatly appreciated!
For instance, for one record with the above columns with the following values...
[Image: boolean column record example, https://i.sstatic.net/psa3d.png]
...I want to create a new column with a count of 2.
Have any good ideas?
Two methods come to mind that do not need user-defined functions.
I'm assuming you have a Python list with the boolean column names:
qa_tests = ['qa_flg_xy_equal', 'qa_flg_out_of_bounds_x']
and so forth.
Plan A - build, in local Python, a single column expression that is the sum of all the boolean columns cast to integers, and then hand that expression to Spark.
sum_bools is just an automatic way of writing lit(0) + col("qa_flg_xy_equal").cast("integer") + col("qa_flg_out_of_bounds_x").cast("integer") + ...
Here is how sum_bools is defined:
The rest of the code:
Plan B - we can use array columns to collect all the booleans into a single array value, keep only the True elements, and check the size of the array after the filter.
There is no need to keep all the verbose steps below; you can of course write it in a single withColumn.