How to remove duplicates and nulls if a non-null value exists in the group?
The following DataFrame should be filtered based on the flag column. If a group defined by the columns id and cod has no row whose flag is different from None, keep just one row for that group; otherwise, remove the rows whose flag is None and drop the remaining duplicates.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, max

spark = SparkSession.builder.appName('Vazio').getOrCreate()
data = [('1', 10, 'A'),
('1', 10, 'A'),
('1', 10, None),
('1', 15, 'A'),
('1', 15, None),
('2', 11, 'A'),
('2', 11, 'C'),
('2', 12, 'B'),
('2', 12, 'B'),
('2', 12, 'C'),
('2', 12, 'C'),
('2', 13, None),
('3', 14, None),
('3', 14, None),
('3', 15, None),
('4', 21, 'A'),
('4', 21, 'B'),
('4', 21, 'C'),
('4', 21, 'C')]
df = spark.createDataFrame(data=data, schema=['id', 'cod', 'flag'])
df.show()
How could I obtain the following DataFrame from the one above using PySpark?
+---+---+----+
| id|cod|flag|
+---+---+----+
| 1| 10| A|
| 1| 15| A|
| 2| 11| A|
| 2| 11| C|
| 2| 12| B|
| 2| 12| C|
| 2| 13|null|
| 3| 14|null|
| 3| 15|null|
| 4| 21| A|
| 4| 21| C|
+---+---+----+
This could be a solution with PySpark, adapted from this answer (Spark).
Answer to the edited question:
Note that the group (4, 21) contains all 3 flags: A, B, C. The description of the question doesn't mention keeping just 2 values, but in the example the B was somehow deleted, so I assumed that was a mistake.
Answer to the original question:
You can't just remove duplicates based on the columns id and cod, as there is no guarantee that the row kept for each group would have a non-null flag.