根据不同事件的首次和最后一次出现过滤
我有一个带有ID,TIMESTAMP和一个活动代码的数据框。我想在第一次出现一个活动代码和另一个活动代码的最后一次出现之间删除每个ID的数据框中的行。
假设我有一个数据框,如上所述,我需要过滤第一次出现之间活性功能C和活性功能的最后一次发生f。我需要为所有trnum发生这种情况。
这是我尝试做的事情:
w = Window.partitionBy('trnum').orderBy('activitylogdate').rowsBetween(Window.unboundPreceding, Window.currentRow)
df_updated = df.withColumn("ignore_col",
sum(when(
(col("activitylogdate") >= list(df.filter((col("activityfunccode") == "c")).select(first('activitylogdate')).collect()[0])[0]) &
(col("activitylogdate") <= list(df.filter((col("activityfunccode") == "f")).select(last('activitylogdate')).collect()[0])[0])
, 1)
.otherwise(0)
).over(w)
).orderBy("activitylogdate")
这里的想法是创建一个新的列,并在以后在上述操作之后以零为零的所有行删除所有行,
这是不起作用的。它仅适用于第一个窗口,最终条件根本不起作用。有人可以帮我吗?
预期输出:
I have a dataframe with id, timestamp and an activity-code. I want to remove the rows in the data frame for each id between the first occurrence of one activity-code and the last occurrence of another activity-code.
Let's say I have a dataframe as shown above and I need to filter all activities between the first occurrence of activity-function c and the last occurrence of activity-function f. I need this to happen for all the trnum.
Here's what I tried doing:
w = Window.partitionBy('trnum').orderBy('activitylogdate').rowsBetween(Window.unboundPreceding, Window.currentRow)
df_updated = df.withColumn("ignore_col",
sum(when(
(col("activitylogdate") >= list(df.filter((col("activityfunccode") == "c")).select(first('activitylogdate')).collect()[0])[0]) &
(col("activitylogdate") <= list(df.filter((col("activityfunccode") == "f")).select(last('activitylogdate')).collect()[0])[0])
, 1)
.otherwise(0)
).over(w)
).orderBy("activitylogdate")
The idea here is to create a new column and remove all the rows later with value non zero after the above operation
This doesn't work. It only works for the first window and the end condition doesn't work at all. Can someone please help me?
Expected output:
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您将必须为此创建多个窗口。以下代码假定您在每个分区中都有C和F。
设置:
脚本:
You will have to create several windows for this. The following code assumes that you have both, c and f, in every partition.
Setup:
Script: