Filter rows between the first and last occurrence of different events

Published 2025-02-12 19:29:05


I have a dataframe with id, timestamp and an activity-code. I want to remove the rows in the data frame for each id between the first occurrence of one activity-code and the last occurrence of another activity-code.

[image: sample dataframe with columns trnum, activityfunccode, activitylogdate]

Let's say I have a dataframe as shown above and I need to filter all activities between the first occurrence of activity-function c and the last occurrence of activity-function f. I need this to happen for all the trnum.

Here's what I tried doing:

from pyspark.sql import Window
from pyspark.sql.functions import col, first, last, sum, when

w = Window.partitionBy('trnum').orderBy('activitylogdate').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_updated = df.withColumn("ignore_col",
    sum(when(
        (col("activitylogdate") >= df.filter(col("activityfunccode") == "c").select(first('activitylogdate')).collect()[0][0]) &
        (col("activitylogdate") <= df.filter(col("activityfunccode") == "f").select(last('activitylogdate')).collect()[0][0])
        , 1)
        .otherwise(0)
    ).over(w)
).orderBy("activitylogdate")

The idea is to create a flag column and then, in a later step, drop every row where the flag is non-zero.

This doesn't work. It only works for the first window and the end condition doesn't work at all. Can someone please help me?

Expected output:

[image: expected output table]
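To make the intended rule concrete before looking at a Spark solution, here is a plain-Python reference of the same semantics: for each trnum, drop the rows from the first "c" (inclusive) through the last "f" (inclusive). The sample tuples mirror the data used in the answer below; `filter_group` is a hypothetical helper name, not part of the question.

```python
# Plain-Python sanity check of the rule: per trnum, remove rows from the
# first "c" through the last "f". Rows are (trnum, activityfunccode,
# activitylogdate), already sorted by trnum and activitylogdate.
from itertools import groupby

rows = [
    (1234, "a", 1), (1234, "b", 2), (1234, "j", 3), (1234, "c", 4),
    (1234, "d", 5), (1234, "f", 6), (1234, "e", 7), (1234, "f", 8),
    (1234, "t", 9),
    (1235, "a", 1), (1235, "c", 2), (1235, "x", 3), (1235, "f", 4),
    (1235, "t", 5),
]

def filter_group(group):
    codes = [r[1] for r in group]
    start = codes.index("c")                       # first occurrence of "c"
    end = len(codes) - 1 - codes[::-1].index("f")  # last occurrence of "f"
    return group[:start] + group[end + 1:]

result = [r for _, g in groupby(rows, key=lambda r: r[0])
          for r in filter_group(list(g))]
print(result)
```

Running this keeps only the rows before the first "c" and after the last "f" in each trnum, which is the shape the expected output should have.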


Comments (1)

凡间太子 2025-02-19 19:29:05


You will have to create several windows for this. The following code assumes that both c and f occur in every partition.

Setup:

from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1234, "a", 1),
     (1234, "b", 2),
     (1234, "j", 3),
     (1234, "c", 4),
     (1234, "d", 5),
     (1234, "f", 6),
     (1234, "e", 7),
     (1234, "f", 8),
     (1234, "t", 9),
     (1235, "a", 1),
     (1235, "c", 2),
     (1235, "x", 3),
     (1235, "f", 4),
     (1235, "t", 5)],
    ["trnum", "activityfunccode", "activitylogdate"])

Script:

# w1: per (trnum, code), descending order -> row_number 1 is the last row
w1 = W.partitionBy("trnum", "activityfunccode").orderBy(F.desc("activitylogdate"))
# w2: per (trnum, code), ascending order -> row_number 1 is the first row
w2 = W.partitionBy("trnum", "activityfunccode").orderBy("activitylogdate")
# w3: running total per trnum in log order
w3 = W.partitionBy("trnum").orderBy("activitylogdate")

# flag the last "f" row and the first "c" row in each trnum
df = df.withColumn("f_last", F.when((F.row_number().over(w1) == 1) & (F.col("activityfunccode") == "f"), 1))
df = df.withColumn("c_first", F.when((F.row_number().over(w2) == 1) & (F.col("activityfunccode") == "c"), 1))
# cumulative sum of the flags: null before the first "c", 1 from "c" up to
# (but not including) the last "f", 2 at and after the last "f"
df = df.withColumn("group", F.sum(F.coalesce("c_first", "f_last")).over(w3))
# keep rows outside the c..f span; drop the last "f" row itself
df = df.filter(((F.col("group") != 1) | F.isnull("group")) & F.isnull("f_last"))
df = df.drop("c_first", "f_last", "group")

df.show()
# +-----+----------------+---------------+
# |trnum|activityfunccode|activitylogdate|
# +-----+----------------+---------------+
# | 1234|               a|              1|
# | 1234|               b|              2|
# | 1234|               j|              3|
# | 1234|               t|              9|
# | 1235|               a|              1|
# | 1235|               t|              5|
# +-----+----------------+---------------+