计数窗口中过滤值的出现

发布于 2025-02-11 20:38:35 字数 1368 浏览 0 评论 0原文

我有以下数据框:

|     Timestamp     |   info   |
+-------------------+----------+
|2016-01-01 17:54:30|     8    |
|2016-02-01 12:16:18|     2    |
|2016-03-01 12:17:57|     1    |
|2016-04-01 10:05:21|     2    |
|2016-05-11 18:58:25|     7    |
|2016-06-11 11:18:29|     6    |
|2016-07-01 12:05:21|     3    |
|2016-08-11 11:58:25|     2    |
|2016-09-11 15:18:29|     9    |

我想创建一个名为 count 的新列,该列在窗口(-2,0)(当前行和前两个)中计数多少值是> 5(在我无法执行操作的前两个行中,我会输入0)。

最终的表应该是:

|     Timestamp     |   info   |   count  |
+-------------------+----------+----------+
|2016-01-01 17:54:30|     8    |     0    |
|2016-02-01 12:16:18|     2    |     0    |
|2016-03-01 12:17:57|     1    |     1    |
|2016-04-01 10:05:21|     2    |     0    |
|2016-05-11 18:58:25|     7    |     1    |
|2016-06-11 11:18:29|     6    |     2    |
|2016-07-01 12:05:21|     3    |     2    |
|2016-08-11 11:58:25|     2    |     1    |
|2016-09-11 15:18:29|     9    |     1    |

我尝试这样做,但它不起作用:

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn("count", F.when((F.count("info").over(w) > 5), F.count("info").over(w) > 5).otherwise(0))

I have the following dataframe:

|     Timestamp     |   info   |
+-------------------+----------+
|2016-01-01 17:54:30|     8    |
|2016-02-01 12:16:18|     2    |
|2016-03-01 12:17:57|     1    |
|2016-04-01 10:05:21|     2    |
|2016-05-11 18:58:25|     7    |
|2016-06-11 11:18:29|     6    |
|2016-07-01 12:05:21|     3    |
|2016-08-11 11:58:25|     2    |
|2016-09-11 15:18:29|     9    |

I would like to create a new column named count which counts in a window(-2, 0) (current row and previous two) how many values are > 5 (in the first two rows where I cannot perform the operation I would put 0).

The resulting table should be:

|     Timestamp     |   info   |   count  |
+-------------------+----------+----------+
|2016-01-01 17:54:30|     8    |     0    |
|2016-02-01 12:16:18|     2    |     0    |
|2016-03-01 12:17:57|     1    |     1    |
|2016-04-01 10:05:21|     2    |     0    |
|2016-05-11 18:58:25|     7    |     1    |
|2016-06-11 11:18:29|     6    |     2    |
|2016-07-01 12:05:21|     3    |     2    |
|2016-08-11 11:58:25|     2    |     1    |
|2016-09-11 15:18:29|     9    |     1    |

I tried to do this but it didn't work:

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn("count", F.when((F.count("info").over(w) > 5), F.count("info").over(w) > 5).otherwise(0))

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

笑着哭最痛 2025-02-18 20:38:36

如果您不介意针对前2行执行的计算,则以下内容将有效。

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn('count', F.count(F.when(F.col('info') > 5, 1)).over(w))

df_input.show()
# +-------------------+----+-----+
# |          Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30|   8|    1|
# |2016-02-01 12:16:18|   2|    1|
# |2016-03-01 12:17:57|   1|    1|
# |2016-04-01 10:05:21|   2|    0|
# |2016-05-11 18:58:25|   7|    1|
# |2016-06-11 11:18:29|   6|    2|
# |2016-07-01 12:05:21|   3|    2|
# |2016-08-11 11:58:25|   2|    1|
# |2016-09-11 15:18:29|   9|    1|
# +-------------------+----+-----+

如果您需要2排为0,而无需更改窗口,则可以在条件时使用此

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn(
    'count',
    F.when(F.size(F.collect_list('info').over(w)) == 3, F.count(F.when(F.col('info') > 5, 1)).over(w))
     .otherwise(0)
)
df_input.show()
# +-------------------+----+-----+
# |          Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30|   8|    0|
# |2016-02-01 12:16:18|   2|    0|
# |2016-03-01 12:17:57|   1|    1|
# |2016-04-01 10:05:21|   2|    0|
# |2016-05-11 18:58:25|   7|    1|
# |2016-06-11 11:18:29|   6|    2|
# |2016-07-01 12:05:21|   3|    2|
# |2016-08-11 11:58:25|   2|    1|
# |2016-09-11 15:18:29|   9|    1|
# +-------------------+----+-----+

The following would work if you don't mind calculations performed for the first 2 rows.

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn('count', F.count(F.when(F.col('info') > 5, 1)).over(w))

df_input.show()
# +-------------------+----+-----+
# |          Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30|   8|    1|
# |2016-02-01 12:16:18|   2|    1|
# |2016-03-01 12:17:57|   1|    1|
# |2016-04-01 10:05:21|   2|    0|
# |2016-05-11 18:58:25|   7|    1|
# |2016-06-11 11:18:29|   6|    2|
# |2016-07-01 12:05:21|   3|    2|
# |2016-08-11 11:58:25|   2|    1|
# |2016-09-11 15:18:29|   9|    1|
# +-------------------+----+-----+

If you need 2 first rows to be 0, without changing the window, you can use this when condition:

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn(
    'count',
    F.when(F.size(F.collect_list('info').over(w)) == 3, F.count(F.when(F.col('info') > 5, 1)).over(w))
     .otherwise(0)
)
df_input.show()
# +-------------------+----+-----+
# |          Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30|   8|    0|
# |2016-02-01 12:16:18|   2|    0|
# |2016-03-01 12:17:57|   1|    1|
# |2016-04-01 10:05:21|   2|    0|
# |2016-05-11 18:58:25|   7|    1|
# |2016-06-11 11:18:29|   6|    2|
# |2016-07-01 12:05:21|   3|    2|
# |2016-08-11 11:58:25|   2|    1|
# |2016-09-11 15:18:29|   9|    1|
# +-------------------+----+-----+
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文