PySpark conditional running window

I have a dataframe which looks like this

starttime           |  endtime            |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 |

I want to add another column, a row number/rank, which increments whenever a row's starttime is more than 1 second after the previous row's endtime.

So the expected result should look something like this:

starttime           |  endtime            | group |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 | 2     |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 | 3     |

Comments (1)

扬花落满肩 2025-02-19 08:21:43

From what I understand, the rows are ordered by starttime, and the group number should increase by 1 whenever a row's starttime is more than one second after the previous row's endtime.

Based on that requirement, we can use lag together with a running sum to achieve this:

from pyspark.sql import Window
from pyspark.sql.functions import to_timestamp, unix_timestamp, lag, when, sum  # pyspark's sum shadows the builtin

test = spark.createDataFrame([("2022-01-01 03:25:53", "2022-01-01 03:25:52"),
                              ("2022-01-01 03:25:53", "2022-01-01 03:25:52"),
                              ("2022-01-01 03:25:53", "2022-01-01 03:25:52"),
                              ("2022-01-01 03:25:55", "2022-01-01 03:25:54"),
                              ("2022-01-01 03:25:57", "2022-01-01 03:25:57")],
                             "starttime string,endtime string")

# ordering without partitionBy moves every row to a single partition
w = Window.orderBy("starttime")

test.withColumn("starttime", to_timestamp("starttime")).\
    withColumn("endtime", to_timestamp("endtime")).\
    withColumn("diff",  # 1 when this row starts more than 1 second after the previous row ended
               when(unix_timestamp("starttime") - unix_timestamp(lag("endtime", 1).over(w)) > 1, 1).otherwise(0)).\
    withColumn("group", 1 + sum("diff").over(w)).\
    drop("diff").show()

# output
+-------------------+-------------------+-----+
|          starttime|            endtime|group|
+-------------------+-------------------+-----+
|2022-01-01 03:25:53|2022-01-01 03:25:52|    1|
|2022-01-01 03:25:53|2022-01-01 03:25:52|    1|
|2022-01-01 03:25:53|2022-01-01 03:25:52|    1|
|2022-01-01 03:25:55|2022-01-01 03:25:54|    2|
|2022-01-01 03:25:57|2022-01-01 03:25:57|    3|
+-------------------+-------------------+-----+
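
For reference, the same logic can also be expressed in Spark SQL. The sketch below is illustrative only: it assumes the dataframe has been registered as a temp view, and the names events and grp are hypothetical (grp avoids clashing with the GROUP keyword):

test.createOrReplaceTempView("events")  # "events" is a hypothetical view name
spark.sql("""
    SELECT starttime, endtime,
           1 + SUM(diff) OVER (ORDER BY starttime) AS grp
    FROM (
        SELECT starttime, endtime,
               CASE WHEN unix_timestamp(starttime)
                         - unix_timestamp(LAG(endtime) OVER (ORDER BY starttime)) > 1
                    THEN 1 ELSE 0 END AS diff  -- flag rows that start a new group
        FROM events
    ) t
""").show()

Either way, this is the classic gaps-and-islands pattern: lag flags each row that starts a new island, and a cumulative sum of the flags numbers the islands. Since an ordered window with no partitionBy pulls all rows into a single partition, on a large dataset you would want to add a partition key (for example a user or device id) to both windows if the groups only need to be distinct within that key.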
