Pyspark有条件运行的窗口

发布于 2025-02-12 08:21:43 字数 725 浏览 2 评论 0原文

我有一个看起来像这样的数据框，

starttime           |  endtime            |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 |

我想添加另一列是ROWNUMBER/等级，如果开始时间和末日之间的差异大于1秒，则会增加。

因此，预期的结果应该看起来像这样

starttime           |  endtime            | group |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 | 2     |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 | 3     |

原文

I have a dataframe which looks like this

starttime           |  endtime            |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 |

I want to add another column which is a rownumber/rank which increments if difference between starttime and endtime is greater than 1 second.

So, the expected result should look something like this

starttime           |  endtime            | group |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1     |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 | 2     |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 | 3     |

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

扬花落满肩 2025-02-19 08:21:43

据我了解，条目将按启动时间订购，而当启动时间和下一个终端时间之间的差异大于第二个时，我们需要组+1。

因此，根据上述要求，我们可以使用滞后和运行总和实现相同

test=spark.createDataFrame([("2022-01-01 03:25:53","2022-01-01 03:25:52"),("2022-01-01 03:25:53","2022-01-01 03:25:52"),("2022-01-01 03:25:53","2022-01-01 03:25:52"),("2022-01-01 03:25:55","2022-01-01 03:25:54"),("2022-01-01 03:25:57","2022-01-01 03:25:57")],"starttime string,endtime string")

test.withColumn("starttime",to_timestamp("starttime")).withColumn("endtime",to_timestamp("starttime")).\
withColumn("diff",when(unix_timestamp("endtime")-unix_timestamp(lag("starttime",1).over(Window.orderBy("starttime")))>1,1).otherwise(0)).\
withColumn("group",1+sum("diff").over(Window.orderBy("starttime"))).\
drop("diff").show()

#output
+-------------------+-------------------+-----+
|          starttime|            endtime|group|
+-------------------+-------------------+-----+
|2022-01-01 03:25:53|2022-01-01 03:25:53|    1|
|2022-01-01 03:25:53|2022-01-01 03:25:53|    1|
|2022-01-01 03:25:53|2022-01-01 03:25:53|    1|
|2022-01-01 03:25:55|2022-01-01 03:25:55|    2|
|2022-01-01 03:25:57|2022-01-01 03:25:57|    3|
+-------------------+-------------------+-----+

From what I understand the entries will be ordered by starttime and when difference between starttime and next endtime is greater than second we need group +1.

So based on above requirement we can use lag and running sum to achieve the same

test=spark.createDataFrame([("2022-01-01 03:25:53","2022-01-01 03:25:52"),("2022-01-01 03:25:53","2022-01-01 03:25:52"),("2022-01-01 03:25:53","2022-01-01 03:25:52"),("2022-01-01 03:25:55","2022-01-01 03:25:54"),("2022-01-01 03:25:57","2022-01-01 03:25:57")],"starttime string,endtime string")

test.withColumn("starttime",to_timestamp("starttime")).withColumn("endtime",to_timestamp("starttime")).\
withColumn("diff",when(unix_timestamp("endtime")-unix_timestamp(lag("starttime",1).over(Window.orderBy("starttime")))>1,1).otherwise(0)).\
withColumn("group",1+sum("diff").over(Window.orderBy("starttime"))).\
drop("diff").show()

#output
+-------------------+-------------------+-----+
|          starttime|            endtime|group|
+-------------------+-------------------+-----+
|2022-01-01 03:25:53|2022-01-01 03:25:53|    1|
|2022-01-01 03:25:53|2022-01-01 03:25:53|    1|
|2022-01-01 03:25:53|2022-01-01 03:25:53|    1|
|2022-01-01 03:25:55|2022-01-01 03:25:55|    2|
|2022-01-01 03:25:57|2022-01-01 03:25:57|    3|
+-------------------+-------------------+-----+

回复收藏 0 原文

~没有更多了~