Pyspark有条件运行的窗口
我有一个看起来像这样的数据框,
starttime | endtime |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 |
我想添加另一列是ROWNUMBER/等级,如果开始时间和末日之间的差异大于1秒,则会增加。
因此,预期的结果应该看起来像这样
starttime | endtime | group |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1 |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 | 2 |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 | 3 |
I have a dataframe which looks like this
starttime | endtime |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 |
I want to add another column which is a rownumber/rank which increments if difference between starttime and endtime is greater than 1 second.
So, the expected result should look something like this
starttime | endtime | group |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1 |
2022-01-01 03:25:53 | 2022-01-01 03:25:52 | 1 |
2022-01-01 03:25:55 | 2022-01-01 03:25:54 | 2 |
2022-01-01 03:25:57 | 2022-01-01 03:25:57 | 3 |
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
据我了解,条目将按启动时间订购,而当启动时间和下一个终端时间之间的差异大于第二个时,我们需要组+1。
因此,根据上述要求,我们可以使用滞后和运行总和实现相同
From what I understand the entries will be ordered by starttime and when difference between starttime and next endtime is greater than second we need group +1.
So based on above requirement we can use lag and running sum to achieve the same