Get a list of start and end values from a table of datetimes
Currently I have a table built up like this:
DeviceID Timestamp Value
----------------------------------------
Device1 1.1.2011 10:00:00 3
Device1 1.1.2011 10:00:01 4
Device1 1.1.2011 10:00:02 4
Device1 1.1.2011 10:00:04 3
Device1 1.1.2011 10:00:05 4
Device1 1.1.2011 14:23:14 8
Device1 1.1.2011 14:23:15 7
Device1 1.1.2011 14:23:17 4
Device1 1.1.2011 14:23:18 2
As you can see, values come in from a device with a given timestamp (the column type is datetime).
The problem is that the device can be started and stopped at any point, and there is no direct information in the data that a start or stop has occurred. But from the list of timestamps it is quite easy to tell when a start and stop occurred, because whenever the timestamps of two rows are within five seconds of each other, they belong to the same measurement.
Now I'd like to get a list like this out of the data:
DeviceID Started Ended
Device1 1.1.2011 10:00:00 1.1.2011 10:00:05
Device1 1.1.2011 14:23:14 1.1.2011 14:23:18
So, any ideas how to do this quickly? All I can think of is using some kind of cursor and comparing each datetime pair by hand. But I think this will get really slow, because we have to inspect each value in each row.
So is there a better SQL solution that works without cursors?
Update
So far I have tested all the given answers. From reading them, they all look good and take some interesting approaches. Unfortunately, all of them (so far) failed on the real data. The biggest problem seems to be the volume of data (currently there are roughly 3.5 million entries in the table). Running a given query on a small subset leads to the expected results, but running it against the whole table leads to very bad performance.
I have to test further and examine whether I can chunk the data and pass only part of it to one of these algorithms to get this thing rolling. But maybe one of you has another smart idea to get the results a little faster.
Update (more information about the structure)
Okay, this information might help, too:
Currently there are roughly 3.5 million entries in the table. Here are the column types and indexes:
- _ID
  - int
  - Primary Key
  - Grouped Index
  - not mentioned in my example, because it isn't needed for this query
- DeviceID
  - int
  - not null
  - Index
- Timestamp
  - datetime
  - not null
  - Index
- Value
  - several non-indexed columns of different types (int, real, tinyint)
  - all can be null
Maybe this helps to improve your existing (or new) solutions to the given problem.
Comments (7)
I've played around with some datatypes and names (just because I can, and because timestamp is a reserved word), and can get your requested result using your sample data.
Sample data:
and now the query:
The StartPeriods Common Table Expression (CTE) finds those rows from the Measures table that don't have a previous row within 5 seconds of them. The ExtendPeriods CTE then recursively extends these periods by finding new rows from Measures that occur up to 5 seconds after the current end of the found period. We then find the rows where the end of the period was as far away from the start as possible.
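The query from this answer is not reproduced above, so the following is only a sketch of the approach it describes, not the author's code. It assumes SQLite (driven from Python's standard sqlite3 module so it can be run as-is), a hypothetical table `measures(device_id, ts)` with ISO-8601 text timestamps, and CTE names `start_periods` and `extend_periods` modeled on the description. On another engine the date arithmetic would have to use that engine's own functions.

```python
import sqlite3

# Hypothetical table and column names; timestamps stored as ISO-8601 text.
SAMPLE = [
    ("Device1", "2011-01-01 10:00:00"),
    ("Device1", "2011-01-01 10:00:01"),
    ("Device1", "2011-01-01 10:00:02"),
    ("Device1", "2011-01-01 10:00:04"),
    ("Device1", "2011-01-01 10:00:05"),
    ("Device1", "2011-01-01 14:23:14"),
    ("Device1", "2011-01-01 14:23:15"),
    ("Device1", "2011-01-01 14:23:17"),
    ("Device1", "2011-01-01 14:23:18"),
]

RECURSIVE_QUERY = """
WITH RECURSIVE
start_periods AS (
    -- rows that have no earlier row within 5 seconds
    SELECT m.device_id, m.ts AS started
    FROM measures m
    WHERE NOT EXISTS (
        SELECT 1 FROM measures p
        WHERE p.device_id = m.device_id
          AND p.ts < m.ts
          AND p.ts >= datetime(m.ts, '-5 seconds')
    )
),
extend_periods(device_id, started, ended) AS (
    SELECT device_id, started, started FROM start_periods
    UNION  -- UNION (not UNION ALL) drops duplicate paths to the same end
    SELECT e.device_id, e.started, m.ts
    FROM extend_periods e
    JOIN measures m
      ON m.device_id = e.device_id
     AND m.ts > e.ended
     AND m.ts <= datetime(e.ended, '+5 seconds')
)
-- keep the end that lies furthest from each start
SELECT device_id, started, MAX(ended) AS ended
FROM extend_periods
GROUP BY device_id, started
ORDER BY device_id, started
"""


def find_periods(rows):
    """Load rows into an in-memory table and return (device, started, ended)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE measures (device_id TEXT, ts TEXT)")
    con.executemany("INSERT INTO measures VALUES (?, ?)", rows)
    result = con.execute(RECURSIVE_QUERY).fetchall()
    con.close()
    return result


for period in find_periods(SAMPLE):
    print(period)
```

The recursion visits every row once per reachable end, so on millions of rows it is likely to be slow without an index covering the device and timestamp columns together.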
Try this:
The basic idea for the below solution has been borrowed from this answer.
Try this, though I'm not sure how well it will perform with lots of data.
My table is called TestTime and the column is called TS, so tweak it for your table. I've used NOT EXISTS to check for a timestamp earlier than the current record and within 5 seconds of it, so a row is displayed only if no such record is found, i.e. it is a start time (or the first record in the table). For each start, the query then looks for the lowest timestamp that is >= that start timestamp (in case it's a single entry, so a start/stop one), again using NOT EXISTS to check for a record that is later and within 5 seconds, so a row qualifies as a stop only if no such record is found (and only the first such row is taken). You can probably tweak and improve this, but it might be a good basis.
Note that if the device is still running, it will list the last time found as the stop time for the last start event.
I haven't put a device name in here, for simplicity, so you would need to put that in the StopTime and WHERE clauses.
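The original query did not survive on this page, so here is a rough sketch of the NOT EXISTS idea described above, not the author's code. It assumes SQLite run through Python's sqlite3 module, and a hypothetical table `measures(device_id, ts)` in place of TestTime/TS, with the device column already included in the stop-time subquery and the WHERE clause.

```python
import sqlite3

# Hypothetical table and column names; timestamps stored as ISO-8601 text.
SAMPLE = [
    ("Device1", "2011-01-01 10:00:00"),
    ("Device1", "2011-01-01 10:00:01"),
    ("Device1", "2011-01-01 10:00:02"),
    ("Device1", "2011-01-01 10:00:04"),
    ("Device1", "2011-01-01 10:00:05"),
    ("Device1", "2011-01-01 14:23:14"),
    ("Device1", "2011-01-01 14:23:15"),
    ("Device1", "2011-01-01 14:23:17"),
    ("Device1", "2011-01-01 14:23:18"),
]

NOT_EXISTS_QUERY = """
SELECT s.device_id,
       s.ts AS started,
       -- stop time: the first row at or after the start
       -- that has no later row within 5 seconds
       (SELECT MIN(e.ts)
        FROM measures e
        WHERE e.device_id = s.device_id
          AND e.ts >= s.ts
          AND NOT EXISTS (
              SELECT 1 FROM measures n
              WHERE n.device_id = e.device_id
                AND n.ts > e.ts
                AND n.ts <= datetime(e.ts, '+5 seconds')
          )
       ) AS ended
FROM measures s
-- start time: a row with no earlier row within 5 seconds
WHERE NOT EXISTS (
    SELECT 1 FROM measures p
    WHERE p.device_id = s.device_id
      AND p.ts < s.ts
      AND p.ts >= datetime(s.ts, '-5 seconds')
)
ORDER BY s.device_id, s.ts
"""


def find_start_stop(rows):
    """Load rows into an in-memory table and return (device, started, ended)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE measures (device_id TEXT, ts TEXT)")
    con.executemany("INSERT INTO measures VALUES (?, ?)", rows)
    result = con.execute(NOT_EXISTS_QUERY).fetchall()
    con.close()
    return result


for row in find_start_stop(SAMPLE):
    print(row)
```

A measurement consisting of a single row comes back with the same value for started and ended, which matches the single-entry case mentioned above.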
You should be able to use window functions for this (assuming 15 minutes defines a new session below):
Depending on your where clause, you might want to remove the coalesce/true part since the first/last row fetched may become invalid.
If you need only the boundaries, you could use the above in a subquery with:
group by DeviceId, session_begins, session_ends having session_begins or session_ends
Also, if you do this, don't forget to put the where clause in the subquery, rather than the main one, else you'll end up doing a seq scan on the whole table because of the window aggregate.
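The window-function query itself is missing from this page. As a hedged illustration of the idea, and not a reconstruction of the original, the sketch below flags each row that starts a new session with LAG, numbers the sessions with a running sum, and then takes MIN/MAX per session. It assumes SQLite 3.25 or later (for window functions) through Python's sqlite3 module, a hypothetical table `measures(device_id, ts)`, and the 5-second gap from the question rather than 15 minutes.

```python
import sqlite3

# Hypothetical table and column names; timestamps stored as ISO-8601 text.
SAMPLE = [
    ("Device1", "2011-01-01 10:00:00"),
    ("Device1", "2011-01-01 10:00:01"),
    ("Device1", "2011-01-01 10:00:02"),
    ("Device1", "2011-01-01 10:00:04"),
    ("Device1", "2011-01-01 10:00:05"),
    ("Device1", "2011-01-01 14:23:14"),
    ("Device1", "2011-01-01 14:23:15"),
    ("Device1", "2011-01-01 14:23:17"),
    ("Device1", "2011-01-01 14:23:18"),
]

WINDOW_QUERY = """
WITH flagged AS (
    -- is_start = 1 when there is no previous row or it is more than 5 s back
    SELECT device_id, ts,
           CASE WHEN ts <= datetime(
                         LAG(ts) OVER (PARTITION BY device_id ORDER BY ts),
                         '+5 seconds')
                THEN 0 ELSE 1 END AS is_start
    FROM measures
),
numbered AS (
    -- the running sum of start flags gives every session its own number
    SELECT device_id, ts,
           SUM(is_start) OVER (PARTITION BY device_id ORDER BY ts) AS session_no
    FROM flagged
)
SELECT device_id, MIN(ts) AS started, MAX(ts) AS ended
FROM numbered
GROUP BY device_id, session_no
ORDER BY device_id, started
"""


def find_sessions(rows):
    """Load rows into an in-memory table and return (device, started, ended)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE measures (device_id TEXT, ts TEXT)")
    con.executemany("INSERT INTO measures VALUES (?, ?)", rows)
    result = con.execute(WINDOW_QUERY).fetchall()
    con.close()
    return result


for session in find_sessions(SAMPLE):
    print(session)
```

Unlike the self-join variants, this reads each row a fixed number of times after one sort per device, which is why window functions tend to be the better fit for large tables when the database supports them.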