Merging intervals in one pass in SQL
Let's say I have a table with two columns: start and end, both integers, and the table is ordered by the first column, then the second. Each row represents an interval.
What I need is the table of merged intervals: all overlapping or adjacent intervals gobbled up into one.
It can be constructed with a JOIN query, but that is quadratic in the number of rows, which is 4 million rows in my case (I decided to compose this question because the query is still running).
It can also be done in a single pass, by running through each row and keeping track of the maximum end time - but how to do that, or something equivalent, in standard SQL? Is there any O(n) way to do it in SQL? I'm using SQLite right now; a SQLite-specific solution would also help me out this time.
From the answers to related questions (1, 2, 3, 4, 5, 6, 7, 8, 9) I can't tell whether it's possible.
Can you?
Well, here is a solution that works in MySQL (I don't know if it will work in SQLite). I think, but cannot prove, that it is O(n) (discarding the time it takes to sort the events table initially, i.e. if it is already sorted, as I think the question states).
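The answer's code block did not survive the copy. MySQL answers of this era typically did the single pass with session variables carrying a running maximum end; the lost query was presumably of roughly this shape (the table name events, the sentinel value, and the exact variable handling are all assumptions, and MySQL does not guarantee evaluation order of user variables within a SELECT):

```sql
-- Assumed schema: events(start, `end`), both integers.
SELECT MIN(start) AS start, MAX(`end`) AS `end`
FROM (
  SELECT start, `end`,
         -- new group whenever a row starts past the running maximum end
         @grp := IF(start > @max_end + 1, @grp + 1, @grp) AS grp,
         @max_end := GREATEST(@max_end, `end`) AS running_end
  FROM events
  CROSS JOIN (SELECT @grp := 0, @max_end := -9223372036854775807) AS init
  ORDER BY start, `end`
) AS runs
GROUP BY grp;
```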
In your links you have omitted one: Can I use a SQL Server CTE to merge intersecting dates?, where I present a RECURSIVE CTE solution to the overlapping-intervals problem. Recursive CTEs can be handled differently (compared to ordinary self-joins), and often perform amazingly fast.
MySQL does not have recursive CTEs. Postgres has them, Oracle has them, Microsoft has them.
Here, Querying for a 'run' of consecutive columns in Postgres is another one, with a fudge factor.
Here, Get total time interval from multiple rows if sequence not broken is yet another one.
Based on the answer to my question in the comments, I don't think my idea would have worked. Since you mentioned it can be done with joins (and I assume you know how), my idea was to minimize the number of rows to be joined by keeping only ranges that belong to distinct points, like the following:
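The query itself did not survive the copy; judging from the description below, it was presumably a two-level GROUP BY of this shape (shown here against SQLite, with the table name ranges assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE ranges(start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO ranges VALUES (?, ?)",
                 [(1, 3), (2, 3), (1, 5), (4, 6)])

# Inner select: one row per distinct end point, keeping the longest
# (earliest-starting) range for it. Outer select: the opposite, one row
# per distinct start point, keeping the longest end. Fully contained
# ranges such as (1, 3) and (2, 3) disappear.
sql = """
SELECT start, MAX("end") AS "end"
FROM (SELECT MIN(start) AS start, "end" FROM ranges GROUP BY "end")
GROUP BY start
ORDER BY start
"""
print(conn.execute(sql).fetchall())
# [(1, 5), (4, 6)]
```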
The inner select above makes sure that no end point repeats, and selects the longest start point for each end. The outer select does just the opposite. We end up with ranges that start and end at distinct points (with any fully contained/overlapped range removed).
This might have worked if the maximum range were not big. If these were dates, with at most a year between the lowest and highest date in the whole table, then there would be 365*364 ways to pick any two points, which would be an upper bound on the number of rows after the select above. Those rows could then be put in a temp table and fed to the join method you already have. But with the numbers you mentioned, the theoretical count is so huge that this attempt is irrelevant: even though the above minimizes the rows used in the calculation, there would still be too many to use in a join.
I do not know of a way to do this in ANSI SQL without joins when the RDBMS provides no other non-standard functionality. In Oracle, for example, this can easily be achieved with analytic functions. The best option in this case would be to use the above to minimize the number of rows, bring them into your application, compute the ranges in code there, and insert them back into the database.
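As a concrete illustration of the analytic-function route (not part of the original answer): SQLite itself gained window functions in version 3.25, so the merge can now be written without a self-join using the classic gaps-and-islands pattern. Table and column names are assumed, and intervals are treated as closed integer ranges where adjacent ones merge:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # requires SQLite >= 3.25 for window functions
conn.execute('CREATE TABLE intervals(start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO intervals VALUES (?, ?)",
                 [(1, 3), (2, 4), (5, 7), (10, 12)])

# Flag each row that starts past everything seen so far, turn the running
# sum of flags into a group id, then collapse each group to one interval.
sql = """
WITH flagged AS (
  SELECT start, "end",
         CASE WHEN start > 1 + COALESCE(
                MAX("end") OVER (ORDER BY start, "end"
                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
                start)
              THEN 1 ELSE 0 END AS boundary
  FROM intervals
),
grouped AS (
  SELECT start, "end",
         SUM(boundary) OVER (ORDER BY start, "end") AS grp
  FROM flagged
)
SELECT MIN(start) AS start, MAX("end") AS "end"
FROM grouped GROUP BY grp ORDER BY 1
"""
print(conn.execute(sql).fetchall())
# [(1, 7), (10, 12)]
```

Note that (2, 4) and (5, 7) merge because, as closed integer intervals, they are adjacent.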
For now, the best answer I've found is: use indexing. This brings the complexity down from quadratic to O(n log n). With a covering index, the queries turned out to be fast enough for my needs; with just an index on either the start or end column, it was slower but still OK. In each case, EXPLAIN QUERY PLAN told me that a single table scan is combined with use of the index, as expected. Finding an element in the index isn't quite O(1), but it turned out to be close enough. And building the index isn't slow, either.
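For illustration (table and column names assumed): a covering index here is simply an index over both columns, and EXPLAIN QUERY PLAN shows SQLite answering a correlated lookup from the index alone, without touching the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE intervals(start INTEGER, "end" INTEGER)')
# Index containing both columns: queries that only touch start and "end"
# can be satisfied entirely from the index.
conn.execute('CREATE INDEX idx_intervals ON intervals(start, "end")')

plan = conn.execute(
    'EXPLAIN QUERY PLAN SELECT MAX("end") FROM intervals WHERE start <= ?',
    (5,),
).fetchall()
print(plan)  # the detail column mentions "USING COVERING INDEX idx_intervals"
```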
What remains is the proof that a true O(n) algorithm can't be written in SQL.
So another answer is to write it in a different language and then apply it to a SQLite table.
There are various ways to make that work:
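The list was cut off here, but one such way can be sketched with Python's standard sqlite3 module (table and column names assumed): stream the rows out in sorted order, merge them in a single O(n) pass while tracking the maximum end seen so far, and write the result back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE intervals(start INTEGER, "end" INTEGER)')
conn.execute('CREATE TABLE merged(start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO intervals VALUES (?, ?)",
                 [(1, 3), (2, 4), (5, 7), (10, 12)])

def merge(rows):
    """Single-pass merge of (start, end) rows sorted by start, then end.

    Intervals are closed integer ranges: [2, 4] and [5, 7] are adjacent
    and therefore merge.
    """
    cur_start = cur_end = None
    for start, end in rows:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif start <= cur_end + 1:          # overlapping or adjacent
            cur_end = max(cur_end, end)
        else:                               # gap: emit run, start a new one
            yield cur_start, cur_end
            cur_start, cur_end = start, end
    if cur_start is not None:
        yield cur_start, cur_end

rows = conn.execute('SELECT start, "end" FROM intervals ORDER BY start, "end"')
conn.executemany("INSERT INTO merged VALUES (?, ?)", merge(rows))
print(list(conn.execute("SELECT * FROM merged ORDER BY start")))
# [(1, 7), (10, 12)]
```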