Merging intervals in one pass in SQL
Let's say I have a table with two columns: start and end, both integers, and the table is ordered by the first column, then the second. Each row represents an interval.
What I need is the table of merged intervals: all overlapping or adjacent intervals gobbled up into one.
It can be constructed with a JOIN query, but that is quadratic in the number of rows, which is 4 million rows in my case (I decided to compose this question because the query is still running).
It can also be done in a single pass, by running through each row and keeping track of the maximum end time - but how to do that, or something equivalent, in standard SQL? Is there any O(n) way to do it in SQL? I'm using SQLite right now; a SQLite-specific solution would also help me out this time.
From the answers to related questions (1, 2, 3, 4, 5, 6, 7, 8, 9) I can't tell whether it's possible.
Can you?
Well, here is a solution that works in MySQL (I don't know if it will work in SQLite). I think, but cannot prove, that it is O(n) (discarding the time it takes to sort the events table initially, i.e. if it is already sorted, as I think the question states).
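The answer's code block did not survive the copy. MySQL answers of this era typically did the single pass with session variables carrying a running maximum end; the lost query was presumably of roughly this shape (the table name events, the sentinel value, and the exact variable handling are all assumptions, and MySQL does not guarantee evaluation order of user variables within a SELECT):

```sql
-- Assumed schema: events(start, `end`), both integers.
SELECT MIN(start) AS start, MAX(`end`) AS `end`
FROM (
  SELECT start, `end`,
         -- new group whenever a row starts past the running maximum end
         @grp := IF(start > @max_end + 1, @grp + 1, @grp) AS grp,
         @max_end := GREATEST(@max_end, `end`) AS running_end
  FROM events
  CROSS JOIN (SELECT @grp := 0, @max_end := -9223372036854775807) AS init
  ORDER BY start, `end`
) AS runs
GROUP BY grp;
```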
In your links you have omitted one: Can I use a SQL Server CTE to merge intersecting dates?, where I present a RECURSIVE CTE solution to the overlapping-intervals problem. Recursive CTEs can be handled differently (compared to ordinary self-joins), and often perform amazingly fast.
MySQL does not have recursive CTEs. Postgres has them, Oracle has them, Microsoft has them.
Here, Querying for a 'run' of consecutive columns in Postgres is another one, with a fudge factor.
Here, Get total time interval from multiple rows if sequence not broken is yet another one.
Based on the answer to my question in the comments, I don't think my idea would have worked. Since you mentioned it can be done with joins (and I assume you know how), my idea was to minimize the number of rows to be joined by keeping only ranges that belong to distinct points, like the following:
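The query itself did not survive the copy; judging from the description below, it was presumably a two-level GROUP BY of this shape (shown here against SQLite, with the table name ranges assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE ranges(start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO ranges VALUES (?, ?)",
                 [(1, 3), (2, 3), (1, 5), (4, 6)])

# Inner select: one row per distinct end point, keeping the longest
# (earliest-starting) range for it. Outer select: the opposite, one row
# per distinct start point, keeping the longest end. Fully contained
# ranges such as (1, 3) and (2, 3) disappear.
sql = """
SELECT start, MAX("end") AS "end"
FROM (SELECT MIN(start) AS start, "end" FROM ranges GROUP BY "end")
GROUP BY start
ORDER BY start
"""
print(conn.execute(sql).fetchall())
# [(1, 5), (4, 6)]
```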
The inner select above makes sure that no end point repeats, and selects the longest start point for each end. The outer select does just the opposite. We end up with ranges that start and end at distinct points (with any fully contained/overlapped range removed).
This might have worked if the maximum range were not big. If these were dates, with at most a year between the lowest and highest date in the whole table, then there would be 365*364 ways to pick any two points, which would be an upper bound on the number of rows after the select above. Those rows could then be put in a temp table and fed to the join method you already have. But with the numbers you mentioned, the theoretical count is so huge that this attempt is irrelevant: even though the above minimizes the rows used in the calculation, there would still be too many to use in a join.
I do not know of a way to do this in ANSI SQL without joins when the RDBMS provides no other non-standard functionality. In Oracle, for example, this can easily be achieved with analytic functions. The best option in this case would be to use the above to minimize the number of rows, bring them into your application, compute the ranges in code there, and insert them back into the database.
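As a concrete illustration of the analytic-function route (not part of the original answer): SQLite itself gained window functions in version 3.25, so the merge can now be written without a self-join using the classic gaps-and-islands pattern. Table and column names are assumed, and intervals are treated as closed integer ranges where adjacent ones merge:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # requires SQLite >= 3.25 for window functions
conn.execute('CREATE TABLE intervals(start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO intervals VALUES (?, ?)",
                 [(1, 3), (2, 4), (5, 7), (10, 12)])

# Flag each row that starts past everything seen so far, turn the running
# sum of flags into a group id, then collapse each group to one interval.
sql = """
WITH flagged AS (
  SELECT start, "end",
         CASE WHEN start > 1 + COALESCE(
                MAX("end") OVER (ORDER BY start, "end"
                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
                start)
              THEN 1 ELSE 0 END AS boundary
  FROM intervals
),
grouped AS (
  SELECT start, "end",
         SUM(boundary) OVER (ORDER BY start, "end") AS grp
  FROM flagged
)
SELECT MIN(start) AS start, MAX("end") AS "end"
FROM grouped GROUP BY grp ORDER BY 1
"""
print(conn.execute(sql).fetchall())
# [(1, 7), (10, 12)]
```

Note that (2, 4) and (5, 7) merge because, as closed integer intervals, they are adjacent.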
For now, the best answer I've found is: use indexing. This brings the complexity down from quadratic to O(n log n). With a covering index, the queries turned out to be fast enough for my needs; with just an index on either the start or end column, it was slower but still OK. In each case, EXPLAIN QUERY PLAN told me that a single table scan is combined with use of the index, as expected. Finding an element in the index isn't quite O(1), but it turned out to be close enough. And building the index isn't slow, either.
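For illustration (table and column names assumed): a covering index here is simply an index over both columns, and EXPLAIN QUERY PLAN shows SQLite answering a correlated lookup from the index alone, without touching the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE intervals(start INTEGER, "end" INTEGER)')
# Index containing both columns: queries that only touch start and "end"
# can be satisfied entirely from the index.
conn.execute('CREATE INDEX idx_intervals ON intervals(start, "end")')

plan = conn.execute(
    'EXPLAIN QUERY PLAN SELECT MAX("end") FROM intervals WHERE start <= ?',
    (5,),
).fetchall()
print(plan)  # the detail column mentions "USING COVERING INDEX idx_intervals"
```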
What remains is the proof that a true O(n) algorithm can't be written in SQL.
So another answer is to write it in a different language and then apply it to a SQLite table.
There are various ways to make that work:
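The list was cut off here, but one such way can be sketched with Python's standard sqlite3 module (table and column names assumed): stream the rows out in sorted order, merge them in a single O(n) pass while tracking the maximum end seen so far, and write the result back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE intervals(start INTEGER, "end" INTEGER)')
conn.execute('CREATE TABLE merged(start INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO intervals VALUES (?, ?)",
                 [(1, 3), (2, 4), (5, 7), (10, 12)])

def merge(rows):
    """Single-pass merge of (start, end) rows sorted by start, then end.

    Intervals are closed integer ranges: [2, 4] and [5, 7] are adjacent
    and therefore merge.
    """
    cur_start = cur_end = None
    for start, end in rows:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif start <= cur_end + 1:          # overlapping or adjacent
            cur_end = max(cur_end, end)
        else:                               # gap: emit run, start a new one
            yield cur_start, cur_end
            cur_start, cur_end = start, end
    if cur_start is not None:
        yield cur_start, cur_end

rows = conn.execute('SELECT start, "end" FROM intervals ORDER BY start, "end"')
conn.executemany("INSERT INTO merged VALUES (?, ?)", merge(rows))
print(list(conn.execute("SELECT * FROM merged ORDER BY start")))
# [(1, 7), (10, 12)]
```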