SQL查询挑战

发布于 2024-08-06 23:45:15 字数 1189 浏览 8 评论 0原文

所以这是另一个“向 X 写入查询”挑战。

我正在监控许多联网的自动售货机。每台机器都有许多部件，例如钞票接收器、投币系统、打印机等。

机器部件的问题记录在表中，我们称之为“故障”，看起来像这样（省略不相关的字段）：

machineid           partid         start_time            end_time
---------           ------         ----------------      ----------------
       1                2          2009-10-05 09:00      NULL
       1                3          2009-10-05 08:00      2009-10-05 10:00
       2                2          2009-09-30 12:00      2009-09-30 14:00
       3                4          2009-09-28 13:00      2009-09-28 15:00
       3                2          2009-09-28 12:00      2009-09-28 14:00

如果问题当前正在进行，则 end_date 为 NULL。

我需要一个查询来显示整个机器停机的时间段，并且可以考虑重叠范围，将它们折叠成单个记录。因此，对于上面的示例数据，它将产生：

machineid          start_time            end_time
---------          ----------------      ----------------
       1           2009-10-05 08:00      NULL
       2           2009-09-30 12:00      2009-09-30 14:00
       3           2009-09-28 12:00      2009-09-28 15:00

编写过程代码来逐行执行此操作并不困难，但一个漂亮的声明性 SQL 查询会更有用、更优雅。看起来这应该是可能的，但我只是无法完全到达那里。

SQL 方言是 Oracle。如果有帮助的话，可以使用分析函数。

谢谢！

原文

So here's yet another 'write a query to X' challenge.

I'm monitoring a number of networked vending machines. Each machine has a number of parts, e.g. bank note acceptor, coin system, printer and so on.

Problems with machine parts are logged in table, let's call it 'faults', which looks something like this (irrelevant fields omitted):

machineid           partid         start_time            end_time
---------           ------         ----------------      ----------------
       1                2          2009-10-05 09:00      NULL
       1                3          2009-10-05 08:00      2009-10-05 10:00
       2                2          2009-09-30 12:00      2009-09-30 14:00
       3                4          2009-09-28 13:00      2009-09-28 15:00
       3                2          2009-09-28 12:00      2009-09-28 14:00

end_date is NULL if the problem is currently ongoing.

I need a query which show time periods for which the machine as a whole is down, and which can account for overlapping ranges, collapsing them down into a single record. So for the sample data above, it would produce:

machineid          start_time            end_time
---------          ----------------      ----------------
       1           2009-10-05 08:00      NULL
       2           2009-09-30 12:00      2009-09-30 14:00
       3           2009-09-28 12:00      2009-09-28 15:00

It's not tough to write procedural code to do this line by line, but a nice declarative SQL query would be more useful, more elegant. It seems like it ought to be possible, I just can't quite get there though.

SQL dialect is Oracle. Analytic functions are availabe if that would help.

Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

酸甜透明夹心 2024-08-13 23:45:15

使用分析，您可以构建一个对数据进行单次传递的查询（对于大型数据集，这将是最有效的）：

SELECT machineid, MIN(start_time), MAX(end_time)
  FROM (SELECT machineid, start_time, end_time, 
               SUM(gap) over(PARTITION BY machineid 
                             ORDER BY start_time) contiguous_faults
           FROM (SELECT machineid, start_time, 
                        coalesce(end_time, DATE '9999-12-31') end_time,
                         CASE
                            WHEN start_time > MAX(coalesce(end_time, 
                                                           DATE '9999-12-31'))
                                              over(PARTITION BY machineid 
                                                   ORDER BY start_time 
                                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                                            AND 1 preceding)
                            THEN 1
                         END gap
                    FROM faults))
 GROUP BY machineid, contiguous_faults
 ORDER BY 1, 2

该查询首先确定一行是否与之前开始的任何行相邻。然后我们对连续的行进行分组。

using analytics, you can build a query that will make a single pass on the data (with a large data set this will be the most efficient):

SELECT machineid, MIN(start_time), MAX(end_time)
  FROM (SELECT machineid, start_time, end_time, 
               SUM(gap) over(PARTITION BY machineid 
                             ORDER BY start_time) contiguous_faults
           FROM (SELECT machineid, start_time, 
                        coalesce(end_time, DATE '9999-12-31') end_time,
                         CASE
                            WHEN start_time > MAX(coalesce(end_time, 
                                                           DATE '9999-12-31'))
                                              over(PARTITION BY machineid 
                                                   ORDER BY start_time 
                                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                                            AND 1 preceding)
                            THEN 1
                         END gap
                    FROM faults))
 GROUP BY machineid, contiguous_faults
 ORDER BY 1, 2

This query starts by determining if a row is contiguous to any row that started before. We then group the rows that are contiguous.

回复收藏 0 原文

凉城已无爱 2024-08-13 23:45:15

基本上，你不能在纯集理论中做到这一点（找到森林的覆盖分区集）（例如，作为没有循环的有界查询）。

要以最类似于集合的方式执行此操作，

创建一个用于林分区的临时表（10 或 11 列，4 来自故障 #1，4 来自故障 #2，1 表示分区 ID，1 表示哪个节点的轮次）被插入，1用于各种优化，我在38C的发烧下无法想到。
运行一个循环（BFS 或 DFS，无论您发现什么更容易实现森林分区算法），与图相比，棘手的部分是您可以将许多子树连接起来顶部到当前子树
您可以使用sheepsimulator的查询作为循环的基本构建块（例如查找2个连接的节点）
当分区循环完成，只需执行

   select min(p1.start_time), max(p2.end_time), p1.partition,p2.partition
   from partitions p1, partitions p2
   where p1.partition = p2.partition
   group by p1.partition,p2.partition
   

    /* This will need to be tweaked using COALESCE 
       to deal with NULL end times in obvious way) */

我只是为没有拼写森林分区的确切代码而道歉（它可能会在树分区下归档） - 我累坏了，我确信一些谷歌搜索会产生一个现在你知道的tdata 结构和问题名称（或者您可以将其作为更精确表述的 Q 发布在 StackOverflow 上 - 例如“如何实现将森林完全分区为 SQL 中的循环的算法”。

Basically, you can not do it (find a covering partition set of a forest) in pure set theory (e.g. as a bounded # of queries without a loop).

To do it in the most set-like way,

Create a temp table for forest partitioning (10 or 11 columns, 4 from failure #1, 4 from failure #2, 1 for partition ID, 1 for round in which node was inserted, and 1 for assorted optimizations I can't think of with a fever of 38C.
Run a loop (BFS or DFS, whatever you find to easier implement the forest partitioning algorithm in). The tricky part, compared to graphs, is that you can have many sub-trees joined from the top to current sub-tree
You can use sheepsimulator's query as basic building block for the loop (e.g. finding 2 connected node)
When the partitioning loop is done, simply do

   select min(p1.start_time), max(p2.end_time), p1.partition,p2.partition
   from partitions p1, partitions p2
   where p1.partition = p2.partition
   group by p1.partition,p2.partition
   

    /* This will need to be tweaked using COALESCE 
       to deal with NULL end times in obvious way) */

I apologize for not spelling the exact code for forest partitioning (it may be filed under tree partitioning) - I'm dead tired and I'm certain some Googling will yield one now that you know the tdata structure and problem name (or you can post this as a more precisely formulated Q on StackOverflow - e.g. "How to implement an algorithm for complete partitioning of a forest of trees as a loop in SQL".

回复收藏 0 原文

戏蝶舞 2024-08-13 23:45:15

SELECT  DISTINCT 
        t1.machineId, 
        MIN(t2.start_time) start_time, 
        MAX(COALESCE(t2.end_time, '3210/01/01')) end_time
FROM FAULTS t1
JOIN FAULTS t2 ON t1.machineId = t2.machineId
                  AND ((t2.start_time >= t1.start_time
                       AND (t1.end_time IS NULL OR t2.start_time <= t1.end_time)
                  )
                  OR
                  (t1.start_time >= t2.start_time 
                       AND (t2.end_time IS NULL OR t1.start_time <= t2.end_time) 
                  ))
GROUP BY t1.machineId, t1.part_id

我用以下数据测试了这个查询：

machine_id   |part_id |start_time           |end_time
-------------------------------------------------------------------------
1           |2       |05 Oct 2009 09:00:00  |NULL
1           |3       |05 Oct 2009 08:00:00  |05 Oct 2009 10:00:00
2           |2       |30 Sep 2009 12:00:00  |30 Sep 2009 14:00:00
2           |3       |30 Sep 2009 15:00:00  |30 Sep 2009 16:00:00
2           |4       |30 Sep 2009 16:00:00  |30 Sep 2009 17:00:00
3           |2       |28 Sep 2009 12:00:00  |28 Sep 2009 14:00:00
3           |4       |28 Sep 2009 13:00:00  |28 Sep 2009 15:00:00

我得到了这个：

machine_id   |start_time             |end_time
-----------------------------------------------------------------
1           |05 Oct 2009 08:00:00   |01 Jan 3210 00:00:00
2           |30 Sep 2009 12:00:00   |30 Sep 2009 14:00:00
2           |30 Sep 2009 15:00:00   |30 Sep 2009 17:00:00
3           |28 Sep 2009 12:00:00   |28 Sep 2009 15:00:00

SELECT  DISTINCT 
        t1.machineId, 
        MIN(t2.start_time) start_time, 
        MAX(COALESCE(t2.end_time, '3210/01/01')) end_time
FROM FAULTS t1
JOIN FAULTS t2 ON t1.machineId = t2.machineId
                  AND ((t2.start_time >= t1.start_time
                       AND (t1.end_time IS NULL OR t2.start_time <= t1.end_time)
                  )
                  OR
                  (t1.start_time >= t2.start_time 
                       AND (t2.end_time IS NULL OR t1.start_time <= t2.end_time) 
                  ))
GROUP BY t1.machineId, t1.part_id

I tested this query on the following data:

machine_id   |part_id |start_time           |end_time
-------------------------------------------------------------------------
1           |2       |05 Oct 2009 09:00:00  |NULL
1           |3       |05 Oct 2009 08:00:00  |05 Oct 2009 10:00:00
2           |2       |30 Sep 2009 12:00:00  |30 Sep 2009 14:00:00
2           |3       |30 Sep 2009 15:00:00  |30 Sep 2009 16:00:00
2           |4       |30 Sep 2009 16:00:00  |30 Sep 2009 17:00:00
3           |2       |28 Sep 2009 12:00:00  |28 Sep 2009 14:00:00
3           |4       |28 Sep 2009 13:00:00  |28 Sep 2009 15:00:00

I got this:

machine_id   |start_time             |end_time
-----------------------------------------------------------------
1           |05 Oct 2009 08:00:00   |01 Jan 3210 00:00:00
2           |30 Sep 2009 12:00:00   |30 Sep 2009 14:00:00
2           |30 Sep 2009 15:00:00   |30 Sep 2009 17:00:00
3           |28 Sep 2009 12:00:00   |28 Sep 2009 15:00:00

回复收藏 0 原文

带刺的爱情 2024-08-13 23:45:15

SELECT machineid, min(start_time), max(ifnull(end_time, '3000-01-01 00:00'))
FROM faults
GROUP BY machineid

应该完成这项工作（如果需要，用等效的 Oracle 函数替换 ifnull）。

SELECT machineid, min(start_time), max(ifnull(end_time, '3000-01-01 00:00'))
FROM faults
GROUP BY machineid

should do the job (replacing ifnull by the equivalent Oracle function if needed).

回复收藏 0 原文

淡紫姑娘！ 2024-08-13 23:45:15

我希望我有时间给出完整的答案，但这里有一个查找重叠停机时间的提示：

select a.machineid, a.start_time, a.end_time, b.start_time, b.end_time
from faults a,
     faults b,
where a.machineid = b.machineid
  and b.start_time >= a.start_time
  and b.start_time <= a.end_time;

I wish I had time to give a full answer, but here is a hint to find overlapping downtimes:

select a.machineid, a.start_time, a.end_time, b.start_time, b.end_time
from faults a,
     faults b,
where a.machineid = b.machineid
  and b.start_time >= a.start_time
  and b.start_time <= a.end_time;

回复收藏 0 原文

耶耶耶 2024-08-13 23:45:15

我相信你需要一个存储过程来做到这一点，或者类似递归“通用表表达式（CTE）（如 SQL srever 中存在）”，否则（在单个 SQL 语句中）你将无法得到正确的答案当 3 行或更多行一起形成覆盖日期的连续范围时。

就像：

 |----------|
           |---------------|
                        |----------------|

在没有实际进行练习的情况下，我可能建议在存储过程中构建一个所有“候选日期”的表，然后构建一个包含现有行中日期范围未涵盖的所有日期的表，然后通过“否定”该集来构建输出结果集。

I believe you would need a stored proc to do this, or something like recursive 'Common Table Expressions (CTEs) (as exists in SQL srever), or otherwise (in a single SQL Statement) you would not be able to get the right answer when 3 or more rows togeher form a contiguous range of covered dates.

like:

 |----------|
           |---------------|
                        |----------------|

Without actually going through the exercise, I might suggest that in the stored proc, build a table of all "candidate dates" and then construct a table that contains all the dates that are NOT covered by a daterange in an existing row, then build your output resultset by "negating" this set.

回复收藏 0 原文