如何找出哪些行已被删除？

发布于 2024-10-27 14:33:33 字数 812 浏览 5 评论 0原文

带有 SE 引擎的 Informix-SQL 7.32：

我有一位客户从 SE 表中删除了几行。（我没有使用事务日志记录或审计）。该表有一个序列列。我想创建一个 Ace 报告来打印缺少的序列列。我尝试了以下快速而肮脏的报告，但它不起作用！..你能建议更好的方法吗？

define
variable next_id integer
end

  select tbl_id 
    from tbl
order by tbl_id {I'm ordering tbl_id because all the rows are periodically re-clustered}
     end        {by an fk_id in order to group all rows belonging to the same customer}

format
on every row
let next_id = tbl_id + 1  

after group of tbl_id
if tbl_id + 1 <> next_id then
print column 1, tbl_id + 1 using "######"

end

或者创建一个临时表，其中包含 1 到 5000 之间的连续数字的 INT 列，并执行如下所示的选择语句：

   SELECT tbl_id 
     FROM tbl
    WHERE tbl_id NOT IN
                 (SELECT tmp_int
                    FROM tmp);

或使用 HAVING、OUTER 等的选择语句。

原文

Informix-SQL 7.32 with SE engine:

I have a customer who deleted several rows from an SE table. (I'm not using transaction logging or audit). The table has a serial column. I would like to create an Ace report to print the missing serial columns. I tried the following quick and dirty report, but it didn't work!.. can you suggest a better way?

define
variable next_id integer
end

  select tbl_id 
    from tbl
order by tbl_id {I'm ordering tbl_id because all the rows are periodically re-clustered}
     end        {by an fk_id in order to group all rows belonging to the same customer}

format
on every row
let next_id = tbl_id + 1  

after group of tbl_id
if tbl_id + 1 <> next_id then
print column 1, tbl_id + 1 using "######"

end

or maybe create a temporary table with an INT column containing sequential numbers from 1 to 5000 and do a select statement like:

   SELECT tbl_id 
     FROM tbl
    WHERE tbl_id NOT IN
                 (SELECT tmp_int
                    FROM tmp);

or a select statement with HAVING, OUTER, etc.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

谎言 2024-11-03 14:33:33

由于这是 SE，我们必须使用老式表示法，而不是 SQL-92 JOIN 表示法。

下面的四个查询是两个可能答案的共同基础：

SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
  FROM tbl AS t1, OUTER tbl AS t2
 WHERE t1.tbl_id + 1 = t2.tbl_id
  INTO TEMP x1;

SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
  FROM tbl AS t1, OUTER tbl AS t2
 WHERE t1.tbl_id - 1 = t2.tbl_id
  INTO TEMP x2;

SELECT tbl_id AS hi_range
  FROM x1
 WHERE ind IS NULL
  INTO TEMP x3;

SELECT tbl_id AS lo_range
  FROM x2
 WHERE ind IS NULL
  INTO TEMP x4;

表 x3 和 x4 现在（分别）包含 tbl_id 的值，这些值没有直接后继也没有直接前驱。每个值都是 tbl_id 值的连续范围的开始或结束。在 IDS（而不是 SE）中，您可以使用标准 SQL OUTER JOIN 表示法，并在两个查询（而不是四个）中过滤联接结果；在东南部你没有那么奢侈。

具有二次（或更糟）行为的非解决方案

现在您只需弄清楚如何组合两个表：

SELECT t1.lo_range, t2.hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t1.lo_range <= t2.hi_range
   AND NOT EXISTS
       (SELECT t3.lo_range, t4.hi_range
          FROM x4 AS t3, x3 AS t4
         WHERE t3.lo_range <= t4.hi_range
           AND t1.lo_range =  t3.lo_range
           AND t2.hi_range >  t4.hi_range
       );

此查询的主要部分出现两次并生成范围开始小于或等于的所有行对范围的末尾（等于允许“范围”由一个值组成，两侧都有已删除的行）。 NOT EXISTS 子句确保不存在具有相同起始值和较小最终值的其他对。

如果数据中有很多间隙，临时表上的查询可能不会很快；如果间隙很少，那么应该没问题。

最后一个查询在范围数量方面表现出二次行为。当我只有十几个范围时，这很好（亚秒响应时间）；当我有 1,200 个范围时，情况不太好 - 没有在合理的时间内完成。

避免二次行为

既然二次行为不好，我们如何重新表述查询...

对于范围的每个低端，找到大于或等于低端的范围的最小高端，或者在 SQL 中：

SELECT t1.lo_range, MIN(t2.hi_range) AS hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t2.hi_range >= t1.lo_range
 GROUP BY t1.lo_range;

请注意，这可以轻松地合并到 ACE 报告中。它为您提供了存在的数字范围，而不是不存在的数字范围。您可以弄清楚如何生成另一个。

计时

在具有 22100 行且数据中包含 1200 个间隙的表上，该方法表现得相当好。在基准模式 (-B) 下使用（我的）SQLCMD 程序，并将 SELECT 输出发送到 /dev/null，并使用在 MacOS X 10.6.7（MacBook Pro、Intel Core 2 Duo，3 GHz 和4 GB RAM），结果是：

$ sqlcmd -d stores -B -f gaps.sql
+ CLOCK START;
2011-03-31 18:44:39
+ BEGIN;
Time: 0.000588
2011-03-31 18:44:39
+ SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
  FROM tbl AS t1, OUTER tbl AS t2
 WHERE t1.tbl_id + 1 = t2.tbl_id
  INTO TEMP x1;
Time: 0.437521
2011-03-31 18:44:39
+ SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
   FROM tbl AS t1, OUTER tbl AS t2
  WHERE t1.tbl_id - 1 = t2.tbl_id
   INTO TEMP x2;
Time: 0.315050
2011-03-31 18:44:39
+ SELECT tbl_id AS hi_range
  FROM x1
 WHERE ind IS NULL
  INTO TEMP x3;
Time: 0.012510
2011-03-31 18:44:39
+ SELECT tbl_id AS lo_range
  FROM x2
 WHERE ind IS NULL
  INTO TEMP x4;
Time: 0.008754
+ output "/dev/null";
2011-03-31 18:44:39
+ SELECT t1.lo_range, MIN(t2.hi_range) AS hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t2.hi_range >= t1.lo_range
 GROUP BY t1.lo_range;
Time: 0.561935
+ output "/dev/stdout";
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x1;
22100
Time: 0.001171
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x2;
22100
Time: 0.000685
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x3;
1200
Time: 0.000590
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x4;
1200
Time: 0.000768
2011-03-31 18:44:40
+ SELECT t1.lo_range, MIN(t2.hi_range) AS hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t2.hi_range >= t1.lo_range
 GROUP BY t1.lo_range
 INTO TEMP x5;
Time: 0.529420
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x5;
1200
Time: 0.001155
2011-03-31 18:44:40
+ ROLLBACK;
Time: 0.329379
+ CLOCK STOP;
Time: 2.202523
$

可以；不到几秒钟的处理时间。

Since this is SE, we have to use the old-fashioned notation, not the SQL-92 JOIN notations.

The four queries that follow are a common foundation for the two possible answers:

SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
  FROM tbl AS t1, OUTER tbl AS t2
 WHERE t1.tbl_id + 1 = t2.tbl_id
  INTO TEMP x1;

SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
  FROM tbl AS t1, OUTER tbl AS t2
 WHERE t1.tbl_id - 1 = t2.tbl_id
  INTO TEMP x2;

SELECT tbl_id AS hi_range
  FROM x1
 WHERE ind IS NULL
  INTO TEMP x3;

SELECT tbl_id AS lo_range
  FROM x2
 WHERE ind IS NULL
  INTO TEMP x4;

The tables x3 and x4 now contain (respectively) the values for tbl_id that have no immediate successor and no immediate predecessor. Each value is the start or end of a contiguous ranges of tbl_id values. In IDS instead of SE, you could use the standard SQL OUTER JOIN notation and filter the results of the join in two queries instead of four; you do not have that luxury in SE.

Non-solution with quadratic (or worse) behaviour

Now you just have to work out how to combine the two tables:

SELECT t1.lo_range, t2.hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t1.lo_range <= t2.hi_range
   AND NOT EXISTS
       (SELECT t3.lo_range, t4.hi_range
          FROM x4 AS t3, x3 AS t4
         WHERE t3.lo_range <= t4.hi_range
           AND t1.lo_range =  t3.lo_range
           AND t2.hi_range >  t4.hi_range
       );

The main part of this query occurs twice and generates all pairs of rows where the start of the range is less than or equal to the end of the range (equal allows for 'ranges' consisting of one value on its own, with deleted rows on either side). The NOT EXISTS clause ensures that there is no other pair with the same start value and a smaller end value.

The queries on the temp tables may not be very fast if there are many gaps in the data; if there are very few gaps, then they should be OK.

The last query exhibits quadratic behaviour in terms of the number of ranges. When I had just a dozen ranges, it was fine (sub-second response time); when I had 1,200 ranges, it was not OK - did not complete in a reasonable time.

Avoiding quadratic behaviour

Since quadratic behaviour is not good, how can we rephrase the query...

For each low end of the range, find the minimum high end of a range that is greater than or equal to the low end, or in SQL:

SELECT t1.lo_range, MIN(t2.hi_range) AS hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t2.hi_range >= t1.lo_range
 GROUP BY t1.lo_range;

Note that this can easily be incorporated into an ACE report. It gives you the ranges of number present - not those which are absent. You can work out how to generate the other.

Timing

That performed pretty well on a table with 22100 rows containing 1200 gaps in the data. Using (my) SQLCMD program in its benchmark mode (-B), and sending SELECT output to /dev/null, and using IDS 11.70.FC1 run on MacOS X 10.6.7 (MacBook Pro, Intel Core 2 Duo at 3 GHz and 4 GB RAM), the results were:

$ sqlcmd -d stores -B -f gaps.sql
+ CLOCK START;
2011-03-31 18:44:39
+ BEGIN;
Time: 0.000588
2011-03-31 18:44:39
+ SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
  FROM tbl AS t1, OUTER tbl AS t2
 WHERE t1.tbl_id + 1 = t2.tbl_id
  INTO TEMP x1;
Time: 0.437521
2011-03-31 18:44:39
+ SELECT t1.tbl_id AS tbl_id, t2.tbl_id AS ind
   FROM tbl AS t1, OUTER tbl AS t2
  WHERE t1.tbl_id - 1 = t2.tbl_id
   INTO TEMP x2;
Time: 0.315050
2011-03-31 18:44:39
+ SELECT tbl_id AS hi_range
  FROM x1
 WHERE ind IS NULL
  INTO TEMP x3;
Time: 0.012510
2011-03-31 18:44:39
+ SELECT tbl_id AS lo_range
  FROM x2
 WHERE ind IS NULL
  INTO TEMP x4;
Time: 0.008754
+ output "/dev/null";
2011-03-31 18:44:39
+ SELECT t1.lo_range, MIN(t2.hi_range) AS hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t2.hi_range >= t1.lo_range
 GROUP BY t1.lo_range;
Time: 0.561935
+ output "/dev/stdout";
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x1;
22100
Time: 0.001171
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x2;
22100
Time: 0.000685
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x3;
1200
Time: 0.000590
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x4;
1200
Time: 0.000768
2011-03-31 18:44:40
+ SELECT t1.lo_range, MIN(t2.hi_range) AS hi_range
  FROM x4 AS t1, x3 AS t2
 WHERE t2.hi_range >= t1.lo_range
 GROUP BY t1.lo_range
 INTO TEMP x5;
Time: 0.529420
2011-03-31 18:44:40
+ SELECT COUNT(*) FROM x5;
1200
Time: 0.001155
2011-03-31 18:44:40
+ ROLLBACK;
Time: 0.329379
+ CLOCK STOP;
Time: 2.202523
$

It will do; less than a couple of seconds processing time.

回复收藏 0 原文