高效地从 2 类表中查找已删除的记录

发布于 2024-08-16 19:50:24 字数 621 浏览 10 评论 0原文

我有一个类型 2 数据的数据库表,我想查找自上次与其同步以来删除的记录。它有 date_fromdate_to 列,原始数据有一个 ID 列 object_iddate_to<>null 表示它现在不存在,因此如果没有其他记录具有相同的 object_iddate_to=null,则它已被删除。

我相信一个幼稚的实现会是这样的:

select * from data_t2 a
where a.date_to > last_sync_date and a.date_to < current_date()
and not exists (select * from data_t2 b
                where b.date_to is null and b.object_id = a.object_id);

但显然这将是非常昂贵的。

我是否缺少一种明显更有效的方法?我怀疑没有(或者更确切地说,我应该假设删除的记录相对较少,并在 RDBMS 之外进行一些计算),但我想我会问以防万一。

谢谢!

I've got a database table of type-2 data, and I want to find records that were deleted since I last synced with it. It's got date_from and date_to columns, and the raw data has an ID column object_id. date_to<>null means it doesn't exist now, so if there's no other record with the same object_id and date_to=null, then it's been deleted.

I believe a naive implementation would be something like:

select * from data_t2 a
where a.date_to > last_sync_date and a.date_to < current_date()
and not exists (select * from data_t2 b
                where b.date_to is null and b.object_id = a.object_id);

but obviously that's going to be ridiculously expensive.

Is there an obvious more efficient way that I'm missing? I suspect there isn't (or rather, that I should assume there are relatively few deleted records, and do some of the computation outside the RDBMS), but I figured I'd ask just in case.

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

病毒体 2024-08-23 19:50:24

在开始调整查询之前,您确实应该运行 EXPLAIN 或应用一些其他诊断。否则您无法看到重写如何更改计划

您可以使用外连接重写它。例如,在 MySQL 中,这将比子查询快得多:

SELECT    * 
FROM      data_t2 a
LEFT JOIN data_t2 b
ON        a.object_id = b.object_id
AND       b.date_to IS NULL
WHERE     a.date_to > last_sync_date 
AND       a.date_to < current_date()    
AND       b.object_id IS NULL

如果维度表非常大,并且有一个以 date_to 作为第一列的索引,并且 date_to IS NULL 的行数只是整个索引的一小部分表,这可能会更快:

SELECT    * 
FROM      data_t2 a
LEFT JOIN (
          SELECT object_id
          FROM   data_t2 b
          WHERE  b.date_to IS NULL
          )
ON        a.object_id = b.object_id
WHERE     a.date_to > last_sync_date 
AND       a.date_to < current_date()    
AND       b.object_id IS NULL

Before you start tuning the query, you really should run EXPLAIN or apply some other diagnostics. Otherwise you cannot see how a rewrite changes the plan

You can rewrite this with an outer join. In for example MySQL, this will be much faster than the subquery:

SELECT    * 
FROM      data_t2 a
LEFT JOIN data_t2 b
ON        a.object_id = b.object_id
AND       b.date_to IS NULL
WHERE     a.date_to > last_sync_date 
AND       a.date_to < current_date()    
AND       b.object_id IS NULL

If the dimension table is really large, and there is an index that has date_to as first column, and the number of rows having date_to IS NULL is a small fraction of the entire table, this might be faster still:

SELECT    * 
FROM      data_t2 a
LEFT JOIN (
          SELECT object_id
          FROM   data_t2 b
          WHERE  b.date_to IS NULL
          )
ON        a.object_id = b.object_id
WHERE     a.date_to > last_sync_date 
AND       a.date_to < current_date()    
AND       b.object_id IS NULL
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文