连接与相关存在子查询

发布于 2024-10-07 12:30:53 字数 564 浏览 6 评论 0原文

在 Derby 和/或 MySQL 中，我想知道以下性能：

select c0.documentid from contentblob as c0 where c0.documentid > ? 
  and c0.blobid = ? 
  and exists (select * from contentblob c1 where c1.documentid = c0.documentid 
              and c1.blobid = ?)
  order by c0.documentid

请注意，将为两个 blobid 比较提供不同的值。此查询的作用是返回表中（至少）两次出现的 id：一次使用 blobid1，一次使用 blobid2。

与等效的自连接相比，

当然，假设 contentblob 表的 documentid 和 blobid 列上有索引。

如果不清楚，这里的相关模式是一个包含两个 varchar 列（documentid 和 blobid）的表。它们组合起来形成主键。

如果除了查询数据库中的查询分析器之外别无他法，我会这样做。

原文

In Derby and/or MySQL, I am wondering about the performance of:

select c0.documentid from contentblob as c0 where c0.documentid > ? 
  and c0.blobid = ? 
  and exists (select * from contentblob c1 where c1.documentid = c0.documentid 
              and c1.blobid = ?)
  order by c0.documentid

Note that DIFFERENT values will be supplied for the two blobid comparisons. What this query does is return the id's that occur in the table (at least) twice: once with blobid1 and once with blobid2.

compared to the equivalent self-join,

assuming, of course, indices on the documentid and blobid columns of the contentblob table.

In case it isn't clear, the relevant schema here is a table with two varchar columns, documentid and blobid. They are composed to form the primary key.

If there's nothing for it but to inquire of the query analyzer in the db, I'll go do that.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

晨曦慕雪 2024-10-14 12:30:53

一般来说，相关子查询比联接（甚至是外部自联接）更昂贵，但在这种情况下，您所需要的只是常规的内部自联接：

SELECT c0.documentid
  FROM contentblob AS c0
  JOIN contentblob AS c1 ON c1.documentid = c0.documentid
 WHERE c0.documentid > ? 
   AND c0.blobid = ?
   AND c1.blobid = ?
   AND c0.blobid < c1.blobid
 ORDER BY c0.documentid

最后一个条件避免看到具有相同文档 ID 的一对行的两个条目以及相关的 blob ID 值；如果没有它，您将看到文档 ID 两次，一次为 c0.blobid = v1，一次为 c0.blobid = v2。

您可能希望将 c1.blobid 条件移至 ON 子句中，但优化器无论如何都应该为您执行此操作。还要检查您的优化器是否使用适当的索引。目前尚不完全清楚该索引是什么（但我认为以 blobid 作为前导列的索引可能性能最佳），但如果表很大并且优化器使用表的顺序扫描，则会遇到问题。

Generally, correlated sub-queries are more expensive than joins, even outer self-joins, though in this case all you need is a regular inner self-join:

SELECT c0.documentid
  FROM contentblob AS c0
  JOIN contentblob AS c1 ON c1.documentid = c0.documentid
 WHERE c0.documentid > ? 
   AND c0.blobid = ?
   AND c1.blobid = ?
   AND c0.blobid < c1.blobid
 ORDER BY c0.documentid

The last condition avoids seeing two entries for a pair of rows with the same document ID and the relevant blob ID values; without it, you'd see the document ID twice, once for c0.blobid = v1 and once for c0.blobid = v2.

You might want to move the c1.blobid condition into the ON clause, but the optimizer should do that for you anyway. Also check that your optimizer is using an appropriate index. It isn't entirely clear what that index would be (but I think an index with blobid as the leading column would probably perform best), but if the tables are large and the optimizer is using a sequential scan of the table, you have problems.

回复收藏 0 原文

初吻给了烟 2024-10-14 12:30:53

我认为左连接会更清晰

  select c0.documentid from contentblob as c0 where c0.documentid > ? 
  and c0.blobid = ? 
  left join contentblob c1 on c0.documentid = c1.documentid 
       and c1.blobid = ? 
  order by c0.documentid

里程会因不同的数据库引擎而异，并且也会随表大小、索引基数等而变化。但我的直觉是连接会执行得更好。

I think the left join would be clearer

  select c0.documentid from contentblob as c0 where c0.documentid > ? 
  and c0.blobid = ? 
  left join contentblob c1 on c0.documentid = c1.documentid 
       and c1.blobid = ? 
  order by c0.documentid

Mileage will vary for different database engines, and, will also vary with table sizes, index cardinality etc. but my gut feeling is the join would perform better.

回复收藏 0 原文

~没有更多了~