使用 UNION 子查询的查询需要很长时间
我在某些依赖于子查询的查询上遇到了一个奇怪的问题。它们运行得快如闪电,直到我在子查询中使用 UNION 语句。然后他们就无休无止地跑,我10分钟后就给了。我现在描述的场景不是我一开始的场景,但我认为它消除了很多可能的问题,但却产生了同样的问题。因此,即使这是一个毫无意义的问题,请耐心等待!
我有一个表:
tblUser - 100,000 rows
tblFavourites - 200,000 rows
如果我执行:
SELECT COUNT(*)
FROM tblFavourites
WHERE userID NOT IN (SELECT uid FROM tblUser);
...那么它会在一秒钟内运行。但是,如果我修改它以使子查询具有 UNION,它将运行至少 10 分钟(在我放弃之前!)这是
SELECT COUNT(*)
FROM tblFavourites
WHERE userID NOT IN (SELECT uid FROM tblUser UNION SELECT uid FROM tblUser);
一个毫无意义的更改,但它应该产生相同的结果,我不明白为什么它应该需要更长的时间吗?
将子查询放入视图中并调用它具有相同的效果。
有什么想法吗?我正在使用 SQL Azure。
问题解决了。请参阅下面我的回答。
I've been having an odd problem with some queries that depend on a sub query. They run lightning fast, until I use a UNION statement in the sub query. Then they run endlessly, I've given after 10 minutes. The scenario I'm describing now isn't the original one I started with, but I think it cuts out a lot of possible problems yet yields the same problem. So even though it's a pointless query, bear with me!
I have a table:
tblUser - 100,000 rows
tblFavourites - 200,000 rows
If I execute:
SELECT COUNT(*)
FROM tblFavourites
WHERE userID NOT IN (SELECT uid FROM tblUser);
… then it runs in under a second. However, if I modify it so that the sub query has a UNION, it will run for at least 10 minutes (before I give up!)
SELECT COUNT(*)
FROM tblFavourites
WHERE userID NOT IN (SELECT uid FROM tblUser UNION SELECT uid FROM tblUser);
A pointless change, but it should yield the same result and I don't see why it should take any longer?
Putting the sub-query into a view and calling that instead has the same effect.
Any ideas why this would be? I'm using SQL Azure.
Problem solved. See my answer below.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
UNION
实际上是对组合数据集中的所有字段执行DISTINCT
。它会过滤掉最终结果中的重复内容。Uid
是否已编入索引?如果不是,则可能需要很长时间,因为查询引擎:如果不关心重复项(并使用 < code>IN 意味着它们不会),然后使用
UNION ALL
来删除昂贵的排序/过滤步骤。UNION
is really doing aDISTINCT
on all fields in the combined data set. It filters out dupes in the final results.Is
Uid
indexed? If not it may take a long time as the query engine:If duplicates aren't a concern (and using
IN
means they won't be) then useUNION ALL
which removes the expensive Sort/Filter step.UNION 生成唯一值,因此 DBMS 引擎进行排序。
在这种情况下,您可以安全地使用 UNION ALL。
UNION generate unique values, so the DBMS engine makes sorts.
You can use safely UNION ALL in this case.
UNION 通常通过内存中的临时表来实现。您实际上是将 tblUser 复制到内存中两次,没有索引。那么 tblFavourites 中的每一行都会引发超过 200,000 行的完整表扫描 - 即 200Kx200K=400 亿次双行扫描(因为查询引擎必须从两个表行获取 uid)
如果您的 tblUser 在 uid 上有索引(这绝对是正确的)因为 SQL Azure 中的所有表都必须有聚集索引),那么 tblFavourites 中的每一行都会产生非常快速的索引查找,导致仅 200Kxlog(100K) =200Kx17 = 200K 行扫描,每个扫描有 17 个 b 树索引比较(这比从数据页上的行读取 uid 快得多),因此它应该等于大约 200Kx(3- 4) 或100万次双行扫描。我相信较新版本的 SQL Server 也可能会构建一个仅包含 uid 的临时哈希表,因此本质上它会减少到 200K 行扫描(假设哈希表查找很简单)。
您还应该生成查询计划进行检查。
本质上,如果 tblUser 有索引(应该位于 SQL Azure 上),非 UNION 查询的运行速度会快大约 500,000 倍。
UNION's are usually implemented via temporary in-memory tables. You're essentially copying your tblUser two times into memory, WITH NO INDEX. Then every row in tblFavourites incur a complete table scan over 200,000 rows - that's 200Kx200K=40 billion double-row scans (because the query engine must get the uid from both table rows)
If your tblUser has an index on uid (which is definitely true because all tables in SQL Azure must have a clustered index), then each row in tblFavourites incurs a very fast index lookup, resulting in only 200Kxlog(100K) =200Kx17 = 200K row scans, each with 17 b-tree index comparisons (which is much faster than reading the uid from a row on a data page), so it should equate to roughly 200Kx(3-4) or 1 million double-row scans. I believe newer versions of SQL server may also build a temp hash table containing just the uid's, so essentially it gets down to 200K row scans (assuming hash table lookup to be trivial).
You should also generate your query plan to check.
Essentially, the non-UNION query runs around 500,000 times faster if tblUser has an index (should be on SQL Azure).
事实证明,问题是由于索引之一... tblFavourites 包含 tblUser 中主键(uid)的两个外键:
两列具有相同的定义和相同的索引,但我发现将 userId 交换为 otherUserId原始查询解决了问题。
我跑了:
……然后问题就消失了。现在查询几乎立即执行。
我不太了解 Sql Server/Azure 幕后发生的事情......但我只能想象它是损坏的索引或其他什么?我经常更新统计数据,但这没有效果。
谢谢!
---- 更新
以上并不完全正确。它确实解决了大约 20 分钟的问题,然后又恢复了。我已经与 Microsoft 支持人员联系了几天,问题似乎与 tempDB 有关。他们最终正在研究解决方案。
It turns out the problem was due to one of the indexes ... tblFavourites contained two foreign keys to the primary key (uid) in tblUser:
both columns had the same definition and same indexes, but I discovered that swapping userId for otherUserId in the original query solved the problem.
I ran:
... and the problem went away. The query now executes almost instantly.
I don't know too much about what goes on behind the scenes in Sql Server/Azure ... but I can only imagine that it was a damaged index or something? I update statistics frequently, but that had no effect.
Thanks!
---- UPDATE
The above was not fully correct. It did fix the problem for around 20 minutes, then it returned. I have been in touch with Microsoft support for several days and it seems the problem is to do with the tempDB. They are working on a solution at their end.
我刚刚遇到这个问题。我有大约 100 万行需要检查,然后我意识到我的一些 ID 位于另一个表中,因此我联合起来在一个“不存在”中获取相同的信息。我从花费大约 7 秒的查询变成在一分钟左右后仅处理 5000 行。这似乎有帮助。我绝对讨厌这个解决方案,但我尝试了很多方法,但最终都以同样极其缓慢的执行计划告终。这个在大约 18 秒内就满足了我的需要。
(是的,我尝试了“WHERE EXISTS IN...”作为最终选择...内连接速度更快)
请让我再说一遍,我个人觉得这真的很难看,但我实际上在我的过程中使用了这个 join 两次,所以从长远来看这会节省我的时间。希望这有帮助。
I just ran into this problem. I have about 1million rows to go through and then I realized that some of my IDs were in another table, so I unioned to get the same information in one "NOT EXISTS." I went from the query taking about 7 sec to processing only 5000 rows after a minute or so. This seemed to help. I absolutely hate the solution, but I've tried a multitude of things that all end up w/the same extremely slow execution plan. This one got me what I needed in about 18 sec.
(And yes I tried "WHERE EXISTS IN..." for the final select... inner join was faster)
Please let me say again, I personaly feel this is really ugly, but I actually use this join twice in my proc, so it's going to save me time in the long run. Hope this helps.
是否更有意义?
将问题从“不在该表和/或该表中出现的所有 Id 的组合列表中的用户 ID”
改写为
“用户 ID” 不在这张桌子上并且也不在那张桌子上
Doesn't it make more sense to rephrase the questions from
"UserIds that aren't on the combined list of all the Ids that apper in this table and/or that table"
to
"UserIds that aren't on this table AND aren't on that table either