Sybase experts needed: GROUP BY aggregation performance problem
Hey, I have the following tables and SQL:
T1: ID, col2, col3 - PK(ID) - 23 mil rows
T2: ID, col2, col3 - PK(ID) - 23 mil rows
T3: ID, name, value - PK(ID, name) - 66 mil rows
1) The SQL below returns the 10k-row result set very fast, no problems.
select top 10000 T1.col2, T2.col2, T3.name, T3.value
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
2) The SQL below took FOREVER.
select top 10000 T1.col2, T2.col2,
ABC = min(case when T3.name = 'ABC' then T3.value end),
XYZ = min(case when T3.name = 'XYZ' then T3.value end)
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
group by T1.col2, T2.col2
The only difference in the showplan between these two queries is the section below, from query 2). I don't understand it 100%: is it selecting the ENTIRE result set, WITHOUT the top 10000, into the worktable and then doing the group by on it? Is that why it's slow?
STEP 1
The type of query is SELECT (into Worktable1).
GROUP BY
Evaluate Grouped MINIMUM AGGREGATE.
FROM TABLE ...etc..
TO TABLE
Worktable1.
STEP 2
The type of query is SELECT.
FROM TABLE
Worktable1.
Nested iteration.
Table Scan.
Forward scan.
Positioning at start of table.
Using I/O Size 16 Kbytes for data pages.
With MRU Buffer Replacement Strategy for data pages.
My questions are:
1) Why is query 2) so slow?
2) How do I fix it while keeping the query logic the same, preferably still as a single SELECT as before?
Thank you
Comments (1)
Although possibly a generic answer, I'd say to put an index on the columns you're grouping by.
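Something along these lines (the index names here are my own placeholders, not from the original post):

-- Hypothetical index names; index the columns that appear in the GROUP BY
create index T1_col2_idx on T1 (col2)
create index T2_col2_idx on T2 (col2)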
Edit / Revise: Here's my theory after taking another look at the issue. The SELECT statement in a query is always the last part executed. This makes sense, as it is the statement that retrieves the values you want from the dataset specified below it. In your query, the whole dataset (millions of records) will be evaluated for the MIN value expressions that you specified. Two separate functions will be called on the entire dataset, since you have specified two MIN columns in the select statement. Only after the dataset has been filtered and the MIN columns determined will the top 10000 rows be selected.
In a nutshell, you're doing two mathematical functions on millions of records. This will take a significant amount of time, especially with no indexes.
The solution for you would be to use a derived table. I haven't compiled the code below, but it's something close to what you would use. It will only take the min values of the 10,000 records rather than the whole dataset.
I.e.
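Something close to this (assuming your ASE version accepts top inside a derived table; the dt, c1, and c2 names are just placeholders):

-- Limit the join to 10,000 rows first, then aggregate only those rows.
select dt.c1, dt.c2,
ABC = min(case when dt.name = 'ABC' then dt.value end),
XYZ = min(case when dt.name = 'XYZ' then dt.value end)
from (select top 10000
T1.col2 as c1, T2.col2 as c2, T3.name as name, T3.value as value
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID
and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE') dt
group by dt.c1, dt.c2
-- Note: this caps the rows fed into the GROUP BY, so the outer query
-- may return fewer than 10,000 groups.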