Efficiently selecting the top row for each category in a set

Posted on 2024-09-04 05:23:07

I need to select the top row for each category from a known set (somewhat similar to this question). The problem is how to make this query efficient on a large number of rows.

For example, let's create a table that stores temperature recordings at several places.

CREATE TABLE #t (
    placeId int,
    ts datetime,
    temp int,
    PRIMARY KEY (ts, placeId)
)

-- insert some sample data

SET NOCOUNT ON

DECLARE @n int, @ts datetime
SELECT @n = 1000, @ts = '2000-01-01'

WHILE (@n>0) BEGIN
    INSERT INTO #t VALUES (@n % 10, @ts, @n % 37)
    IF (@n % 10 = 0) SET @ts = DATEADD(hour, 1, @ts)
    SET @n = @n - 1
END

Now I need to get the latest recording for each of the places 1, 2, 3.

This way is efficient, but doesn't scale well (and looks dirty).

SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 1
    ORDER BY ts DESC
) t1
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 2
    ORDER BY ts DESC
) t2
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 3
    ORDER BY ts DESC
) t3

The following looks better but is much less efficient (30% vs 70% of the estimated cost, according to the optimizer).

SELECT placeId, ts, temp FROM (
    SELECT placeId, ts, temp, ROW_NUMBER() OVER (PARTITION BY placeId ORDER BY ts DESC) rownum
    FROM #t
    WHERE placeId IN (1, 2, 3)
) t
WHERE rownum = 1

The problem is that the execution plan for the latter query performs a clustered index scan on #t: 300 rows are retrieved, sorted, numbered, and then filtered, leaving only 3 rows. The former query fetches a single row three times.

Is there a way to perform the query efficiently without lots of unions?

时光病人 2024-09-11 05:23:08

Don't just look at the execution plan; also look at STATISTICS IO and STATISTICS TIME.

set statistics io on
go
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 1
    ORDER BY ts DESC
) t1
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 2
    ORDER BY ts DESC
) t2
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 3
    ORDER BY ts DESC
) t3

SELECT placeId,  temp FROM (
    SELECT placeId, ts, temp, ROW_NUMBER() OVER (PARTITION BY placeId ORDER BY ts DESC) rownum
    FROM #t
    WHERE placeId IN (1, 2, 3)
) t
WHERE rownum = 1

set statistics io off
go

Table '#t000000000B99'. Scan count 3, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#t000000000B99'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

set statistics time on
go
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 1
    ORDER BY ts DESC
) t1
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 2
    ORDER BY ts DESC
) t2
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 3
    ORDER BY ts DESC
) t3

SELECT placeId,  temp FROM (
    SELECT placeId, ts, temp, ROW_NUMBER() OVER (PARTITION BY placeId ORDER BY ts DESC) rownum
    FROM #t
    WHERE placeId IN (1, 2, 3)
) t
WHERE rownum = 1

set statistics time off
go

To me there is no real difference between the two methods. Load more data and compare again.

Also, when you add an ORDER BY to both queries, the split drops to 40% vs 60%.

SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 1
    ORDER BY ts DESC
) t1
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 2
    ORDER BY ts DESC
) t2
UNION ALL
SELECT * FROM (
    SELECT TOP 1 placeId, temp
    FROM #t 
    WHERE placeId = 3
    ORDER BY ts DESC
) t3
ORDER BY placeId

SELECT placeId,  temp FROM (
    SELECT placeId,  temp, ROW_NUMBER() OVER (PARTITION BY placeId ORDER BY ts DESC) rownum
    FROM #t
    WHERE placeId IN (1, 2, 3)
) t
WHERE rownum = 1
ORDER BY placeId
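
To compare the two approaches on more data, the row-by-row WHILE loop from the question can be swapped for a set-based load. The following is only a sketch: the row count of 1,000,000 is an arbitrary assumption, it assumes SQL Server 2008 or later, and it uses a cross join of sys.all_objects purely as a convenient row source.

```sql
-- Assumption: #t already exists with the schema from the question.
TRUNCATE TABLE #t;  -- start empty so the (ts, placeId) primary key is not violated

DECLARE @rows int = 1000000;  -- hypothetical size; adjust as needed

;WITH n AS (
    -- Generate 0 .. @rows - 1; the cross join only serves as a row source
    SELECT TOP (@rows)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS i
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
INSERT INTO #t (placeId, ts, temp)
SELECT CAST(i % 10 AS int),                               -- ten places, 0..9
       DATEADD(hour, CAST(i / 10 AS int), '2000-01-01'),  -- one reading per place per hour
       CAST(i % 37 AS int)                                -- arbitrary temperature value
FROM n;
```

After loading, rerun both queries with STATISTICS IO and STATISTICS TIME enabled and compare the logical reads and elapsed times rather than the plan percentages alone.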

天暗了我发光 2024-09-11 05:23:08

I loaded 100,000 rows (which still wasn't enough to slow things down) and tried the old-fashioned way:

select t.*
 from #t t
  inner join (select placeId, max(ts) ts
               from #t
               where placeId in (1,2,3)
               group by placeId) xx
   on xx.placeId = t.placeId
    and xx.ts = t.ts

and got much the same results.

I then reversed the order of the columns in the index, to

CREATE TABLE #t ( 
    placeId int, 
    ts datetime, 
    temp int, 
    PRIMARY KEY (placeId, ts) 
)

and, in all the queries, got fewer page reads and index seeks instead of scans.

If optimization is your goal and you can modify indexes, I'd revise the primary key, or perhaps add a covering index.
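
If the original PRIMARY KEY (ts, placeId) has to stay, a covering index along the following lines is one way to get the same seek behaviour. This is a sketch only; the index name IX_t_placeId_ts is made up for illustration.

```sql
-- Hypothetical covering index: placeId leads so each place can be found by a seek,
-- ts is stored descending so the newest row per place comes first,
-- and temp is included so the query never has to touch the clustered index.
CREATE NONCLUSTERED INDEX IX_t_placeId_ts
    ON #t (placeId, ts DESC)
    INCLUDE (temp);
```

With such an index in place, each TOP 1 ... ORDER BY ts DESC lookup for a single placeId should be able to read just the first matching index row, though the actual plan is still up to the optimizer.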

獨角戲 2024-09-11 05:23:08

Just for the record, another option using CROSS APPLY.
On my configuration, it performs better than all previously mentioned ones.

SELECT *
FROM (VALUES (1),(2),(3)) t (placeId)
CROSS APPLY (
    SELECT TOP 1 ts, temp
    FROM #t 
    WHERE placeId = t.placeId
    ORDER BY ts DESC
) tt

I'd guess VALUES could be changed to a temp table or a table variable without much difference.
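
As a sketch of that variation, here is the same CROSS APPLY query driven by a table variable. The name @places and the aliases p and r are introduced here for illustration, and it has not been measured against the VALUES version.

```sql
-- Hypothetical table variable holding the places of interest
DECLARE @places TABLE (placeId int PRIMARY KEY);

INSERT INTO @places (placeId)
VALUES (1), (2), (3);

SELECT p.placeId, tt.ts, tt.temp
FROM @places p
CROSS APPLY (
    -- Latest recording for the current place
    SELECT TOP (1) r.ts, r.temp
    FROM #t r
    WHERE r.placeId = p.placeId
    ORDER BY r.ts DESC
) tt;
```

Because CROSS APPLY only returns rows where the inner query produces a result, a place with no recordings in #t is simply left out; OUTER APPLY would keep it with NULLs instead.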
