获取每个组的顶部1行

发布于 2025-02-11 02:01:31 字数 2436 浏览 5 评论 0 原文

我有一个表,我想为每个组获得最新条目。这是表:

document statuslogs

ID documentID 状态 Datecreated
2 1 S1 7/29/2011
3 1 S2 7/30/2011
6 1 S1 8/02/2011
1 2 S1 7/28/2011
4 2 S2 7/30/2011
5 2 S3 8/01/2011
6 3 S1 8/02/2011

该表将由 docuct> document> documentID ,并由 datecreated 在下降时进行分类。命令。对于每个 documentID ,我想获得最新的状态。

我首选的输出:

DocumentID 状态 DateCreated
1 S1 8/02/2011
2 S3 8/01/2011
3 S1 8/02/2011
  • 是否有任何聚合功能可以从每个组中获得顶部?请参阅pseudo-code getonlythetop 下面:

     选择
        DocumentID,
        getonlythetop(状态),
        getonlythetop(dateCreated)
      来自DocumentStatusLogs
      documentID的组
      通过DateCreated Desc订购
     
  • 如果不存在此类功能,我有什么办法可以实现我想要的输出?

  • 或首先,这可能是由未正常的数据库引起的吗?我在想,既然我要寻找的只是一行,那么状态也应该位于父表中吗?

表以获取更多信息:

当前文档

documentID 标题 ... dateCreated datecreated
1 titlea ... ...
2 titleb ... ...
3 titlec ... ...

请参阅父 表是这样,以便我可以轻松地访问其状态?

DocumentID 标题 内容 DateCreated CurrentStatus
1 Titlea ... ... S1
2 标题B S3 S1 3
Titlec ... ... ...

更新 我刚刚学会了如何使用“应用”,这使得更容易解决此类问题。

I have a table which I want to get the latest entry for each group. Here's the table:

DocumentStatusLogs Table

ID DocumentID Status DateCreated
2 1 S1 7/29/2011
3 1 S2 7/30/2011
6 1 S1 8/02/2011
1 2 S1 7/28/2011
4 2 S2 7/30/2011
5 2 S3 8/01/2011
6 3 S1 8/02/2011

The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.

My preferred output:

DocumentID Status DateCreated
1 S1 8/02/2011
2 S3 8/01/2011
3 S1 8/02/2011
  • Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:

      SELECT
        DocumentID,
        GetOnlyTheTop(Status),
        GetOnlyTheTop(DateCreated)
      FROM DocumentStatusLogs
      GROUP BY DocumentID
      ORDER BY DateCreated DESC
    
  • If such function doesn't exist, is there any way I can achieve the output I want?

  • Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?

Please see the parent table for more information:

Current Documents Table

DocumentID Title Content DateCreated
1 TitleA ... ...
2 TitleB ... ...
3 TitleC ... ...

Should the parent table be like this so that I can easily access its status?

DocumentID Title Content DateCreated CurrentStatus
1 TitleA ... ... s1
2 TitleB ... ... s3
3 TitleC ... ... s1

UPDATE
I just learned how to use "apply" which makes it easier to address such problems.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(17

神妖 2025-02-18 02:01:31
WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

如果您期望每天2个条目,那么这将任意选择一个。要获得这两个条目一天,请使用dense_rank代替row_number。

至于归一化,是否要:是否要:

  • 在2个位置保持状态
  • 保存状态历史记录
  • ...

如它所处的状态,您保留了状态历史记录。如果您也希望在父表中(这是非正式化)中的最新状态,则需要一个触发器来维持父母的状态。或丢弃此状态历史记录表。

WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead of ROW_NUMBER.

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.

谁的新欢旧爱 2025-02-18 02:01:31

我刚刚学会了如何使用交叉应用。这是在这种情况下使用它的方法:

 select d.DocumentID, ds.Status, ds.DateCreated 
 from Documents as d 
 cross apply 
     (select top 1 Status, DateCreated
      from DocumentStatusLogs 
      where DocumentID = d.DocumentId
      order by DateCreated desc) as ds

I just learned how to use cross apply. Here's how to use it in this scenario:

 select d.DocumentID, ds.Status, ds.DateCreated 
 from Documents as d 
 cross apply 
     (select top 1 Status, DateCreated
      from DocumentStatusLogs 
      where DocumentID = d.DocumentId
      order by DateCreated desc) as ds
︶ ̄淡然 2025-02-18 02:01:31

我知道这是一个旧的线程,但是带有领带的顶级1 解决方案非常好,可能有助于阅读解决方案。

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

选择带绑带的顶部1 子句告诉SQL Server您要返回每个组的第一行。但是SQL Server如何知道如何将数据分组?这是row_number()上的订单(通过DateCreated desc 通过DocumentID订单进行分区的分区。 根据列进行排序。

在每个组中,将 href =“ https://learn.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql” rel =“ noreferrer”>在这里。

I know this is an old thread but the TOP 1 WITH TIES solutions is quite nice and might be helpful to some reading through the solutions.

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

The select top 1 with ties clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group up the data? This is where the order by row_number() over (partition by DocumentID order by DateCreated desc comes in. The column/columns after partition by defines how SQL Server groups up the data. Within each group, the rows will be sorted based on the order by columns. Once sorted, the top row in each group will be returned in the query.

More about the TOP clause can be found here.

多像笑话 2025-02-18 02:01:31

我在这里对各种建议进行了一些时间,结果确实取决于所涉及的表的大小,但是最一致的解决方案是使用交叉应用这些测试,这些测试是针对SQL Server 2008-R2,使用带有的表格6,500个记录,另一个(相同的模式),有1.37亿记录。要查询的列是表格上主要键的一部分,并且表宽度很小(约30个字节)。 SQL Server从实际执行计划报告了时间。

Query                                  Time for 6500 (ms)    Time for 137M(ms)

CROSS APPLY                                    17.9                17.9
SELECT WHERE col = (SELECT MAX(COL)…)           6.6               854.4
DENSE_RANK() OVER PARTITION                     6.6               907.1

我认为,真正令人惊讶的是,无论涉及的行数量如何,十字架的时间都有多大的一致性。

I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using the CROSS APPLY These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.

Query                                  Time for 6500 (ms)    Time for 137M(ms)

CROSS APPLY                                    17.9                17.9
SELECT WHERE col = (SELECT MAX(COL)…)           6.6               854.4
DENSE_RANK() OVER PARTITION                     6.6               907.1

I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.

人生百味 2025-02-18 02:01:31

如果您担心性能,也可以使用 max()

SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)

row_number()需要您选择中的所有行代码>语句,而 max 没有。应该大大加快查询。

If you're worried about performance, you can also do this with MAX():

SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)

ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. Should drastically speed up your query.

初与友歌 2025-02-18 02:01:31

这是关于该主题的最容易找到的问题之一,因此我想对IT给出现代答案(既供我的参考和帮助他人)。通过使用 first_value 可以简要介绍上述查询:

Select distinct DocumentID
  , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
  , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs

这应该在SQL Server 2012及以上。 first_value 可以将其视为完成在使用子句时选择时选择顶部1 的方法。 Over 允许在选择列表中进行分组,因此不要编写嵌套的子查询(就像许多现有答案一样),而是以更可读性的方式做到这一点。

This is one of the most easily found question on the topic, so I wanted to give a modern answer to the it (both for my reference and to help others out). By using first_value and over you can make short work of the above query:

Select distinct DocumentID
  , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
  , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs

This should work in SQL Server 2012 and up. First_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion.

分開簡單 2025-02-18 02:01:31
SELECT * FROM
DocumentStatusLogs JOIN (
  SELECT DocumentID, MAX(DateCreated) DateCreated
  FROM DocumentStatusLogs
  GROUP BY DocumentID
  ) max_date USING (DocumentID, DateCreated)

什么数据库服务器?该代码对所有代码都不适用。

关于您的问题的后半部分,对我来说似乎是合理的。您可以将 DocumentStatusLogs 作为日志,但仍将最新信息存储在主表中。

顺便说一句,如果您已经在文档表中拥有 dateCreated 列,则可以使用该表(只要 dateCreated ,代码> documentStatusLogs )。

编辑:MSSQL不支持使用,因此将其更改为:

ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
SELECT * FROM
DocumentStatusLogs JOIN (
  SELECT DocumentID, MAX(DateCreated) DateCreated
  FROM DocumentStatusLogs
  GROUP BY DocumentID
  ) max_date USING (DocumentID, DateCreated)

What database server? This code doesn't work on all of them.

Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.

BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).

Edit: MsSQL does not support USING, so change it to:

ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
如歌彻婉言 2025-02-18 02:01:31

这是三种单独解决问题的方法,以及每个查询的最佳索引选择(请自己尝试索引,查看逻辑上的阅读,经过的时间,执行计划。我从经验上提供了建议此类查询无执行此特定问题)。

方法1 :使用row_number()。如果Rowstore索引无法提高性能,则可以尝试使用非集合/聚类的列店索引,例如具有聚合和分组的查询以及所有时间都在不同列中订购的表,Columnstore Index通常是最佳选择。

;WITH CTE AS
    (
       SELECT   *,
                RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
       FROM     DocumentStatusLogs
    )
    SELECT  ID      
        ,DocumentID 
        ,Status     
        ,DateCreated
    FROM    CTE
    WHERE   RN = 1;

方法2 :使用first_value。如果Rowstore索引无法提高性能,则可以尝试使用非集合/聚类的列店索引,例如具有聚合和分组的查询以及所有时间都在不同列中订购的表,Columnstore Index通常是最佳选择。

SELECT  DISTINCT
    ID      = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DocumentID
    ,Status     = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DateCreated    = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM    DocumentStatusLogs;

方法3 :使用交叉应用。在覆盖查询中使用的列的DocumentStatusLogs表上创建Rowstore索引应足以覆盖查询,而无需列仓库索引。

SELECT  DISTINCT
    ID      = CA.ID
    ,DocumentID = D.DocumentID
    ,Status     = CA.Status 
    ,DateCreated    = CA.DateCreated
FROM    DocumentStatusLogs D
    CROSS APPLY (
            SELECT  TOP 1 I.*
            FROM    DocumentStatusLogs I
            WHERE   I.DocumentID = D.DocumentID
            ORDER   BY I.DateCreated DESC
            ) CA;

Here are 3 separate approaches to the problem in hand along with the best choices of indexing for each of those queries (please try out the indexes yourselves and see the logical read, elapsed time, execution plan. I have provided the suggestions from my experience on such queries without executing for this specific problem).

Approach 1: Using ROW_NUMBER(). If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.

;WITH CTE AS
    (
       SELECT   *,
                RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
       FROM     DocumentStatusLogs
    )
    SELECT  ID      
        ,DocumentID 
        ,Status     
        ,DateCreated
    FROM    CTE
    WHERE   RN = 1;

Approach 2: Using FIRST_VALUE. If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.

SELECT  DISTINCT
    ID      = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DocumentID
    ,Status     = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DateCreated    = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM    DocumentStatusLogs;

Approach 3: Using CROSS APPLY. Creating rowstore index on DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without need of a columnstore index.

SELECT  DISTINCT
    ID      = CA.ID
    ,DocumentID = D.DocumentID
    ,Status     = CA.Status 
    ,DateCreated    = CA.DateCreated
FROM    DocumentStatusLogs D
    CROSS APPLY (
            SELECT  TOP 1 I.*
            FROM    DocumentStatusLogs I
            WHERE   I.DocumentID = D.DocumentID
            ORDER   BY I.DateCreated DESC
            ) CA;
青萝楚歌 2025-02-18 02:01:31

这是一个古老的线程,但是我想我要把我的两分钱扔给公认的答案对我来说并不特别好。我在一个大数据集中尝试了GBN的解决方案,发现它非常慢(在SQL Server 2012中的500万加唱片中45秒; 45秒)。从执行计划中看,很明显,问题是它需要进行分类操作,从而大大减慢了事情。

这是我从实体框架中提出的替代方法,它不需要排序操作,并且进行了非群集索引搜索。这将执行时间降低到<上述记录集2秒。

SELECT 
[Limit1].[DocumentID] AS [DocumentID], 
[Limit1].[Status] AS [Status], 
[Limit1].[DateCreated] AS [DateCreated]
FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
    FROM (SELECT 
        [Extent2].[ID] AS [ID], 
        [Extent2].[DocumentID] AS [DocumentID], 
        [Extent2].[Status] AS [Status], 
        [Extent2].[DateCreated] AS [DateCreated]
        FROM [dbo].[DocumentStatusLogs] AS [Extent2]
        WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
    )  AS [Project2]
    ORDER BY [Project2].[ID] DESC) AS [Limit1]

现在,我假设原始问题中未完全指定的内容,但是如果您的表设计使您的ID列是自动插入ID,并且dateCreated用每个插入设置为当前日期,则不用上面的查询运行,您实际上可以从上订购ID而不是在DateCreated ,因此可以从GBN的解决方案(大约是执行时间的一半)获得可观的性能提升,因为这将提供相同的排序订单和排序订单和这是更快的。

This is quite an old thread, but I thought I'd throw my two cents in just the same as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan it's obvious that the issue is that it requires a SORT operation which slows things down significantly.

Here's an alternative that I lifted from the entity framework that needs no SORT operation and does a NON-Clustered Index search. This reduces the execution time down to < 2 seconds on the aforementioned record set.

SELECT 
[Limit1].[DocumentID] AS [DocumentID], 
[Limit1].[Status] AS [Status], 
[Limit1].[DateCreated] AS [DateCreated]
FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
    FROM (SELECT 
        [Extent2].[ID] AS [ID], 
        [Extent2].[DocumentID] AS [DocumentID], 
        [Extent2].[Status] AS [Status], 
        [Extent2].[DateCreated] AS [DateCreated]
        FROM [dbo].[DocumentStatusLogs] AS [Extent2]
        WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
    )  AS [Project2]
    ORDER BY [Project2].[ID] DESC) AS [Limit1]

Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and the DateCreated is set to the current date with each insert, then even without running with my query above you could actually get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated as this will provide an identical sort order and it's a faster sort.

天煞孤星 2025-02-18 02:01:31

该解决方案可用于获取每个分区的最新行(在示例中,n是1中的n在where语句和分区中是doc_id):

SELECT T.doc_id, T.status, T.date_created FROM 
(
    SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;

This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE statement and partition is doc_id):

SELECT T.doc_id, T.status, T.date_created FROM 
(
    SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;
素年丶 2025-02-18 02:01:31

从上面验证克林特的出色和正确答案:

下面的两个查询之间的性能很有趣。 52%是最重要的。 48%是第二个。使用独特而不是顺序提高性能的4%。但是订购的优势是按多列排序。

IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

CREATE TABLE #DocumentStatusLogs (
    [ID] int NOT NULL,
    [DocumentID] int NOT NULL,
    [Status] varchar(20),
    [DateCreated] datetime
)

INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

选项1:

    SELECT
    [Extent1].[ID], 
    [Extent1].[DocumentID],
    [Extent1].[Status], 
    [Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
    OUTER APPLY (
        SELECT TOP 1
            [Extent2].[ID], 
            [Extent2].[DocumentID],
            [Extent2].[Status], 
            [Extent2].[DateCreated]
        FROM #DocumentStatusLogs AS [Extent2]
        WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
        ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
    ) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

选项2:

SELECT 
    [Limit1].[DocumentID] AS [ID], 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
FROM (
    SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
    OUTER APPLY  (
        SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (
            SELECT 
                [Extent2].[ID] AS [ID], 
                [Extent2].[DocumentID] AS [DocumentID], 
                [Extent2].[Status] AS [Status], 
                [Extent2].[DateCreated] AS [DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC
    ) AS [Limit1]

在Microsoft SQL Server Management Studio中:突出显示并运行第一个块后,突出显示了选项1和选项2,右键单击 - &gt; [显示估计的执行计划]。然后运行整个内容以查看结果。

选项1结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

选项2结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

注意:

当我希望加入为1到1(其中1个)时,我倾向于使用应用。

如果我希望加入是一对一对的,或者多对多。

我避免使用row_number(),除非我需要做一些高级的事情,并且可以对窗口的性能处罚。

我还避免在Where或子句中的子征服中存在 /避免,因为我经历了这件事引起了一些可怕的执行计划。但是里程有所不同。查看执行计划和个人资料绩效在需要的地方和何处!

Verifying Clint's awesome and correct answer from above:

The performance between the two queries below is interesting. 52% being the top one. And 48% being the second one. A 4% improvement in performance using DISTINCT instead of ORDER BY. But ORDER BY has the advantage to sort by multiple columns.

IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

CREATE TABLE #DocumentStatusLogs (
    [ID] int NOT NULL,
    [DocumentID] int NOT NULL,
    [Status] varchar(20),
    [DateCreated] datetime
)

INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

Option 1:

    SELECT
    [Extent1].[ID], 
    [Extent1].[DocumentID],
    [Extent1].[Status], 
    [Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
    OUTER APPLY (
        SELECT TOP 1
            [Extent2].[ID], 
            [Extent2].[DocumentID],
            [Extent2].[Status], 
            [Extent2].[DateCreated]
        FROM #DocumentStatusLogs AS [Extent2]
        WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
        ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
    ) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

Option 2:

SELECT 
    [Limit1].[DocumentID] AS [ID], 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
FROM (
    SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
    OUTER APPLY  (
        SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (
            SELECT 
                [Extent2].[ID] AS [ID], 
                [Extent2].[DocumentID] AS [DocumentID], 
                [Extent2].[Status] AS [Status], 
                [Extent2].[DateCreated] AS [DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC
    ) AS [Limit1]

In Microsoft SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.

Option 1 Results:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

Option 2 Results:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

Note:

I tend to use APPLY when I want a join to be 1-to-(1 of many).

I use a JOIN if I want the join to be 1-to-many, or many-to-many.

I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.

I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!

◇流星雨 2025-02-18 02:01:31

我的代码从每个组中选择顶级1

select a.* from #DocumentStatusLogs a where 
 datecreated in( select top 1 datecreated from #DocumentStatusLogs b
where 
a.documentid = b.documentid
order by datecreated desc
)

My code to select top 1 from each group

select a.* from #DocumentStatusLogs a where 
 datecreated in( select top 1 datecreated from #DocumentStatusLogs b
where 
a.documentid = b.documentid
order by datecreated desc
)
岁吢 2025-02-18 02:01:31

交叉应用是我用于解决方案的方法,因为它对我有效,也适合我的客户。从我阅读的内容来看,如果它们的数据库大幅增长,应该提供最佳的整体性能。

CROSS APPLY was the method I used for my solution, as it worked for me, and for my clients needs. And from what I've read, should provide the best overall performance should their database grow substantially.

謌踐踏愛綪 2025-02-18 02:01:31
SELECT o.*
FROM `DocumentStatusLogs` o                   
  LEFT JOIN `DocumentStatusLogs` b                   
  ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
 WHERE b.DocumentID is NULL ;

如果您只想通过DateCreated返回最近的文档订单,它将仅返回DocumentID的前1个文档

SELECT o.*
FROM `DocumentStatusLogs` o                   
  LEFT JOIN `DocumentStatusLogs` b                   
  ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
 WHERE b.DocumentID is NULL ;

If you want to return only recent document order by DateCreated, it will return only top 1 document by DocumentID

身边 2025-02-18 02:01:31

在您要避免使用row_count()的方案中,您也可以使用左JOIN:

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
left join DocumentStatusLogs filter 
    ON ds.DocumentID = filter.DocumentID
    -- Match any row that has another row that was created after it.
    AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched 
where filter.DocumentID is null 

不在子查询”,通常将其编译为与左JOIN相同的输出:

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
    SELECT filter.ID 
    FROM DocumentStatusLogs filter
    WHERE ds.DocumentID = filter.DocumentID
        AND ds.DateCreated < filter.DateCreated)

对于示例架构,您也可以使用“ 如果表至少没有一个单列唯一键/约束/索引,在这种情况下,子查询模式将不起作用。

这两个查询往往比Row_count()查询(按查询分析仪衡量)更“昂贵”。但是,您可能会遇到更快返回结果或启用其他优化结果的方案。

In scenarios where you want to avoid using row_count(), you can also use a left join:

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
left join DocumentStatusLogs filter 
    ON ds.DocumentID = filter.DocumentID
    -- Match any row that has another row that was created after it.
    AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched 
where filter.DocumentID is null 

For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
    SELECT filter.ID 
    FROM DocumentStatusLogs filter
    WHERE ds.DocumentID = filter.DocumentID
        AND ds.DateCreated < filter.DateCreated)

Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".

Both of these queries tend to be more "expensive" than the row_count() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.

长伴 2025-02-18 02:01:31

尝试以下操作:

SELECT [DocumentID]
    ,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
    ,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
    SELECT [DocumentID]
        ,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ) AS [tmpQry]

Try this:

SELECT [DocumentID]
    ,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
    ,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
    SELECT [DocumentID]
        ,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ) AS [tmpQry]
迷离° 2025-02-18 02:01:31

一些数据库引擎*开始支持符合条件子句,该子句允许过滤窗口函数的结果(所接受的答案使用)。

因此,接受的答案可以变成

SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1

本文的深入说明:

您可以使用此工具来查看哪个数据库支持此子句: https://www.jooq.org/translate/
当目标方言不支持时,可以选择转换资格条款。

*Teradata,Bigquery,H2,雪花...

Some database engines* are starting to support the QUALIFY clause that allows to filter the result of window functions (which the accepted answer uses).

So the accepted answer can become

SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1

See this article for an in depth explanation: https://jrandrews.net/the-joy-of-qualify

You can use this tool to see which database support this clause: https://www.jooq.org/translate/
There is an option to transform the qualify clause when the target dialect does not support it.

*Teradata, BigQuery, H2, Snowflake...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文