我有一个表,我想为每个组获得最新条目。这是表:
document statuslogs
表
ID |
documentID |
状态 |
Datecreated |
2 |
1 |
S1 |
7/29/2011 |
3 |
1 |
S2 |
7/30/2011 |
6 |
1 |
S1 |
8/02/2011 |
1 |
2 |
S1 |
7/28/2011 |
4 |
2 |
S2 |
7/30/2011 |
5 |
2 |
S3 |
8/01/2011 |
6 |
3 |
S1 |
8/02/2011 |
该表将由 docuct> document> documentID
,并由 datecreated
在下降时进行分类。命令。对于每个 documentID
,我想获得最新的状态。
我首选的输出:
DocumentID |
状态 |
DateCreated |
1 |
S1 |
8/02/2011 |
2 |
S3 |
8/01/2011 |
3 |
S1 |
8/02/2011 |
-
是否有任何聚合功能可以从每个组中获得顶部?请参阅pseudo-code getonlythetop
下面:
选择
DocumentID,
getonlythetop(状态),
getonlythetop(dateCreated)
来自DocumentStatusLogs
documentID的组
通过DateCreated Desc订购
-
如果不存在此类功能,我有什么办法可以实现我想要的输出?
-
或首先,这可能是由未正常的数据库引起的吗?我在想,既然我要寻找的只是一行,那么状态
也应该位于父表中吗?
表以获取更多信息:
当前文档
表
documentID |
标题 |
... |
dateCreated datecreated |
1 |
titlea |
... |
... |
2 |
titleb |
... |
... |
3 |
titlec |
... |
... |
请参阅父 表是这样,以便我可以轻松地访问其状态?
DocumentID |
标题 |
内容 |
DateCreated |
CurrentStatus |
1 |
Titlea |
... |
... |
S1 |
2 |
标题B |
S3 |
S1 |
3 |
Titlec |
... |
... |
... |
|
更新
我刚刚学会了如何使用“应用”,这使得更容易解决此类问题。
I have a table which I want to get the latest entry for each group. Here's the table:
DocumentStatusLogs
Table
ID |
DocumentID |
Status |
DateCreated |
2 |
1 |
S1 |
7/29/2011 |
3 |
1 |
S2 |
7/30/2011 |
6 |
1 |
S1 |
8/02/2011 |
1 |
2 |
S1 |
7/28/2011 |
4 |
2 |
S2 |
7/30/2011 |
5 |
2 |
S3 |
8/01/2011 |
6 |
3 |
S1 |
8/02/2011 |
The table will be grouped by DocumentID
and sorted by DateCreated
in descending order. For each DocumentID
, I want to get the latest status.
My preferred output:
DocumentID |
Status |
DateCreated |
1 |
S1 |
8/02/2011 |
2 |
S3 |
8/01/2011 |
3 |
S1 |
8/02/2011 |
-
Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop
below:
SELECT
DocumentID,
GetOnlyTheTop(Status),
GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC
-
If such function doesn't exist, is there any way I can achieve the output I want?
-
Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status
also be located in the parent table?
Please see the parent table for more information:
Current Documents
Table
DocumentID |
Title |
Content |
DateCreated |
1 |
TitleA |
... |
... |
2 |
TitleB |
... |
... |
3 |
TitleC |
... |
... |
Should the parent table be like this so that I can easily access its status?
DocumentID |
Title |
Content |
DateCreated |
CurrentStatus |
1 |
TitleA |
... |
... |
s1 |
2 |
TitleB |
... |
... |
s3 |
3 |
TitleC |
... |
... |
s1 |
UPDATE
I just learned how to use "apply" which makes it easier to address such problems.
发布评论
评论(17)
如果您期望每天2个条目,那么这将任意选择一个。要获得这两个条目一天,请使用dense_rank代替row_number。
至于归一化,是否要:是否要:
如它所处的状态,您保留了状态历史记录。如果您也希望在父表中(这是非正式化)中的最新状态,则需要一个触发器来维持父母的状态。或丢弃此状态历史记录表。
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead of ROW_NUMBER.
As for normalised or not, it depends if you want to:
As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.
我刚刚学会了如何使用
交叉应用
。这是在这种情况下使用它的方法:I just learned how to use
cross apply
. Here's how to use it in this scenario:我知道这是一个旧的线程,但是
带有领带的顶级1
解决方案非常好,可能有助于阅读解决方案。选择带绑带的顶部1
子句告诉SQL Server您要返回每个组的第一行。但是SQL Server如何知道如何将数据分组?这是row_number()上的订单(通过DateCreated desc
通过DocumentID订单进行分区的分区。 根据列进行排序。在每个组中,将 href =“ https://learn.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql” rel =“ noreferrer”>在这里。
I know this is an old thread but the
TOP 1 WITH TIES
solutions is quite nice and might be helpful to some reading through the solutions.The
select top 1 with ties
clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group up the data? This is where theorder by row_number() over (partition by DocumentID order by DateCreated desc
comes in. The column/columns afterpartition by
defines how SQL Server groups up the data. Within each group, the rows will be sorted based on theorder by
columns. Once sorted, the top row in each group will be returned in the query.More about the TOP clause can be found here.
我在这里对各种建议进行了一些时间,结果确实取决于所涉及的表的大小,但是最一致的解决方案是使用交叉应用这些测试,这些测试是针对SQL Server 2008-R2,使用带有的表格6,500个记录,另一个(相同的模式),有1.37亿记录。要查询的列是表格上主要键的一部分,并且表宽度很小(约30个字节)。 SQL Server从实际执行计划报告了时间。
我认为,真正令人惊讶的是,无论涉及的行数量如何,十字架的时间都有多大的一致性。
I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using the CROSS APPLY These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.
I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.
如果您担心性能,也可以使用
max()
:row_number()
需要您选择
中的所有行代码>语句,而max
没有。应该大大加快查询。If you're worried about performance, you can also do this with
MAX()
:ROW_NUMBER()
requires a sort of all the rows in yourSELECT
statement, whereasMAX
does not. Should drastically speed up your query.这是关于该主题的最容易找到的问题之一,因此我想对IT给出现代答案(既供我的参考和帮助他人)。通过使用
可以简要介绍上述查询:
first_value
和这应该在SQL Server 2012及以上。
first_value
可以将其视为完成在使用
子句时选择时选择顶部1
的方法。Over
允许在选择列表中进行分组,因此不要编写嵌套的子查询(就像许多现有答案一样),而是以更可读性的方式做到这一点。This is one of the most easily found question on the topic, so I wanted to give a modern answer to the it (both for my reference and to help others out). By using
first_value
andover
you can make short work of the above query:This should work in SQL Server 2012 and up.
First_value
can be thought of as a way to accomplishSelect Top 1
when using anover
clause.Over
allows grouping in the select list so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion.什么数据库服务器?该代码对所有代码都不适用。
关于您的问题的后半部分,对我来说似乎是合理的。您可以将
DocumentStatusLogs
作为日志,但仍将最新信息存储在主表中。顺便说一句,如果您已经在文档表中拥有
dateCreated
列,则可以使用该表(只要dateCreated
,代码> documentStatusLogs )。编辑:MSSQL不支持使用,因此将其更改为:
What database server? This code doesn't work on all of them.
Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave
DocumentStatusLogs
as a log, but still store the latest info in the main table.BTW, if you already have the
DateCreated
column in the Documents table you can just joinDocumentStatusLogs
using that (as long asDateCreated
is unique inDocumentStatusLogs
).Edit: MsSQL does not support USING, so change it to:
这是三种单独解决问题的方法,以及每个查询的最佳索引选择(请自己尝试索引,查看逻辑上的阅读,经过的时间,执行计划。我从经验上提供了建议此类查询无执行此特定问题)。
方法1 :使用row_number()。如果Rowstore索引无法提高性能,则可以尝试使用非集合/聚类的列店索引,例如具有聚合和分组的查询以及所有时间都在不同列中订购的表,Columnstore Index通常是最佳选择。
方法2 :使用first_value。如果Rowstore索引无法提高性能,则可以尝试使用非集合/聚类的列店索引,例如具有聚合和分组的查询以及所有时间都在不同列中订购的表,Columnstore Index通常是最佳选择。
方法3 :使用交叉应用。在覆盖查询中使用的列的DocumentStatusLogs表上创建Rowstore索引应足以覆盖查询,而无需列仓库索引。
Here are 3 separate approaches to the problem in hand along with the best choices of indexing for each of those queries (please try out the indexes yourselves and see the logical read, elapsed time, execution plan. I have provided the suggestions from my experience on such queries without executing for this specific problem).
Approach 1: Using ROW_NUMBER(). If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
Approach 2: Using FIRST_VALUE. If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
Approach 3: Using CROSS APPLY. Creating rowstore index on DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without need of a columnstore index.
这是一个古老的线程,但是我想我要把我的两分钱扔给公认的答案对我来说并不特别好。我在一个大数据集中尝试了GBN的解决方案,发现它非常慢(在SQL Server 2012中的500万加唱片中45秒; 45秒)。从执行计划中看,很明显,问题是它需要进行分类操作,从而大大减慢了事情。
这是我从实体框架中提出的替代方法,它不需要排序操作,并且进行了非群集索引搜索。这将执行时间降低到<上述记录集2秒。
现在,我假设原始问题中未完全指定的内容,但是如果您的表设计使您的ID列是自动插入ID,并且dateCreated用每个插入设置为当前日期,则不用上面的查询运行,您实际上可以从上订购ID而不是在DateCreated ,因此可以从GBN的解决方案(大约是执行时间的一半)获得可观的性能提升,因为这将提供相同的排序订单和排序订单和这是更快的。
This is quite an old thread, but I thought I'd throw my two cents in just the same as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan it's obvious that the issue is that it requires a SORT operation which slows things down significantly.
Here's an alternative that I lifted from the entity framework that needs no SORT operation and does a NON-Clustered Index search. This reduces the execution time down to < 2 seconds on the aforementioned record set.
Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and the DateCreated is set to the current date with each insert, then even without running with my query above you could actually get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated as this will provide an identical sort order and it's a faster sort.
该解决方案可用于获取每个分区的最新行(在示例中,n是1中的n在where语句和分区中是doc_id):
This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE statement and partition is doc_id):
从上面验证克林特的出色和正确答案:
下面的两个查询之间的性能很有趣。 52%是最重要的。 48%是第二个。使用独特而不是顺序提高性能的4%。但是订购的优势是按多列排序。
选项1:
选项2:
在Microsoft SQL Server Management Studio中:突出显示并运行第一个块后,突出显示了选项1和选项2,右键单击 - &gt; [显示估计的执行计划]。然后运行整个内容以查看结果。
选项1结果:
选项2结果:
注意:
我还避免在Where或子句中的子征服中存在 /避免,因为我经历了这件事引起了一些可怕的执行计划。但是里程有所不同。查看执行计划和个人资料绩效在需要的地方和何处!
Verifying Clint's awesome and correct answer from above:
The performance between the two queries below is interesting. 52% being the top one. And 48% being the second one. A 4% improvement in performance using DISTINCT instead of ORDER BY. But ORDER BY has the advantage to sort by multiple columns.
Option 1:
Option 2:
In Microsoft SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
Option 1 Results:
Option 2 Results:
Note:
I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!
我的代码从每个组中选择顶级1
My code to select top 1 from each group
交叉应用
是我用于解决方案的方法,因为它对我有效,也适合我的客户。从我阅读的内容来看,如果它们的数据库大幅增长,应该提供最佳的整体性能。CROSS APPLY
was the method I used for my solution, as it worked for me, and for my clients needs. And from what I've read, should provide the best overall performance should their database grow substantially.如果您只想通过DateCreated返回最近的文档订单,它将仅返回DocumentID的前1个文档
If you want to return only recent document order by DateCreated, it will return only top 1 document by DocumentID
在您要避免使用row_count()的方案中,您也可以使用左JOIN:
不在子查询”,通常将其编译为与左JOIN相同的输出:
对于示例架构,您也可以使用“ 如果表至少没有一个单列唯一键/约束/索引,在这种情况下,子查询模式将不起作用。
这两个查询往往比Row_count()查询(按查询分析仪衡量)更“昂贵”。但是,您可能会遇到更快返回结果或启用其他优化结果的方案。
In scenarios where you want to avoid using row_count(), you can also use a left join:
For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:
Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".
Both of these queries tend to be more "expensive" than the row_count() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.
尝试以下操作:
Try this:
一些数据库引擎*开始支持
符合条件
子句,该子句允许过滤窗口函数的结果(所接受的答案使用)。因此,接受的答案可以变成
本文的深入说明:
您可以使用此工具来查看哪个数据库支持此子句: https://www.jooq.org/translate/
当目标方言不支持时,可以选择转换资格条款。
*Teradata,Bigquery,H2,雪花...
Some database engines* are starting to support the
QUALIFY
clause that allows to filter the result of window functions (which the accepted answer uses).So the accepted answer can become
See this article for an in depth explanation: https://jrandrews.net/the-joy-of-qualify
You can use this tool to see which database support this clause: https://www.jooq.org/translate/
There is an option to transform the qualify clause when the target dialect does not support it.
*Teradata, BigQuery, H2, Snowflake...