Should SQL ranking functionality be considered "use with caution"?
This question originates from a discussion on whether or not to use SQL ranking functionality in a particular case.
Any common RDBMS includes some ranking functionality, i.e. its query language has elements like TOP n ... ORDER BY key, ROW_NUMBER() OVER (ORDER BY key), or ORDER BY key LIMIT n (overview).
They do a great job of improving performance if you want to present only a small chunk out of a huge number of records. But they also introduce a major pitfall: if key is not unique, the results are non-deterministic. Consider the following example:
users
user_id  name
1        John
2        Paul
3        George
4        Ringo

logins
login_id  user_id  login_date
1         4        2009-08-17
2         1        2009-08-18
3         2        2009-08-19
4         3        2009-08-20
A query is supposed to return the person who logged in last:
SELECT TOP 1 users.*
FROM
logins JOIN
users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC
Just as expected, George is returned and everything looks fine. But then a new record is inserted into the logins table:
login_id  user_id  login_date
1         4        2009-08-17
2         1        2009-08-18
3         2        2009-08-19
4         3        2009-08-20
5         4        2009-08-20
What does the query above return now? Ringo? George? You can't tell. As far as I remember, MySQL 4.1, for example, returns the first physically created record that matches the criteria, i.e. the result would be George. But this may vary from version to version and from DBMS to DBMS. What should have been returned? One might say Ringo, since he apparently logged in last, but this is pure interpretation: the table records only dates, not times. In my opinion both should have been returned, because you can't decide unambiguously from the data available.
So this query matches the requirements:
SELECT users.*
FROM
logins JOIN
users ON
logins.user_id = users.user_id AND
logins.login_date = (
SELECT max(logins.login_date)
FROM
logins JOIN
users ON logins.user_id = users.user_id)
As an alternative, some DBMSs provide special functions for this very purpose: Microsoft SQL Server, for example, offers TOP n WITH TIES ... ORDER BY key (suggested by gbn) and, as of version 2005, RANK and DENSE_RANK.
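As a rough sketch of the RANK alternative, the query below returns every user tied for the latest login date. It assumes a DBMS with window functions (e.g. SQL Server 2005+ or PostgreSQL 8.4+); the aliases ranked and login_rank are made up for illustration.

```sql
-- Return every user whose login shares the latest login_date.
-- RANK() assigns the same rank to tied rows, so with the five-row
-- logins table above both George and Ringo receive rank 1.
SELECT ranked.user_id, ranked.name
FROM (
    SELECT users.user_id,
           users.name,
           RANK() OVER (ORDER BY logins.login_date DESC) AS login_rank
    FROM logins
    JOIN users ON logins.user_id = users.user_id
) AS ranked          -- "ranked" and "login_rank" are illustrative names
WHERE ranked.login_rank = 1;
```

Unlike TOP 1, the number of rows returned here depends on the data, so the caller has to be prepared for more than one row.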
If you search SO for e.g. ROW_NUMBER, you'll find numerous solutions that suggest using ranking functionality but fail to point out the possible problems.
Question: What advice should be given if a solution that includes ranking functionality is proposed?
Comments (5)
rank and row_number are fantastic functions that should be used more liberally, IMO. Folks just don't know about them. That being said, you need to make sure what you're ranking by is unique. Have a backup plan for duplicates (esp. dates). The data you get back is only as good as the data you put in.
I think the pitfalls here are exactly the same as in any query that sorts on a non-unique column.
You need to be aware of what you're ordering on and ensure that there is some way to always have a winner. If not, you get two (potentially) random rows with the max date.
Also, for the record, SQL Server does not store rows in the physical order in which they are inserted. It stores records on 8 KB pages and orders those pages according to the clustered index on the table. Therefore, there is absolutely no guarantee of order in SQL Server without an explicit ORDER BY.
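One possible "backup plan for duplicates" is a second sort key inside the window's ORDER BY. The sketch below assumes that login_id in the question's logins table is unique and that a higher login_id should win a tie; both are assumptions for illustration, as are the aliases numbered and rn.

```sql
-- Always exactly one winner: ties on login_date are broken by login_id.
-- Assumption: login_id is unique, so no two rows compare as equal.
SELECT numbered.user_id, numbered.name
FROM (
    SELECT users.user_id,
           users.name,
           ROW_NUMBER() OVER (
               ORDER BY logins.login_date DESC,
                        logins.login_id   DESC   -- tie-breaker
           ) AS rn
    FROM logins
    JOIN users ON logins.user_id = users.user_id
) AS numbered        -- "numbered" and "rn" are illustrative names
WHERE numbered.rn = 1;
```

Whether the higher login_id really identifies the later login is a modelling decision; the tie-breaker only guarantees that the same row is returned on every run.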
Use the WITH TIES clause in your example above
Use DENSE_RANK as you mentioned
Avoid getting into this position in the first place
Example: store the time too (datetime) and accept the very low risk of a duplicate within the same 3.33 millisecond instant (SQL Server 2008 is different: its datetime2 type has finer precision)
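The first suggestion, applied to the query from the question, might look like the following sketch. WITH TIES is SQL Server syntax and is not available in every DBMS.

```sql
-- SQL Server: return the top row plus every row that ties with it
-- on the ORDER BY expression (here: login_date).
SELECT TOP 1 WITH TIES users.*
FROM logins
JOIN users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC;
```

With the five-row logins table from the question, this would return both George and Ringo, since their latest logins share the date 2009-08-20.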
Every database engine uses some kind of a row identifier so that it can distinguish between two rows.
These identifiers are:
- Row pointer in MyISAM
- Primary key in an InnoDB table with a PRIMARY KEY defined
- Uniquifier in an InnoDB table without a PRIMARY KEY defined
- RID in SQL Server's heap table
- Primary key in SQL Server's table clustered on a PRIMARY/UNIQUE KEY
- Index key + uniquifier in SQL Server's table clustered on a non-unique key
- ROWID/UROWID in Oracle
- CTID in PostgreSQL
You don't have immediate access to the following ones:
- Row pointer in MyISAM
- Uniquifier in an InnoDB table without a PRIMARY KEY defined
- RID in SQL Server's heap table
- Index key + uniquifier in SQL Server's table clustered on a non-unique key
Besides, you don't have control over the following ones (they can change on updates or when restoring from backups):
- ROWID/UROWID in Oracle
- CTID in PostgreSQL
If two rows are identical in these tables, that means they should be identical from the application's point of view.
They return exactly the same results, so the full set of columns can be treated as the ultimate uniquifier.
This just means you should always include some kind of uniquifier you have full control over in the ordering clause to keep your ordering consistent.
If your table has a primary or unique key (even a composite one), include it in the ordering condition.
Otherwise, include all columns in the ordering condition.
The latter condition will always return any one of the otherwise indistinguishable rows, but since they're indistinguishable anyway, it will look consistent from your application's point of view.
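The two ordering rules above could be sketched as follows. The first query uses the logins table from the question and assumes login_id is its primary key; the second uses a hypothetical keyless table login_events(user_id, login_date), invented here purely for illustration.

```sql
-- Table with a primary key: append the key as the last sort column,
-- so rows with equal login_date always come back in the same order.
SELECT login_id, user_id, login_date
FROM logins
ORDER BY login_date DESC, login_id;

-- Table without any key (login_events is hypothetical):
-- sort on every column. Rows that still tie are identical in all
-- columns, so their relative order makes no visible difference.
SELECT user_id, login_date
FROM login_events
ORDER BY login_date DESC, user_id;
```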
That, by the way, is another good reason for always having a PRIMARY KEY in your tables.
But do not rely on ROWID/CTID to order rows. It can easily change on UPDATE, so your result order will not be stable anymore.
ROW_NUMBER is a fantastic tool indeed. If misused it can provide non-deterministic results, but so will the other SQL functions. You can have ORDER BY return non-deterministic results as well.
Just know what you are doing.
This is the summary:
- Do you expect n rows exactly, or do you expect a possibly varying number of rows that fulfill a constraint? Reconsider your design. If you're expecting n rows exactly, your model might be designed poorly if it's impossible to identify a row unambiguously. If you expect a possibly varying number of rows, you might need to adjust your UI in order to present your query results.
- Add columns to key that make it unique (e.g. the PK). You at least regain control over the returned result. There is almost always a way to do this, as Quassnoi pointed out.
- Consider RANK, DENSE_RANK and TOP n WITH TIES. RANK and DENSE_RANK are available in Microsoft SQL Server as of version 2005 and in PostgreSQL from 8.4 onwards; TOP n WITH TIES is SQL Server syntax. If these functions are not available, consider using nested queries with aggregation instead of ranking functions.