Should SQL ranking functionality be considered "use with caution"?

Posted on 2024-08-02 13:57:34

This question originates from a discussion on whether to use SQL ranking functionality or not in a particular case.

Any common RDBMS includes some ranking functionality, i.e. its query language has elements like TOP n ... ORDER BY key, ROW_NUMBER() OVER (ORDER BY key), or ORDER BY key LIMIT n (overview).

They do a great job of increasing performance if you want to present only a small chunk out of a huge number of records. But they also introduce a major pitfall: if key is not unique, the results are non-deterministic. Consider the following example:

users

user_id name
1       John
2       Paul
3       George
4       Ringo

logins

login_id user_id login_date
1        4       2009-08-17
2        1       2009-08-18
3        2       2009-08-19
4        3       2009-08-20

A query is supposed to return the person who logged in last:

SELECT TOP 1 users.*
FROM
  logins JOIN
  users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC

Just as expected, George is returned and everything looks fine. But then a new record is inserted into the logins table:

1        4       2009-08-17
2        1       2009-08-18
3        2       2009-08-19
4        3       2009-08-20
5        4       2009-08-20

What does the query above return now? Ringo? George? You can't tell. As far as I remember, MySQL 4.1, for example, returns the first physically created record that matches the criteria, i.e. the result would be George. But this may vary from version to version and from DBMS to DBMS. What should have been returned? One might say Ringo, since he apparently logged in last, but this is pure interpretation. In my opinion both should have been returned, because you can't decide unambiguously from the data available.

So this query matches the requirements:

SELECT users.*
FROM
  logins JOIN
  users ON
    logins.user_id = users.user_id AND
    logins.login_date = (
      SELECT max(logins.login_date)
      FROM
        logins JOIN
        users ON logins.user_id = users.user_id)
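
To make the contrast concrete, here is a minimal sketch that runs both approaches against the example data. It assumes SQLite (via Python's standard sqlite3 module) as a stand-in DBMS, so TOP 1 is written as LIMIT 1, and the subquery is simplified to read from logins only; the table definitions and column types are assumptions, not part of the original example.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for "some RDBMS".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE logins (login_id INTEGER PRIMARY KEY, user_id INTEGER, login_date TEXT);
INSERT INTO users VALUES (1, 'John'), (2, 'Paul'), (3, 'George'), (4, 'Ringo');
INSERT INTO logins VALUES
  (1, 4, '2009-08-17'), (2, 1, '2009-08-18'), (3, 2, '2009-08-19'),
  (4, 3, '2009-08-20'), (5, 4, '2009-08-20');
""")

# Ranking-style query: always exactly one row, even though two rows tie.
top1 = conn.execute("""
    SELECT users.name
    FROM logins JOIN users ON logins.user_id = users.user_id
    ORDER BY logins.login_date DESC
    LIMIT 1
""").fetchall()

# Aggregate-style query: every user whose login_date equals the maximum.
all_latest = conn.execute("""
    SELECT users.name
    FROM logins JOIN users ON logins.user_id = users.user_id
    WHERE logins.login_date = (SELECT MAX(login_date) FROM logins)
    ORDER BY users.name
""").fetchall()

print(len(top1))    # one row; which of the tied users it is depends on the engine
print(all_latest)   # both tied users
```

The first query silently drops one of the two tied users, while the second returns George and Ringo together.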

As an alternative some DBMSs provide special functions (e.g. Microsoft SQL Server 2005 introduces TOP n WITH TIES ... ORDER BY key (suggested by gbn), RANK, and DENSE_RANK for this very purpose).

If you search SO for e.g. ROW_NUMBER, you'll find numerous solutions which suggest using ranking functionality and fail to point out the possible problems.

Question: What advice should be given if a solution that includes ranking functionality is proposed?

郁金香雨 2024-08-09 13:57:34

rank and row_number are fantastic functions that should be used more liberally, IMO. Folks just don't know about them.

That being said, you need to make sure what you're ranking by is unique. Have a backup plan for duplicates (esp. dates). The data you get back is only as good as the data you put in.

I think the pitfalls here are exactly the same as in this query:

select top 2 * from tblA order by date desc

You need to be aware of what you're ordering on and ensure that there is some way to always have a winner. If not, you get a (potentially) random two rows with the max date.
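
One way to check whether there is "always a winner" is to look for duplicates in the ordering column before relying on it. The following is a rough sketch, assuming SQLite through Python's sqlite3 module and reusing the logins table from the question; the schema is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (login_id INTEGER PRIMARY KEY, user_id INTEGER, login_date TEXT);
INSERT INTO logins VALUES
  (1, 4, '2009-08-17'), (2, 1, '2009-08-18'), (3, 2, '2009-08-19'),
  (4, 3, '2009-08-20'), (5, 4, '2009-08-20');
""")

# List every ordering-key value that occurs more than once.
# Any row returned here means ORDER BY login_date alone cannot pick a winner.
dupes = conn.execute("""
    SELECT login_date, COUNT(*) AS n
    FROM logins
    GROUP BY login_date
    HAVING COUNT(*) > 1
    ORDER BY login_date
""").fetchall()

print(dupes)
```

If this query returns any rows, the ordering needs an additional tie-breaking column, or the ties have to be handled explicitly.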

Also, for the record, SQL Server does not store rows in the physical order that they are inserted. It stores records on 8k pages and orders those pages in the most efficient way it can according to the clustered index on the table. Therefore, there is absolutely no guarantee of order in SQL Server.

萌吟 2024-08-09 13:57:34

Use the WITH TIES clause in your example above

SELECT TOP 1 WITH TIES users.*
FROM
  logins JOIN
  users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC

Use DENSE_RANK as you mentioned

Avoid getting into this position in the first place.
Example: store the time as well (datetime) and accept the very low risk of a very rare duplicate within the same 3.33 millisecond instant (SQL Server 2008 is different)
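
Where TOP n WITH TIES is not available, DENSE_RANK can play a similar role for the top-1 case. The sketch below assumes SQLite 3.25 or later (for window function support) driven from Python's sqlite3 module; the schema and the alias names ranked and rnk are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE logins (login_id INTEGER PRIMARY KEY, user_id INTEGER, login_date TEXT);
INSERT INTO users VALUES (1, 'John'), (2, 'Paul'), (3, 'George'), (4, 'Ringo');
INSERT INTO logins VALUES
  (1, 4, '2009-08-17'), (2, 1, '2009-08-18'), (3, 2, '2009-08-19'),
  (4, 3, '2009-08-20'), (5, 4, '2009-08-20');
""")

# Rows sharing the same login_date get the same rank,
# so filtering on rnk = 1 keeps every row tied for first place.
rows = conn.execute("""
    SELECT name
    FROM (
        SELECT users.name,
               DENSE_RANK() OVER (ORDER BY logins.login_date DESC) AS rnk
        FROM logins JOIN users ON logins.user_id = users.user_id
    ) AS ranked
    WHERE rnk = 1
    ORDER BY name
""").fetchall()

print(rows)
```

For the top-1 case this returns both tied users, which is the same result TOP 1 WITH TIES is meant to give.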

嘿哥们儿 2024-08-09 13:57:34

Every database engine uses some kind of a row identifier so that it can distinguish between two rows.

These identifiers are:

Row pointer in MyISAM
Primary key in InnoDB table with a PRIMARY KEY defined
Uniquifier in InnoDB table without a PRIMARY KEY defined
RID in SQL Server's heap table
Primary key in SQL Server's table clustered on PRIMARY/UNIQUE KEY
Index key + uniquifier in SQL Server's table clustered on a non-unique key
ROWID / UROWID in Oracle
CTID in PostgreSQL.

You don't have immediate access to the following ones:

Row pointer in MyISAM
Uniquifier in InnoDB table without a PRIMARY KEY defined
RID in SQL Server's heap table
Index key + uniquifier in SQL Server's table clustered on a non-unique key

Besides, you don't have control over the following ones:

ROWID / UROWID in Oracle
CTID in PostgreSQL.

(they can change on updates or restoring from backups)

If two rows are identical in these tables, that means they should be identical from the application's point of view.

They return exactly the same results and can be treated as an ultimate uniquifier.

This just means you should always include some kind of uniquifier you have full control over in the ordering clause to keep your ordering consistent.

If your table has a primary or unique key (even a composite one), include it in the ordering condition:

SELECT  *
FROM    mytable
ORDER BY
        ordering_column, pk

Otherwise, include all columns in the ordering condition:

SELECT  *
FROM    mytable
ORDER BY
        ordering_column, column1, ..., columnN

The latter condition will always return any of the otherwise indistinguishable rows, but since they're indistinguishable anyway, it will look consistent from your application's point of view.

That, by the way, is another good reason for always having a PRIMARY KEY in your tables.

But do not rely on ROWID / CTID to order rows.

It can easily change on UPDATE so your result order will not be stable anymore.
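
As a small illustration of the tie-breaker advice above, the following sketch orders by ordering_column and then by pk. It assumes SQLite through Python's sqlite3 module, and the contents of mytable are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (pk INTEGER PRIMARY KEY, ordering_column TEXT);
INSERT INTO mytable VALUES (1, 'b'), (2, 'a'), (3, 'a'), (4, 'c');
""")

# ordering_column alone has a tie ('a' appears twice);
# adding pk as the last sort key makes the order fully determined.
ordered = conn.execute("""
    SELECT pk, ordering_column
    FROM mytable
    ORDER BY ordering_column, pk
""").fetchall()

# With the tie-breaker in place, a top-1 query has exactly one possible answer.
first = conn.execute("""
    SELECT pk, ordering_column
    FROM mytable
    ORDER BY ordering_column, pk
    LIMIT 1
""").fetchone()

print(ordered)
print(first)
```

Because pk is unique, no two rows can compare equal under this ORDER BY, so the result no longer depends on how the engine happens to read the rows.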

浅语花开 2024-08-09 13:57:34

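ROW_NUMBER is a great tool indeed. If misused, it can produce non-deterministic results, but so can other SQL functions. You can have ORDER BY return non-deterministic results as well.

Just know what you are doing.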

固执像三岁 2024-08-09 13:57:34

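To summarize:

Use your head first. It should be obvious, but it is always a good place to start. Do you expect exactly n rows, or a possibly varying number of rows that satisfy the constraints? Reconsider your design. If you expect exactly n rows and cannot identify the rows unambiguously, your model may be poorly designed. If you expect a possibly varying number of rows, you may need to adjust your UI to present the query results.
Add columns to key that make it unique (e.g. the PK). You at least regain control over the returned result. There is almost always a way to do this.
Consider using possibly more suitable functions such as RANK, DENSE_RANK and TOP n WITH TIES. They are available in Microsoft SQL Server as of version 2005 and in PostgreSQL as of version 8.4. If these functions are not available, consider using nested queries with aggregates instead of ranking functions.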