如何有效地计算按另一列排序的一列的 MAX?
我有一个类似于以下内容的表模式(简化的):
CREATE TABLE Transactions
(
TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
CustomerID int NOT NULL, -- Foreign key, not shown
TransactionDate datetime NOT NULL,
...
)
CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)
为了在这里提供一些背景知识,这个事务表实际上整合了来自另一个供应商数据库的几种不同类型的事务(我们将其称为 ETL 过程),因此我对它们的插入顺序没有太多的控制。即使我这样做了,交易也可能会回溯,因此这里需要注意的重要一点是,任何给定客户的最大 TransactionID
不一定是最近的交易.
事实上,最近的交易是日期和 ID 的组合。日期不是唯一的 - 供应商经常截断一天中的时间 - 因此为了获取最新的交易,我必须首先找到最近的日期,然后找到该日期的最新 ID。
我知道我可以使用窗口查询(ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC))来完成此操作,但这需要完整的索引扫描和非常昂贵的排序,因此会失败就效率而言,惨不忍睹。一直写下去也很尴尬。
效率稍高的方法是使用两个 CTE 或嵌套子查询,一个用于查找每个 CustomerID
的 MAX(TransactionDate)
,另一个用于查找 MAX(TransactionID)
代码>.同样,它可以工作,但需要第二次聚合和连接,这比 ROW_NUMBER() 查询稍好,但在性能方面仍然相当痛苦。
我还考虑过使用 CLR 用户定义的聚合,并且在必要时会依靠它,但如果可能的话,我更愿意找到一个纯 SQL 解决方案来简化部署(在本例中的其他任何地方都不需要 SQL-CLR)项目)。
所以问题具体是:
是否可以编写一个查询,该查询将返回每个 CustomerID
的最新 TransactionID
,定义为最近 TransactionDate
的最大 TransactionID
,并实现与普通 MAX
/GROUP BY
性能相当的计划查询?
(换句话说,计划中唯一重要的步骤应该是索引扫描和流聚合。多次扫描、排序、连接等可能会太慢。)
I have a table schema similar to the following (simplified):
CREATE TABLE Transactions
(
TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
CustomerID int NOT NULL, -- Foreign key, not shown
TransactionDate datetime NOT NULL,
...
)
CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)
To give a bit of background here, this transaction table is actually consolidating several different types of transactions from another vendor's database (we'll call it an ETL process), and I therefore don't have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID
for any given customer
is not necessarily the most recent transaction.
In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique - the vendor often truncates the time of day - so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.
I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC)
), but this requires a full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It's also pretty awkward to keep writing all the time.
Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate)
per CustomerID
, and another to find the MAX(TransactionID)
. Again, it works, but requires a second aggregate and join, which is slightly better than the ROW_NUMBER()
query but still rather painful performance-wise.
I've also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I'd prefer to find a pure SQL solution if possible to simplify the deployment (there's no need for SQL-CLR anywhere else in this project).
So the question, specifically is:
Is it possible to write a query that will return the newest TransactionID
per CustomerID
, defined as the maximum TransactionID
for the most recent TransactionDate
, and achieve a plan equivalent in performance to an ordinary MAX
/GROUP BY
query?
(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
最有用的索引可能是:
然后你可以尝试这样的查询:
The most useful index might be:
Then you could try a query like this:
像这样强制优化器首先计算派生表怎么样?在我的测试中,这比两个 Max 比较便宜。
How about something like this where you force the optimizer to calculate the derived table first. In my tests, this was less expensive than the two Max comparisons.
免责声明:大声思考 :)
您是否可以有一个索引计算列,将 TransactionDate 和 TransactionID 列组合成一种形式,这意味着查找最新交易只是查找该单个字段的 MAX 的情况?
Disclaimer: Thinking out loud :)
Could you have an indexed, computed column that combines the TransactionDate and TransactionID columns into a form that means finding the latest transaction is just a case of finding the MAX of that single field?
这似乎有很好的性能统计数据:
This one seemed to have good performance statistics:
我想我实际上已经弄清楚了。 @Ada 有正确的想法,我自己也有同样的想法,但被困在如何形成单个复合 ID 并避免额外的连接上。
由于日期和(正)整数都是按字节排序的,因此它们不仅可以连接成 BLOB 进行聚合,而且还可以在聚合完成后将其分开。
这感觉有点不神圣,但它似乎起到了作用:
这给了我一个单一的索引扫描和流聚合。也不需要任何额外的索引,它的执行与 MAX(TransactionID) 相同 - 显然,这是有道理的,因为所有串联都发生在聚合本身内部。
I think I actually figured it out. @Ada had the right idea and I had the same idea myself, but was stuck on how to form a single composite ID and avoid the extra join.
Since both dates and (positive) integers are byte-ordered, they can not only be concatenated into a BLOB for aggregation but also separated after the aggregate is done.
This feels a little unholy, but it seems to do the trick:
That gives me a single index scan and stream aggregate. No need for any additional indexes either, it performs the same as just doing
MAX(TransactionID)
- which makes sense, obviously, since all of the concatenation is happening inside the aggregate itself.