如何有效地计算按另一列排序的一列的 MAX？

发布于 2024-09-07 09:42:54 字数 1394 浏览 9 评论 0原文

我有一个类似于以下内容的表模式（简化的）：

CREATE TABLE Transactions
(
    TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    CustomerID int NOT NULL,  -- Foreign key, not shown
    TransactionDate datetime NOT NULL,
    ...
)

CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)

为了在这里提供一些背景知识，这个事务表实际上整合了来自另一个供应商数据库的几种不同类型的事务（我们将其称为 ETL 过程），因此我对它们的插入顺序没有太多的控制。即使我这样做了，交易也可能会回溯，因此这里需要注意的重要一点是，任何给定客户的最大 TransactionID 不一定是最近的交易.

事实上，最近的交易是日期和 ID 的组合。日期不是唯一的 - 供应商经常截断一天中的时间 - 因此为了获取最新的交易，我必须首先找到最近的日期，然后找到该日期的最新 ID。

我知道我可以使用窗口查询（ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC)）来完成此操作，但这需要完整的索引扫描和非常昂贵的排序，因此会失败就效率而言，惨不忍睹。一直写下去也很尴尬。

效率稍高的方法是使用两个 CTE 或嵌套子查询，一个用于查找每个 CustomerID 的 MAX(TransactionDate)，另一个用于查找 MAX(TransactionID)代码>.同样，它可以工作，但需要第二次聚合和连接，这比 ROW_NUMBER() 查询稍好，但在性能方面仍然相当痛苦。

我还考虑过使用 CLR 用户定义的聚合，并且在必要时会依靠它，但如果可能的话，我更愿意找到一个纯 SQL 解决方案来简化部署（在本例中的其他任何地方都不需要 SQL-CLR）项目）。

所以问题具体是：

是否可以编写一个查询，该查询将返回每个 CustomerID 的最新 TransactionID，定义为最近 TransactionDate 的最大 TransactionID，并实现与普通 MAX/GROUP BY 性能相当的计划查询？

（换句话说，计划中唯一重要的步骤应该是索引扫描和流聚合。多次扫描、排序、连接等可能会太慢。）

原文

I have a table schema similar to the following (simplified):

CREATE TABLE Transactions
(
    TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    CustomerID int NOT NULL,  -- Foreign key, not shown
    TransactionDate datetime NOT NULL,
    ...
)

CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)

To give a bit of background here, this transaction table is actually consolidating several different types of transactions from another vendor's database (we'll call it an ETL process), and I therefore don't have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID for any given customer is not necessarily the most recent transaction.

In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique - the vendor often truncates the time of day - so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.

I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC)), but this requires a full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It's also pretty awkward to keep writing all the time.

Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate) per CustomerID, and another to find the MAX(TransactionID). Again, it works, but requires a second aggregate and join, which is slightly better than the ROW_NUMBER() query but still rather painful performance-wise.

I've also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I'd prefer to find a pure SQL solution if possible to simplify the deployment (there's no need for SQL-CLR anywhere else in this project).

So the question, specifically is:

Is it possible to write a query that will return the newest TransactionID per CustomerID, defined as the maximum TransactionID for the most recent TransactionDate, and achieve a plan equivalent in performance to an ordinary MAX/GROUP BY query?

(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

〗斷ホ乔殘χμё〖 2024-09-14 09:42:54

最有用的索引可能是：

CustomerID, TransactionDate desc, TransactionId desc

然后你可以尝试这样的查询：

select  a.CustomerID
,       b.TransactionID
from    (
        select  distinct
                CustomerID
        from    YourTable
        ) a
cross apply   
        (
        select  top 1
                TransactionID
        from    YourTable
        where   CustomerID = a.CustomerID
        order by
                TransactionDate desc,
                TransactionId desc
        ) b

The most useful index might be:

CustomerID, TransactionDate desc, TransactionId desc

Then you could try a query like this:

select  a.CustomerID
,       b.TransactionID
from    (
        select  distinct
                CustomerID
        from    YourTable
        ) a
cross apply   
        (
        select  top 1
                TransactionID
        from    YourTable
        where   CustomerID = a.CustomerID
        order by
                TransactionDate desc,
                TransactionId desc
        ) b

回复收藏 0 原文

倒数 2024-09-14 09:42:54

像这样强制优化器首先计算派生表怎么样？在我的测试中，这比两个 Max 比较便宜。

Select T.CustomerId, T.TransactionDate, Max(TransactionId)
From Transactions As T
    Join    (
            Select T1.CustomerID, Max(T1.TransactionDate) As MaxDate
            From Transactions As T1
            Group By T1.CustomerId
            ) As Z
        On Z.CustomerId = T.CustomerId
            And Z.MaxDate = T.TransactionDate
Group By T.CustomerId, T.TransactionDate

How about something like this where you force the optimizer to calculate the derived table first. In my tests, this was less expensive than the two Max comparisons.

Select T.CustomerId, T.TransactionDate, Max(TransactionId)
From Transactions As T
    Join    (
            Select T1.CustomerID, Max(T1.TransactionDate) As MaxDate
            From Transactions As T1
            Group By T1.CustomerId
            ) As Z
        On Z.CustomerId = T.CustomerId
            And Z.MaxDate = T.TransactionDate
Group By T.CustomerId, T.TransactionDate

回复收藏 0 原文

莳間冲淡了誓言ζ 2024-09-14 09:42:54

免责声明：大声思考 :)

您是否可以有一个索引计算列，将 TransactionDate 和 TransactionID 列组合成一种形式，这意味着查找最新交易只是查找该单个字段的 MAX 的情况？

回复收藏 0 原文

寄与心 2024-09-14 09:42:54

这似乎有很好的性能统计数据：

SELECT
    T1.customer_id,
    MAX(T1.transaction_id) AS transaction_id
FROM
    dbo.Transactions T1
INNER JOIN
(
    SELECT
        T2.customer_id,
        MAX(T2.transaction_date) AS max_dt
    FROM
        dbo.Transactions T2
    GROUP BY
        T2.customer_id
) SQ1 ON
    SQ1.customer_id = T1.customer_id AND
    T1.transaction_date = SQ1.max_dt
GROUP BY
    T1.customer_id

This one seemed to have good performance statistics:

SELECT
    T1.customer_id,
    MAX(T1.transaction_id) AS transaction_id
FROM
    dbo.Transactions T1
INNER JOIN
(
    SELECT
        T2.customer_id,
        MAX(T2.transaction_date) AS max_dt
    FROM
        dbo.Transactions T2
    GROUP BY
        T2.customer_id
) SQ1 ON
    SQ1.customer_id = T1.customer_id AND
    T1.transaction_date = SQ1.max_dt
GROUP BY
    T1.customer_id

回复收藏 0 原文

小姐丶请自重 2024-09-14 09:42:54

我想我实际上已经弄清楚了。 @Ada 有正确的想法，我自己也有同样的想法，但被困在如何形成单个复合 ID 并避免额外的连接上。

由于日期和（正）整数都是按字节排序的，因此它们不仅可以连接成 BLOB 进行聚合，而且还可以在聚合完成后将其分开。

这感觉有点不神圣，但它似乎起到了作用：

SELECT
    CustomerID,
    CAST(SUBSTRING(MAX(
        CAST(TransactionDate AS binary(8)) + 
        CAST(TransactionID AS binary(4))),
      9, 4) AS int) AS TransactionID
FROM Transactions
GROUP BY CustomerID

这给了我一个单一的索引扫描和流聚合。也不需要任何额外的索引，它的执行与 MAX(TransactionID) 相同 - 显然，这是有道理的，因为所有串联都发生在聚合本身内部。

I think I actually figured it out. @Ada had the right idea and I had the same idea myself, but was stuck on how to form a single composite ID and avoid the extra join.

Since both dates and (positive) integers are byte-ordered, they can not only be concatenated into a BLOB for aggregation but also separated after the aggregate is done.

This feels a little unholy, but it seems to do the trick:

SELECT
    CustomerID,
    CAST(SUBSTRING(MAX(
        CAST(TransactionDate AS binary(8)) + 
        CAST(TransactionID AS binary(4))),
      9, 4) AS int) AS TransactionID
FROM Transactions
GROUP BY CustomerID

That gives me a single index scan and stream aggregate. No need for any additional indexes either, it performs the same as just doing MAX(TransactionID) - which makes sense, obviously, since all of the concatenation is happening inside the aggregate itself.

回复收藏 0 原文

~没有更多了~