MySQL递归CTE查询的性能缓慢

发布于 2025-02-06 19:47:57 字数 3969 浏览 2 评论 0原文

我想计算一组超过30,000,000行的大量分钟数量的9分钟指数移动平均线（EMA），大约有4,500个不同的股票。由于EMA的性质，我使用递归CTE的查询始终基于上一行（分钟）。

该代码有效，但这是问题所在。为了计算仅两张股票的EMA，总计构成了14,000行的分钟桌，查询占40分钟的时间。假设整个数据集的每行也一样快，则将MySQL-Server在60h和70h之间进行执行。

minute_data表的构建如下：

create table min_data
(
    t            datetime       not null,
    ticker       varchar(10)    not null,
    o            decimal(10, 4) not null,
    h            decimal(10, 4) not null,
    l            decimal(10, 4) not null,
    c            decimal(10, 4) not null,
    primary key (t, ticker)
);

我只使用以下列：

t = date＆amp;每次交易分钟
tricker的时间=股票符号（例如tesla - ＆gt; tsla）
c =每个交易的关闭价格

EMA计算：
ema =收盘价（当前分钟） * alpha + ema（前分钟） *（1-alpha）

SET GLOBAL cte_max_recursion_depth=1000000;

SET @alpha = 2 / (1 + 9);

CREATE TABLE min_data_EMA9 AS
WITH RECURSIVE t AS (
    SELECT t, ticker,
           row_number() over (partition by ticker order by t) as QuoteId,
           c
    FROM min_data
),

ema (t, ticker, QuoteId, c, EMA9) AS (
    SELECT *, avg(c) as EMA9
    FROM t
    WHERE QuoteId between 1 and 8
    GROUP BY ticker

    UNION ALL

    SELECT t2.t,
           t2.ticker,
           t2.QuoteId,
           t2.c,
           @alpha * t2.c + (1 - @alpha) * EMA9 as EMA9
    FROM ema
    JOIN t t2
        ON ema.QuoteId = t2.QuoteId - 1
        AND  ema.ticker = t2.ticker
)

SELECT t, ticker, QuoteId, EMA9
FROM ema;

在以以下方式限制第一个选择语句时 其中ticker ='tops' to一个单股用语句返回以下（executionTime = 18min 39s）

-> Table scan on ema  (cost=0.01..37822.72 rows=3025619) (actual time=0.002..4.562 rows=68471 loops=1)
    -> Materialize recursive CTE ema  (cost=1395569.12..1433391.84 rows=3025619) (actual time=1097987.024..1097996.206 rows=68471 loops=1)
        -> Table scan on <temporary>  (actual time=0.001..0.001 rows=1 loops=1)
            -> Aggregate using temporary table  (actual time=30081.632..30081.633 rows=1 loops=1)
                -> Filter: (t.QuoteId between 1 and 8)  (cost=1.01..306376.90 rows=302562) (actual time=30069.843..30081.576 rows=8 loops=1)
                    -> Table scan on t  (cost=2.50..2.50 rows=0) (actual time=0.001..3.723 rows=68471 loops=1)
                        -> Materialize CTE t if needed  (cost=2.50..2.50 rows=0) (actual time=30069.836..30078.154 rows=68471 loops=1)
                            -> Window aggregate: row_number() OVER (PARTITION BY min_data.ticker ORDER BY min_data.t )   (actual time=29997.654..30020.406 rows=68471 loops=1)
                                -> Sort: min_data.ticker, min_data.t  (cost=2861564.76 rows=27233289) (actual time=29997.639..30002.597 rows=68471 loops=1)
                                    -> Filter: (min_data.ticker = 'TOPS')  (cost=2861564.76 rows=27233289) (actual time=0.512..29953.891 rows=68471 loops=1)
                                        -> Table scan on min_data  (cost=2861564.76 rows=27233289) (actual time=0.510..27585.660 rows=30323912 loops=1)
        -> Repeat until convergence
            -> Nested loop inner join  (cost=1093007.22 rows=3025619) (actual time=0.010..533731.050 rows=34235 loops=2)
                -> Filter: (ema.ticker is not null)  (cost=34040.61 rows=302561) (actual time=0.004..50.239 rows=34236 loops=2)
                    -> Scan new records on ema  (cost=34040.61 rows=302561) (actual time=0.003..26.824 rows=34236 loops=2)
                -> Filter: (ema.QuoteId = (t2.QuoteId - 1))  (cost=0.25..2.50 rows=10) (actual time=7.784..15.587 rows=1 loops=68471)
                    -> Index lookup on t2 using <auto_key0> (ticker=ema.ticker)  (actual time=0.004..7.431 rows=68471 loops=68471)
                        -> Materialize CTE t if needed (query plan printed elsewhere)  (cost=0.00..0.00 rows=0) (never executed)

：新的递归CTE和一定的扩展也可以查询优化。因此，我将为您提供有关如何使此查询更快的建议！

原文

I want to calculate the 9 min Exponential Moving Average (EMA) for a large set of minute stock-data exceeding 30,000,000 rows with roughly 4,500 different Tickers.
The query I have makes use of recursive cte due to the nature of the EMA, which is always based on the previous row (minute).

The code works but here is the problem. In order to calculate the EMA for only two stocks which in sum make up 14,000 rows of the minute_data table the query took 19min 40s. Assuming it is just as fast per row for the whole data-set it would take the MYSQL-server between 60h and 70h to execute.

The minute_data table is built as follows:

create table min_data
(
    t            datetime       not null,
    ticker       varchar(10)    not null,
    o            decimal(10, 4) not null,
    h            decimal(10, 4) not null,
    l            decimal(10, 4) not null,
    c            decimal(10, 4) not null,
    primary key (t, ticker)
);

I will only use the following columns:

t = date & time of each trading minute
ticker = Stock Symbol (e.g. Tesla -> TSLA)
c = close price of each trading minute

EMA calculation:
EMA = Closing price (current minute) * alpha + EMA (previous minute) * (1-alpha)

SET GLOBAL cte_max_recursion_depth=1000000;

SET @alpha = 2 / (1 + 9);

CREATE TABLE min_data_EMA9 AS
WITH RECURSIVE t AS (
    SELECT t, ticker,
           row_number() over (partition by ticker order by t) as QuoteId,
           c
    FROM min_data
),

ema (t, ticker, QuoteId, c, EMA9) AS (
    SELECT *, avg(c) as EMA9
    FROM t
    WHERE QuoteId between 1 and 8
    GROUP BY ticker

    UNION ALL

    SELECT t2.t,
           t2.ticker,
           t2.QuoteId,
           t2.c,
           @alpha * t2.c + (1 - @alpha) * EMA9 as EMA9
    FROM ema
    JOIN t t2
        ON ema.QuoteId = t2.QuoteId - 1
        AND  ema.ticker = t2.ticker
)

SELECT t, ticker, QuoteId, EMA9
FROM ema;

When limiting the first select statement with:
WHERE ticker = 'TOPS' to one single Stock the EXPLAIN ANALYZE function for the WITHstatement returns the following (Executiontime = 18min 39s):

-> Table scan on ema  (cost=0.01..37822.72 rows=3025619) (actual time=0.002..4.562 rows=68471 loops=1)
    -> Materialize recursive CTE ema  (cost=1395569.12..1433391.84 rows=3025619) (actual time=1097987.024..1097996.206 rows=68471 loops=1)
        -> Table scan on <temporary>  (actual time=0.001..0.001 rows=1 loops=1)
            -> Aggregate using temporary table  (actual time=30081.632..30081.633 rows=1 loops=1)
                -> Filter: (t.QuoteId between 1 and 8)  (cost=1.01..306376.90 rows=302562) (actual time=30069.843..30081.576 rows=8 loops=1)
                    -> Table scan on t  (cost=2.50..2.50 rows=0) (actual time=0.001..3.723 rows=68471 loops=1)
                        -> Materialize CTE t if needed  (cost=2.50..2.50 rows=0) (actual time=30069.836..30078.154 rows=68471 loops=1)
                            -> Window aggregate: row_number() OVER (PARTITION BY min_data.ticker ORDER BY min_data.t )   (actual time=29997.654..30020.406 rows=68471 loops=1)
                                -> Sort: min_data.ticker, min_data.t  (cost=2861564.76 rows=27233289) (actual time=29997.639..30002.597 rows=68471 loops=1)
                                    -> Filter: (min_data.ticker = 'TOPS')  (cost=2861564.76 rows=27233289) (actual time=0.512..29953.891 rows=68471 loops=1)
                                        -> Table scan on min_data  (cost=2861564.76 rows=27233289) (actual time=0.510..27585.660 rows=30323912 loops=1)
        -> Repeat until convergence
            -> Nested loop inner join  (cost=1093007.22 rows=3025619) (actual time=0.010..533731.050 rows=34235 loops=2)
                -> Filter: (ema.ticker is not null)  (cost=34040.61 rows=302561) (actual time=0.004..50.239 rows=34236 loops=2)
                    -> Scan new records on ema  (cost=34040.61 rows=302561) (actual time=0.003..26.824 rows=34236 loops=2)
                -> Filter: (ema.QuoteId = (t2.QuoteId - 1))  (cost=0.25..2.50 rows=10) (actual time=7.784..15.587 rows=1 loops=68471)
                    -> Index lookup on t2 using <auto_key0> (ticker=ema.ticker)  (actual time=0.004..7.431 rows=68471 loops=68471)
                        -> Materialize CTE t if needed (query plan printed elsewhere)  (cost=0.00..0.00 rows=0) (never executed)

I'm new to recursive cte and to a certain extend also to query optimization. Therefore I will be greatfull for your suggestions on how to make this query a lot faster!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

花桑 2025-02-13 19:47:57

由于您一次专注于一个股票，所以这可能会好多了：

PRIMARY KEY(ticker, t)

您是在第一个读取EMA并阅读到最后一次吗？并且不保存任何中间结果吗？您使用的因素是什么？根据因素的不同，您实际上不需要以超过最后一百个左右的值进行计算。

即使没有任何变化，您是否也存储每分钟的收盘价？

假设市场永远不会在一个漫长的周末（即，没有9/11打ic）上关闭，这应该足够（并且过度杀伤）：

WHERE ticker = ?
  AND t > NOW() - INTERVAL 4 DAYS
ORDER BY t

我强烈建议您选择 参赛作品并进行EMA在您的申请中。它的速度比< / code> / cte / cte /存储的函数 /等要快得多。

请注意，我更改主键将使和订购订购的行将是连续订购的。而且，与的不同，只需要进行一次通行证。在您的应用程序中，您只需从选择中收到的数组进行一次通过即可。

今天，您只有2个股票；明天您将有3个。为每个股票做一个选择并计算其EMA。也就是说，循环通过应用程序内的2个以上的股票。

另一个想法...在每个小时结束时（包括一天的结束）存储EMA。然后，当您想重新计算它时，请从那里捡起；不会丢失精度，但要做的计算要少得多。

将每小时的EMA存放在单独的桌子中。 EMA算法的价值非常简单。现在，您正在查看类似

WHERE ticker = ?
  AND t > CONCAT(LEFT(NOW(), 13), ":00:00")
ORDER BY t

concat的内容计算当前小时的开始（作为字符串），然后使用该contecat以与t（a dateTime）进行比较。同样，我的PK和“一次做一个股票”建议适用。

Since you are focused on one ticker at a time, this would probably be much better:

PRIMARY KEY(ticker, t)

Are you calculating the EMA from the first and reading to the last? And not saving any intermediate results? What is the factor you are using? Depending on the factor, you really don't need to compute with more than the last hundred or so values.

Do you store the closing price for every minute, even if nothing changes?

Assuming the market never closes for more than a long weekend (that is, no 9/11 hiccups), this should be sufficient (and overkill):

WHERE ticker = ?
  AND t > NOW() - INTERVAL 4 DAYS
ORDER BY t

I strongly suggest that you SELECT the entries and do the EMA in your application. It will be a a lot faster than OVER / CTE / stored function / etc. inside MySQL.

Note that my change to the PRIMARY KEY will make the rows fetched by the WHERE and ORDER BY will be consecutive and already ordered. And, unlike OVER, only one pass will need to be made. In your app, you can simply make one pass over the array received from the SELECT.

Today you have "only 2 stocks"; tomorrow you will have 3. Do one Select for each stock and compute its EMA. That is, loop through the 2+ stocks inside your app.

Another thought... Store the EMA at the end of each hour (including the end of the day). Then, when you want to recompute it, pick up from there; there will be no precision lost, but a lot less computation to do.

Store those hourly EMAs in a separate table. The value primes the EMA algorithm very simply. Now you are looking at something like

WHERE ticker = ?
  AND t > CONCAT(LEFT(NOW(), 13), ":00:00")
ORDER BY t

That CONCAT computes the start of the current hour (as a string) and then uses that to compare with t (a DATETIME). Again, my PK and the "do one ticker at a time" suggestions apply.

回复收藏 0 原文

~没有更多了~