Inserting aggregates into a big table is too slow
I need to run a query as a cronjob, and it is too slow. It sums a value from a view (data_time_series) for each pair of akd and ii, and inserts a row into storage_time_series if those values are not present there yet. Here is the query:
INSERT INTO storage_time_series (time, value, akd, ii)
SELECT data.time, data.value, data.akd, data.ii FROM
( SELECT
now() AS time,
SUM(nullif(value, 'NaN')) AS value,
labels[2] AS akd,
labels[4] AS ii
FROM data_time_series
GROUP BY akd, ii) AS data
WHERE NOT EXISTS (
SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.akd = data.akd AND dst.ii = data.ii );
The first part of the query produces around 200K rows and is fast:
SELECT
now() AS time,
SUM(nullif(value, 'NaN')) AS value,
labels[2] AS akd,
labels[4] AS ii
FROM public.data_time_series
GROUP BY akd, ii
The second part of the query checks whether the data already exists in storage_time_series. This part is too slow:
WHERE NOT EXISTS (
SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.akd = data.akd AND dst.ii = data.ii)
storage_time_series is a huge table whose data I need to keep, and akd and ii are indexed on storage_time_series.
Here is the table schema:
SELECT version() :
PostgreSQL 12.9 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
CREATE TABLE storage_time_series
(
"time" timestamp with time zone,
value double precision,
akd integer,
ii integer
)
CREATE INDEX
ON storage_time_series USING btree
(akd ASC NULLS LAST, ii ASC NULLS LAST, value ASC NULLS LAST)
Number of rows in storage_time_series: ~190 million.
data_time_series is a view with about 1.3 million rows.
Is there any way to speed up this process?
EDIT: I've already tested creating a unique index over akd, ii and value and changing the query to a simple INSERT with ON CONFLICT DO NOTHING, but the unique index seems to make the insert process even slower than before.
Comments (1)
Sounds good. But too vague to say more. If, as it seems, (akd, ii, value) is supposed to be unique, create a UNIQUE INDEX accordingly. Then your query burns down to a plain INSERT with ON CONFLICT DO NOTHING:
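A sketch of both statements, using the columns from the schema above (the index name is an invented example; note that on Postgres 12 a unique index treats NULLs as distinct, so rows with NULL in any key column are never considered duplicates):

```sql
-- One-time setup: enforce uniqueness of (akd, ii, value).
-- The existing plain btree index on the same columns becomes redundant
-- and can be dropped afterwards.
CREATE UNIQUE INDEX storage_time_series_akd_ii_value_key
ON storage_time_series (akd, ii, value);

-- The cronjob query: let the unique index reject duplicates
-- instead of probing 190M rows with NOT EXISTS.
INSERT INTO storage_time_series (time, value, akd, ii)
SELECT now() AS time
     , SUM(nullif(value, 'NaN')) AS value
     , labels[2] AS akd
     , labels[4] AS ii
FROM   data_time_series
GROUP  BY labels[2], labels[4]
ON     CONFLICT (akd, ii, value) DO NOTHING;
```

This way the duplicate check rides on the same index that enforces uniqueness, rather than running a separate anti-join.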
Which should be fast. Much depends on undisclosed details.