Inserting aggregates into a big table is too slow
I need to run a query as a cronjob, and it is too slow. It sums a value from a view (data_time_series) for each pair of akd and ii, and inserts a row into storage_time_series if those values are not present there yet. Here is the query:
INSERT INTO storage_time_series (time, value, akd, ii)
SELECT data.time, data.value, data.akd, data.ii FROM
( SELECT
now() AS time,
SUM(nullif(value, 'NaN')) AS value,
labels[2] AS akd,
labels[4] AS ii
FROM data_time_series
GROUP BY akd, ii) AS data
WHERE NOT EXISTS (
SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.akd = data.akd AND dst.ii = data.ii );
The first part of the query produces around 200K rows and is fast:
SELECT
now() AS time,
SUM(nullif(value, 'NaN')) AS value,
labels[2] AS akd,
labels[4] AS ii
FROM public.data_time_series
GROUP BY akd, ii
The second part of the query checks whether the data already exists in storage_time_series. This part is too slow:
WHERE NOT EXISTS (
SELECT 1 FROM storage_time_series AS dst WHERE dst.value = data.value AND dst.akd = data.akd AND dst.ii = data.ii)
storage_time_series is a huge table whose data I need to keep, and akd and ii are indexed on storage_time_series.
Here is the table schema:
SELECT version() :
PostgreSQL 12.9 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
CREATE TABLE storage_time_series
(
"time" timestamp with time zone,
value double precision,
akd integer,
ii integer
)
CREATE INDEX
ON storage_time_series USING btree
(akd ASC NULLS LAST, ii ASC NULLS LAST, value ASC NULLS LAST)
Number of rows in storage_time_series: ~190 million.
data_time_series is a view with about 1.3 million rows.
Is there any way to speed up this process?
EDIT: I've already tested creating a unique index over akd, ii and value and changing the query to a simple INSERT with ON CONFLICT DO NOTHING, but the unique index seems to make the insert process even slower than before.
Comments (1)
Sounds good. But too vague to say more. If, as it seems, (akd, ii, value) is supposed to be unique, create a UNIQUE INDEX accordingly. Then your query burns down to a plain INSERT with ON CONFLICT DO NOTHING:
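A sketch of both statements, using the columns from the schema above (the index name is an invented example; note that on Postgres 12 a unique index treats NULLs as distinct, so rows with NULL in any key column are never considered duplicates):

```sql
-- One-time setup: enforce uniqueness of (akd, ii, value).
-- The existing plain btree index on the same columns becomes redundant
-- and can be dropped afterwards.
CREATE UNIQUE INDEX storage_time_series_akd_ii_value_key
ON storage_time_series (akd, ii, value);

-- The cronjob query: let the unique index reject duplicates
-- instead of probing 190M rows with NOT EXISTS.
INSERT INTO storage_time_series (time, value, akd, ii)
SELECT now() AS time
     , SUM(nullif(value, 'NaN')) AS value
     , labels[2] AS akd
     , labels[4] AS ii
FROM   data_time_series
GROUP  BY labels[2], labels[4]
ON     CONFLICT (akd, ii, value) DO NOTHING;
```

This way the duplicate check rides on the same index that enforces uniqueness, rather than running a separate anti-join.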
Which should be fast. Much depends on undisclosed details.