Slow select distinct query on Postgres
I'm running the following two queries quite frequently on a table that essentially gathers up logging information. Both select distinct values from a huge number of rows, but there are fewer than 10 distinct values among them.
I've analyzed both "distinct" queries issued by the page:
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (cost=1070734.05..1070734.11 rows=6 width=21)
-> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023050.24 rows=19073524 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (cost=1070735.34..1070735.39 rows=5 width=13)
-> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023051.47 rows=19073547 width=13)
(2 rows)
Both do sequential scans of the table. However, if I turn off enable_seqscan (despite the name, this only disables sequential scans on columns that have indexes), the query uses the index, but is even slower:
marchena=> set enable_seqscan = off;
SET
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..19613740.62 rows=6 width=21)
-> Index Scan using audit_bundle_idx on audit_records auditrecor0_ (cost=0.00..19566056.69 rows=19073570 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..45851449.96 rows=5 width=13)
-> Index Scan using audit_server_idx on audit_records auditrecor0_ (cost=0.00..45803766.04 rows=19073570 width=13)
(2 rows)
Both the bundle_id and server_name columns have btree indexes on them. Should I be using a different type of index to make selecting distinct values fast?
Comments (4)
PG can't (yet) use an index for distinct (skipping the identical values), but you can do this:
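The code this comment originally pointed to is not preserved on this page. Below is a sketch of one well-known workaround that matches the description: a recursive CTE that emulates a "loose index scan" by probing the btree index once per distinct value instead of reading every row. The table and column names are taken from the question, and it assumes bundle_id contains no NULLs.

-- Emulate a skip scan over audit_bundle_idx: each step is a single index probe
-- for the smallest bundle_id greater than the previous one, so the work grows
-- with the number of distinct values rather than the number of rows.
with recursive distinct_bundles as (
    (select bundle_id
     from audit_records
     order by bundle_id
     limit 1)
    union all
    select (select a.bundle_id
            from audit_records a
            where a.bundle_id > d.bundle_id
            order by a.bundle_id
            limit 1)
    from distinct_bundles d
    where d.bundle_id is not null
)
select bundle_id
from distinct_bundles
where bundle_id is not null;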
You're selecting distinct values from the whole table, which automatically leads to a seq scan. You have millions of rows, so it will necessarily be slow.
There's a trick to get the distinct values faster, but it only works when the data has a known (and reasonably small) set of possible values. For instance, I take it that your bundle_id references some kind of bundles table, which is smaller. This means you can write:
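The query shown in this answer is missing from this copy of the page; a sketch of the pattern being described, assuming a bundles table whose key bundle_id is what audit_records.bundle_id references, would be:

-- One index probe into audit_records per bundle, instead of scanning 19M rows.
select b.bundle_id
from bundles b
where exists (
    select 1
    from audit_records a
    where a.bundle_id = b.bundle_id
);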
This should lead to a nested loop / seq scan on bundles -> index scan on audit_records using the index on bundle_id.
I have the same problem with tables of more than 300 million records and an indexed field with a few distinct values. I couldn't get rid of the seq scan, so I made this function to simulate a distinct search using the index if it exists. If your table has a number of distinct values proportional to the total number of records, this function isn't good. It also has to be adjusted for multi-column distinct values. Warning: this function is wide open to SQL injection and should only be used in a secure environment.
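The function body is not preserved here. The following is a sketch of how such a small_distinct function could be written: dynamic SQL that repeatedly fetches, through the index, the smallest value greater than the last one returned. It assumes a text-compatible column, and, as the warning above says, the string concatenation leaves it open to SQL injection.

create or replace function small_distinct(table_name text, field_name text)
returns setof text as $$
declare
    current_value text;
begin
    -- First value: one index probe for the smallest entry.
    execute 'select ' || field_name || ' from ' || table_name ||
            ' order by ' || field_name || ' limit 1'
        into current_value;

    while current_value is not null loop
        return next current_value;
        -- Next distinct value: skip past everything equal to the last one.
        execute 'select ' || field_name || ' from ' || table_name ||
                ' where ' || field_name || ' > $1' ||
                ' order by ' || field_name || ' limit 1'
            into current_value
            using current_value;
    end loop;
end;
$$ language plpgsql;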
Explain analyze results:
Query with normal SELECT DISTINCT: Total runtime: 598310.705 ms
Query with SELECT small_distinct(...): Total runtime: 1.156 ms
Call samples:
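The original call samples are also missing; with the sketch above and the question's table, calls would look like:

select * from small_distinct('audit_records', 'bundle_id');
select * from small_distinct('audit_records', 'server_name');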
On PostgreSQL 9.3, starting from the answer from Denis:
just by adding a 'limit 1' to the subquery, I got a 60x speedup (for my use case, with 8 million records, a composite index and 10k combinations), going from 1800 ms to 30 ms:
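The poster's exact query (and composite-index schema) is not shown in this copy; reusing the names from the EXISTS example above, the change being described amounts to something like:

select b.bundle_id
from bundles b
where exists (
    select 1
    from audit_records a
    where a.bundle_id = b.bundle_id
    limit 1  -- the added 'limit 1'
);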