Postgres select query does a sequential scan instead of an index scan on a table with 18 million rows

Posted 2025-01-25 23:38:04


I have a Postgres table which has almost 18 million rows, and I am trying to run this query:

select * from answer where submission_id = 5 and deleted_at is NULL;

There is a partial index on the table on column submission_id. This is the command used to create the index:

CREATE INDEX answer_submission_id ON answer USING btree (submission_id) WHERE (deleted_at IS NULL)

This is the EXPLAIN ANALYZE output of the above select query:

Gather  (cost=1000.00..3130124.70 rows=834 width=377) (actual time=7607.568..7610.130 rows=2 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=2144966 read=3
  I/O Timings: read=6.169
  ->  Parallel Seq Scan on answer  (cost=0.00..3129041.30 rows=348 width=377) (actual time=6501.951..7604.623 rows=1 loops=3)
        Filter: ((deleted_at IS NULL) AND (submission_id = 5))
        Rows Removed by Filter: 62213625
        Buffers: shared hit=2144966 read=3
        I/O Timings: read=6.169
Planning Time: 0.117 ms
Execution Time: 7610.154 ms

Ideally it should pick the answer_submission_id index, but Postgres is going for a sequential scan.
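
One way to confirm that the index is usable for this query at all (a diagnostic sketch only; enable_seqscan is a session setting for testing and should not be left off in production) is to disable sequential scans and re-run the query:

-- Force the planner away from sequential scans for this session
SET enable_seqscan = off;
EXPLAIN ANALYZE select * from answer where submission_id = 5 and deleted_at is NULL;
-- Restore the default planner behaviour
RESET enable_seqscan;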

Any help would be really appreciated.

Comments (1)

沉默的熊 2025-02-01 23:38:04


The execution plan shows a large deviation between the estimated row count and the actual row count (rows=834 estimated vs. rows=2 actual in the Gather node).

The PostgreSQL optimizer is a cost-based optimizer (CBO): a query is executed using the plan with the smallest estimated cost, so wrong statistics can lead it to choose a bad execution plan.

Here is a link that shows how wrong statistics cause a slow query:
Why are bad row estimates slow in Postgres?
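
As a first look at what the planner believes about the column (a minimal sketch, assuming the answer table and submission_id column from the question), you can inspect the column statistics in pg_stats:

-- Column-level statistics the planner uses for selectivity estimates
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'answer'
  AND attname = 'submission_id';

If submission_id = 5 does not appear in most_common_vals, the planner falls back to a generic estimate based on n_distinct, which can be far off for a skewed value distribution.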

First, I would use this query to check when the table was last analyzed and vacuumed (last_analyze & last_vacuum):

SELECT
  schemaname, relname,
  last_vacuum, last_autovacuum,
  vacuum_count, autovacuum_count,
  last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'tablename';

If your statistics are stale, you can run ANALYZE "tablename" to collect fresh statistics from the table; the scan speed of ANALYZE depends on the table size.

For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run. To avoid this, raise the amount of statistics collected by ANALYZE, as described below.
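
For example (a sketch using the answer table from the question; the statistics target of 1000 is an arbitrary illustrative value, the default is 100):

-- Refresh the statistics for the table
ANALYZE answer;

-- If the estimates are still bad, collect a larger sample for the column
ALTER TABLE answer ALTER COLUMN submission_id SET STATISTICS 1000;
ANALYZE answer;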

When we UPDATE and DELETE data, dead tuples are created; they may still exist in the heap or the indexes even though they can no longer be queried. VACUUM helps us reclaim the storage occupied by dead tuples.
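
Both maintenance steps can be combined in one command (a sketch, again assuming the answer table from the question):

-- Reclaim dead tuples and refresh statistics in one pass
VACUUM (VERBOSE, ANALYZE) answer;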
