为什么 PostgreSQL 对索引列执行顺序扫描?
非常简单的例子 - 一张表,一个索引,一个查询:
CREATE TABLE book
(
id bigserial NOT NULL,
"year" integer,
-- other columns...
);
CREATE INDEX book_year_idx ON book (year)
EXPLAIN
SELECT *
FROM book b
WHERE b.year > 2009
给我:
Seq Scan on book b (cost=0.00..25663.80 rows=105425 width=622)
Filter: (year > 2009)
为什么它不执行索引扫描? 我缺少什么?
Very simple example - one table, one index, one query:
CREATE TABLE book
(
id bigserial NOT NULL,
"year" integer,
-- other columns...
);
CREATE INDEX book_year_idx ON book (year)
EXPLAIN
SELECT *
FROM book b
WHERE b.year > 2009
gives me:
Seq Scan on book b (cost=0.00..25663.80 rows=105425 width=622)
Filter: (year > 2009)
Why it does NOT perform index scan instead?
What am I missing?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
如果 SELECT 返回表中所有行的大约 5-10% 以上,则顺序扫描比索引扫描快得多。
这是因为索引扫描需要对每一行进行多次 IO 操作(在索引中查找该行,然后从堆中检索该行)。而顺序扫描每行只需要一个 IO,甚至更少,因为磁盘上的一个块(页)包含多个行,因此可以通过单个 IO 操作获取多个行。
顺便说一句:对于其他 DBMS 来说也是如此 - 一些优化作为“仅索引扫描”被搁置(但对于 SELECT * 来说,这样的 DBMS 不太可能进行“仅索引扫描”)
If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan.
This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap). Whereas a sequential scan only requires a single IO for each row - or even less because a block (page) on the disk contains more than one row, so more than one row can be fetched with a single IO operation.
Btw: this is true for other DBMS as well - some optimizations as "index only scans" taken aside (but for a SELECT * it's highly unlikely such a DBMS would go for an "index only scan")
您是否分析表/数据库?那么统计呢?当年份 > 的记录很多时2009 年,顺序扫描可能比索引扫描更快。
Did you ANALYZE the table/database? And what about the statistics? When there are many records where year > 2009, a sequential scan might be faster than an index scan.
@a_horse_with_no_name 解释得很好。另外,如果您确实想使用索引扫描,通常应该在 where 子句中使用有界范围。例如-
年> 2019年及年< 2020 年。
很多时候表上的统计信息不会更新,而且由于限制可能无法更新。在这种情况下,优化器将不知道在year > 中应该采取多少行。 2019。因此它选择顺序扫描来代替完整知识。有界分区在大多数情况下可以解决问题。
@a_horse_with_no_name explained it quite well. Also if you really want to use an index scan, you should generally use bounded ranges in where clause. eg -
year > 2019 and year < 2020.
A lot of the times statistics are not updated on a table and it may not be possible to do so due to constraints. In this case, the optimizer will not know how many rows it should take in year > 2019. Thus it selects a sequential scan in lieu of full knowledge. Bounded partitions will solve the problem most of the time.
在索引扫描中,读头从一行跳转到另一行,这比读取下一个物理块(在顺序扫描中)慢 1000 倍。
因此,如果(要检索的记录数 * 1000)小于记录总数,则索引扫描的性能会更好。
In index scan, read head jumps from one row to another which is 1000 times slower than reading the next physical block (in the sequential scan).
So, if the (number of records to be retrieved * 1000) is less than the total number of records, the index scan will perform better.