PostgreSql：具有相同列的不同查询计划

发布于 2024-09-07 19:08:37 字数 663 浏览 7 评论 0原文

我有一个有 2 个外键的表，我们称它们为 fk1 和 fk2。两者具有相同的类型和相同的索引。但是当我“解释”一个简单的选择查询时，我得到了完全不同的查询计划。对于 fk1：

explain select * from mytable where fk1 = 1;

结果

Index Scan using fk1_idx on mytable  (cost=0.00..9.32 rows=2 width=4)
  Index Cond: (fk1 = 1)

对于 fk2

explain select * from mytable where fk2 = 1;

结果：

Bitmap Heap Scan on mytable  (cost=5.88..659.18 rows=208 width=4)
  Recheck Cond: (fk2 = 1)
  ->  Bitmap Index Scan on fk2_idx  (cost=0.00..5.83 rows=208 width=0)
        Index Cond: (fk2 = 1)

第二个似乎效率较低。是否因为它可能返回更多结果，因此更复杂的查询会得到回报？

原文

I have a table with 2 foreign keys, lets call them fk1 and fk2. Both have identical types, and identical indices. But when I "explain" a simple select query, I get completely different query plans. For fk1:

explain select * from mytable where fk1 = 1;

Result

Index Scan using fk1_idx on mytable  (cost=0.00..9.32 rows=2 width=4)
  Index Cond: (fk1 = 1)

For fk2

explain select * from mytable where fk2 = 1;

Result:

Bitmap Heap Scan on mytable  (cost=5.88..659.18 rows=208 width=4)
  Recheck Cond: (fk2 = 1)
  ->  Bitmap Index Scan on fk2_idx  (cost=0.00..5.83 rows=208 width=0)
        Index Cond: (fk2 = 1)

The second one seems to be more inefficient. Is it due to the fact that it potentially returns more results, and thus the more complex query pays off?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

早茶月光 2024-09-14 19:08:37

是的，这归结为谓词的“选择性”（“where ...”子句）。

如果谓词仅选择表中行的一小部分，则可以通过任意顺序访问表数据来单独获取每一行，因为只会获取几页。

随着要选择的行数增加，位图扫描变得更合适：索引用于确定表中的哪些页“有趣”，然后按照它们在表数据文件中的放置顺序扫描这些页。这样做的优点是可以一起请求相邻页面，文件系统/磁盘系统可能会更有效地提供服务。（当然，这取决于表数据文件是否合理地未碎片化）。由于不保留每个页面中实际感兴趣的元组集，而仅保留页面集本身，因此必须为检索到的页面中的每个元组重新评估谓词：因此查询中的“重新检查条件”。（此策略的一个优点是，它允许组合多个独立索引中的查找，只需将多个索引查找中的“感兴趣的页面”位图简单地与（或或）在一起）。

随着要选择的行数进一步增加，扫描索引的优势逐渐减弱，因为可能的结果是表的大部分内容将被标记为“有趣”。因此，最终，简单的顺序扫描变得合适：按顺序遍历所有页面，并忽略索引。

IIRC 这往往表明，请求少于 15% 的表可能是索引扫描，15-50% 是位图扫描，50% 以上是序列扫描。非常粗略。这很大程度上受 random_page_cost 和 seq_page_cost 等相对设置（例如 effective_cache_size）的影响。

Postgresql 收集有关常见值及其频率的统计信息，以及数据库中每列的其他值的直方图 - 这用于估计选择性并填充您在 EXPLAIN 输出中看到的“行”估计。该文档包含如何完成此操作的描述： http:// /www.postgresql.org/docs/current/static/planner-stats-details.html

Yes, this comes down to what's called the "selectivity" of the predicate ("where ..." clause).

If the predicate is selecting only a small fraction of the rows in the table, it makes sense to fetch each one individually, with arbitrarily-ordered access to the table data, since only a few pages will be fetched.

As the number of rows to be selected increases, a bitmap scan becomes more appropriate: the index is used to determine which pages within the table are "interesting", and then those pages are scanned in the order they are placed in the table data file. This has the advantage that adjacent pages can be requested together, which is likely to be much more efficiently served by the filesystem/disk system. (Of course, this depends on the table data file being reasonably unfragmented). Since the set of actual interesting tuples within each page is not kept, just the set of pages themselves, the predicate has to be re-evaluated for each tuple in the retrieved pages: hence the "recheck cond" in the query. (One advantage of this strategy is that it allows lookups in multiple independent indices to be combined, by simply ANDing (or ORing) together the "interesting pages" bitmaps from several index lookups)

As the number of rows to be selected grows still further, the advantage of scanning the index dwindles as the likely result is seen to be that most of the table will be marked as "interesting". So eventually a plain sequential scan becomes appropriate: all the pages are walked through in order, and the index is ignored.

IIRC this tends to work out that requesting less than 15% of the table is likely an index scan, 15-50% a bitmap scan, 50%+ a seq scan. Very roughly. This is much affected by the relative settings of random_page_cost and seq_page_cost, amongst others (e.g. effective_cache_size).

Postgresql collects statistics about common values and their frequencies, and histograms of other values for each column in the database- this is used to estimate selectivity and populates the "rows" estimate you seen in EXPLAIN output. The documentation contains a description of how this is done: http://www.postgresql.org/docs/current/static/planner-stats-details.html

回复收藏 0 原文