当前位置：文江博客话题详情

Postgres ANTI-JOIN 需要表扫描吗？

发布于 2024-10-21 07:44:56 字数 297 浏览 4 评论 0原文

我需要在同一个表上使用 ANTI-JOIN （不存在从表中选择某些内容.../左连接表 WHERE table.id IS NULL）。实际上，我有一个索引来解决不存在的问题，但查询规划器选择使用位图堆扫描。

该表有 1 亿行，因此进行堆扫描会很混乱......

如果 Postgres 可以与索引进行比较，那将会非常快。 Postgres 是否必须访问此 ANTI-JOIN 的表？

我知道必须在某个时候访问该表才能为 MVCC 提供服务，但为什么这么早呢？ NOT EXISTS 只能由表来修复吗，因为否则它可能会丢失一些东西？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梦在夏天 2024-10-28 07:44:56

您需要提供版本详细信息，正如 jmz 所说的 EXPLAIN ANALYZE 输出以获得任何有用的建议。

弗兰兹 - 不要想是否可能，测试一下就知道了。

这是 v9.0：

CREATE TABLE tl (i int, t text);
CREATE TABLE tr (i int, t text);
INSERT INTO tl SELECT s, 'text ' || s FROM generate_series(1,999999) s;
INSERT INTO tr SELECT s, 'text ' || s FROM generate_series(1,999999) s WHERE s % 3 = 0;
ALTER TABLE tl add primary key (i);
CREATE INDEX tr_i_idx ON tr (i);
ANALYSE;
EXPLAIN ANALYSE SELECT i,t FROM tl LEFT JOIN tr USING (i) WHERE tr.i IS NULL;
                                                         QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.95..45611.86 rows=666666 width=15) (actual time=0.040..4011.970 rows=666666 loops=1)
   Merge Cond: (tl.i = tr.i)
   ->  Index Scan using tl_pkey on tl  (cost=0.00..29201.32 rows=999999 width=15) (actual time=0.017..1356.996 rows=999999 lo
   ->  Index Scan using tr_i_idx on tr  (cost=0.00..9745.27 rows=333333 width=4) (actual time=0.015..439.087 rows=333333 loop
 Total runtime: 4602.224 ms

您看到的内容将取决于您的版本以及规划器看到的统计信息。

You'll need to provide version details, and as jmz says EXPLAIN ANALYSE output to get any useful advice.

Franz - don't think whether it's possible, test and know.

This is v9.0:

CREATE TABLE tl (i int, t text);
CREATE TABLE tr (i int, t text);
INSERT INTO tl SELECT s, 'text ' || s FROM generate_series(1,999999) s;
INSERT INTO tr SELECT s, 'text ' || s FROM generate_series(1,999999) s WHERE s % 3 = 0;
ALTER TABLE tl add primary key (i);
CREATE INDEX tr_i_idx ON tr (i);
ANALYSE;
EXPLAIN ANALYSE SELECT i,t FROM tl LEFT JOIN tr USING (i) WHERE tr.i IS NULL;
                                                         QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.95..45611.86 rows=666666 width=15) (actual time=0.040..4011.970 rows=666666 loops=1)
   Merge Cond: (tl.i = tr.i)
   ->  Index Scan using tl_pkey on tl  (cost=0.00..29201.32 rows=999999 width=15) (actual time=0.017..1356.996 rows=999999 lo
   ->  Index Scan using tr_i_idx on tr  (cost=0.00..9745.27 rows=333333 width=4) (actual time=0.015..439.087 rows=333333 loop
 Total runtime: 4602.224 ms

What you see will depend on your version, and the stats the planner sees.

回复收藏 0 原文

无敌元气妹 2024-10-28 07:44:56

我的（简化的）查询：

SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY id;

像这样的查询计划有效：

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.57..3831.88 rows=128092 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.42..3399.70 rows=130352 width=8)
   ->  Index Only Scan using b_pkey on b  (cost=0.15..78.06 rows=2260 width=8)
(4 rows)

但是，如果规划器认为它可能更好，有时 postgresql 9.5.9 会切换到顺序扫描（请参阅为什么 PostgreSQL 对索引列执行顺序扫描？）。然而，就我而言，这让事情变得更糟。

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=405448.22..39405858.08 rows=1365191502 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.58..35528317.86 rows=1368180352 width=8)
   ->  Materialize  (cost=405447.64..420391.89 rows=2988850 width=8)
         ->  Sort  (cost=405447.64..412919.76 rows=2988850 width=8)
               Sort Key: b.id
               ->  Seq Scan on b  (cost=0.00..43113.50 rows=2988850 width=8)
(7 rows)

我的（黑客）解决方案是通过以下方式阻止顺序扫描：

set enable_seqscan to off;

postgresql 文档说执行此操作的正确方法是使用 ALTER TABLESPACE 对 seq_page_cost 进行操作。在索引列上使用 ORDER BY 时这可能是明智的，但我不确定。 https://www.postgresql.org/docs/9.1/静态/runtime-config-query.html

My (simplified) query:

SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY id;

The query plan like this works:

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.57..3831.88 rows=128092 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.42..3399.70 rows=130352 width=8)
   ->  Index Only Scan using b_pkey on b  (cost=0.15..78.06 rows=2260 width=8)
(4 rows)

However, sometimes postgresql 9.5.9 would switch to a sequential scan if the planner thought it might be better (see Why does PostgreSQL perform sequential scan on indexed column?). However, in my case it made things worse.

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=405448.22..39405858.08 rows=1365191502 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.58..35528317.86 rows=1368180352 width=8)
   ->  Materialize  (cost=405447.64..420391.89 rows=2988850 width=8)
         ->  Sort  (cost=405447.64..412919.76 rows=2988850 width=8)
               Sort Key: b.id
               ->  Seq Scan on b  (cost=0.00..43113.50 rows=2988850 width=8)
(7 rows)

My (hack) solution was to discourage sequential scans by:

set enable_seqscan to off;

The postgresql documentation says the proper way to do this is to the seq_page_cost using ALTER TABLESPACE. This might be advisable when using ORDER BY on indexed columns, but I'm not sure. https://www.postgresql.org/docs/9.1/static/runtime-config-query.html

回复收藏 0 原文

~没有更多了~