Does a Postgres ANTI-JOIN need a table scan?

Posted on 2024-10-21 07:44:56

I need an ANTI-JOIN (NOT EXISTS (SELECT something FROM table ...) / LEFT JOIN table WHERE table.id IS NULL) on the same table. Actually, I have an index that could answer the NOT EXISTS question, but the query planner chooses to use a bitmap heap scan.

The table has 100 million rows, so a heap scan is very expensive.

It would be really fast if Postgres could do the comparison using only the indices. Does Postgres have to visit the table for this ANTI-JOIN?

I know the table has to be visited at some point to satisfy MVCC, but why so early? Can NOT EXISTS only be resolved by visiting the table, because an index alone could miss something?
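
For reference, the two anti-join spellings mentioned above might look like the following minimal sketch. The table `items` and its columns `id` and `parent_id` are hypothetical names chosen for illustration; they are not from the original question.

```sql
-- Hypothetical self-referencing table: items(id, parent_id),
-- assumed to have an index on parent_id.

-- Form 1: NOT EXISTS
SELECT i.id
FROM items i
WHERE NOT EXISTS (SELECT 1 FROM items c WHERE c.parent_id = i.id);

-- Form 2: LEFT JOIN ... IS NULL
-- Matched rows always have a non-null c.parent_id (it equals i.id),
-- so the IS NULL test keeps only rows of i that have no match.
SELECT i.id
FROM items i
LEFT JOIN items c ON c.parent_id = i.id
WHERE c.parent_id IS NULL;
```

Both forms ask for rows with no match, so the planner can treat either one as an anti join.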

Comments (2)

梦在夏天 2024-10-28 07:44:56

You'll need to provide version details and, as jmz says, EXPLAIN ANALYSE output to get any useful advice.

Franz - don't speculate about whether it's possible; test it and find out.

This is v9.0:

CREATE TABLE tl (i int, t text);
CREATE TABLE tr (i int, t text);
INSERT INTO tl SELECT s, 'text ' || s FROM generate_series(1,999999) s;
INSERT INTO tr SELECT s, 'text ' || s FROM generate_series(1,999999) s WHERE s % 3 = 0;
ALTER TABLE tl add primary key (i);
CREATE INDEX tr_i_idx ON tr (i);
ANALYSE;
EXPLAIN ANALYSE SELECT i, tl.t FROM tl LEFT JOIN tr USING (i) WHERE tr.i IS NULL;
                                                         QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.95..45611.86 rows=666666 width=15) (actual time=0.040..4011.970 rows=666666 loops=1)
   Merge Cond: (tl.i = tr.i)
   ->  Index Scan using tl_pkey on tl  (cost=0.00..29201.32 rows=999999 width=15) (actual time=0.017..1356.996 rows=999999 lo
   ->  Index Scan using tr_i_idx on tr  (cost=0.00..9745.27 rows=333333 width=4) (actual time=0.015..439.087 rows=333333 loop
 Total runtime: 4602.224 ms

What you see will depend on your version and on the statistics the planner sees.
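
As a rough sketch of how to collect that information, the statements below report the server version, refresh the planner statistics for the two example tables, and rerun the plan with buffer counts. The `BUFFERS` option assumes version 9.0 or later; the table names are the ones from the example above.

```sql
-- Report the exact server version (worth including when asking for help).
SELECT version();

-- Refresh planner statistics for the two example tables.
ANALYZE tl;
ANALYZE tr;

-- BUFFERS adds block hit/read counts to each plan node,
-- which shows how much of the work touched the heap or disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT i, tl.t FROM tl LEFT JOIN tr USING (i) WHERE tr.i IS NULL;
```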

无敌元气妹 2024-10-28 07:44:56

My (simplified) query:

SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY id;

A query plan like this works well:

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.57..3831.88 rows=128092 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.42..3399.70 rows=130352 width=8)
   ->  Index Only Scan using b_pkey on b  (cost=0.15..78.06 rows=2260 width=8)
(4 rows)

However, PostgreSQL 9.5.9 would sometimes switch to a sequential scan when the planner estimated it to be cheaper (see "Why does PostgreSQL perform sequential scan on indexed column?"). In my case, that made things worse:

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=405448.22..39405858.08 rows=1365191502 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.58..35528317.86 rows=1368180352 width=8)
   ->  Materialize  (cost=405447.64..420391.89 rows=2988850 width=8)
         ->  Sort  (cost=405447.64..412919.76 rows=2988850 width=8)
               Sort Key: b.id
               ->  Seq Scan on b  (cost=0.00..43113.50 rows=2988850 width=8)
(7 rows)

My (hacky) solution was to discourage sequential scans with:

set enable_seqscan to off;

The PostgreSQL documentation says the proper way to do this is to adjust seq_page_cost using ALTER TABLESPACE. This might be advisable when using ORDER BY on indexed columns, but I'm not sure. https://www.postgresql.org/docs/9.1/static/runtime-config-query.html
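
A minimal sketch of both approaches follows. The first limits the `enable_seqscan` override to a single transaction with `SET LOCAL`, so other queries in the session are unaffected. The second adjusts the page cost settings for one tablespace. The tablespace name `fast_ssd` and the cost values are assumptions for illustration only, not recommendations.

```sql
-- Option 1: limit the planner override to one transaction.
BEGIN;
SET LOCAL enable_seqscan = off;   -- reverts automatically at COMMIT/ROLLBACK
SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY id;
COMMIT;

-- Option 2: adjust page costs for a single tablespace.
-- "fast_ssd" is a hypothetical tablespace name; the values are illustrative.
ALTER TABLESPACE fast_ssd SET (seq_page_cost = 1.0, random_page_cost = 1.5);
```

Lowering `random_page_cost` relative to `seq_page_cost` makes index scans look cheaper to the planner without forbidding sequential scans outright.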
