SQL: speeding up my UPDATE - Left Join in PostgreSQL

Posted 2025-01-13 22:43:45

I am trying to join one table to another using two IDs (patient_id, encounter_id) that are shared between them. Both tables are indexed on these IDs.

Here is the LHS:


tnx_prophy=# \d diagnosis
                       Table "public.diagnosis"
            Column             | Type | Collation | Nullable | Default
-------------------------------+------+-----------+----------+---------
 patient_id                    | text |           |          |
 encounter_id                  | text |           |          |
 code_system                   | text |           |          |
 code                          | text |           |          |
 principal_diagnosis_indicator | text |           |          |
 date                          | text |           |          |
Indexes:
    "idx_pt_enc_dx" btree (patient_id, encounter_id)

Here is the RHS:

tnx_prophy=# \d encounter
               Table "public.encounter"
    Column    | Type | Collation | Nullable | Default
--------------+------+-----------+----------+---------
 encounter_id | text |           |          |
 patient_id   | text |           |          |
 type         | text |           |          |
 enc_type     | text |           |          |
Indexes:
    "idx_pt_enc_enc" btree (patient_id, encounter_id)

The datasets are large (~500m rows?), but my UPDATE with the join seems to be taking much longer than I would like. And yes, I do want to update the table (not just generate a temporary table).

tnx_prophy=# ALTER TABLE diagnosis ADD COLUMN enc_type text;
ALTER TABLE
tnx_prophy=# UPDATE diagnosis
tnx_prophy-# SET enc_type = encounter.enc_type
tnx_prophy-# FROM encounter
tnx_prophy-# WHERE (diagnosis.patient_id, diagnosis.encounter_id) = (encounter.patient_id, encounter.encounter_id);

Any advice on how to do this faster? Or am I messing up the syntax here in some obvious way? Thanks a ton if anyone can help!

沩ん囻菔务 2025-01-20 22:43:45

\i tmp.sql

CREATE TABLE diagnosis
        ( patient_id    text
        , encounter_id  text
        -- , code_system        text
        -- , code       text
        , principal_diagnosis_indicator text
        -- , date       text
        );
CREATE INDEX idx_pt_enc_dx ON diagnosis (patient_id, encounter_id);

CREATE TABLE encounter
        ( encounter_id  text
        , patient_id    text
        , type  text
        , enc_type      text
        );
CREATE INDEX idx_pt_enc_enc ON encounter (patient_id, encounter_id);

INSERT INTO diagnosis(patient_id, encounter_id, principal_diagnosis_indicator) VALUES
 (1,1, 'influenza')
,(1,1, 'cancer')
,(2,1, 'influenza')
,(2,1, 'cancer')
        ;
INSERT INTO encounter(patient_id, encounter_id, enc_type) VALUES
 ( 1,1, 'OMG')
,( 1,1, 'WTF')
,( 2,1, 'WTF')
,( 2,1, 'OMG')
        ;

ALTER TABLE diagnosis ADD COLUMN enc_type text;

EXPLAIN ANALYZE
UPDATE diagnosis dst
SET enc_type = src.enc_type
FROM encounter src
WHERE (dst.patient_id, dst.encounter_id) = (src.patient_id, src.encounter_id)
AND dst.enc_type IS DISTINCT FROM src.enc_type -- both columns are NULLABLE
        ;

SELECT * FROM diagnosis;

Result:

DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
CREATE INDEX
CREATE TABLE
CREATE INDEX
INSERT 0 4
INSERT 0 4
ALTER TABLE
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Update on diagnosis dst  (cost=0.30..47.59 rows=7 width=140) (actual time=0.383..0.385 rows=0 loops=1)
   ->  Merge Join  (cost=0.30..47.59 rows=7 width=140) (actual time=0.139..0.232 rows=8 loops=1)
         Merge Cond: ((dst.patient_id = src.patient_id) AND (dst.encounter_id = src.encounter_id))
         Join Filter: (dst.enc_type IS DISTINCT FROM src.enc_type)
         ->  Index Scan using idx_pt_enc_dx on diagnosis dst  (cost=0.15..21.15 rows=520 width=134) (actual time=0.066..0.082 rows=4 loops=1)
         ->  Index Scan using idx_pt_enc_enc on encounter src  (cost=0.15..21.15 rows=520 width=102) (actual time=0.051..0.086 rows=7 loops=1)
 Planning Time: 1.278 ms
 Execution Time: 0.858 ms
(8 rows)

 patient_id | encounter_id | principal_diagnosis_indicator | enc_type 
------------+--------------+-------------------------------+----------
 1          | 1            | cancer                        | WTF
 1          | 1            | influenza                     | WTF
 2          | 1            | cancer                        | OMG
 2          | 1            | influenza                     | OMG
(4 rows)

Take a good look at this line:

Merge Join (cost=0.30..47.59 rows=7 width=140) (actual time=0.139..0.232 rows=8 loops=1)

The join produces 8 rows, but there are only 4 records in diagnosis! This happens because the search key into your lookup table is not unique. Every diagnosis record matches two encounter rows; PostgreSQL uses only one of them for the update, and which one is not predictable.
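
Since the lookup key (patient_id, encounter_id) is not unique in encounter, one way to make the result repeatable is to check for duplicate keys first and then update from a de-duplicated subquery. What follows is a minimal sketch, not a tested fix for the 500m-row case. It assumes that picking the lowest enc_type per key is an acceptable tie-break; that rule is an assumption, since the question does not say which encounter row should win.

```sql
-- 1. List lookup keys that occur more than once in encounter.
--    n_enc_types > 1 means the duplicates disagree on enc_type.
SELECT patient_id,
       encounter_id,
       count(*)                 AS n_rows,
       count(DISTINCT enc_type) AS n_enc_types
FROM encounter
GROUP BY patient_id, encounter_id
HAVING count(*) > 1;

-- 2. Update from a subquery that keeps exactly one row per key.
--    ASSUMPTION: the lowest enc_type wins (NULLs sort last by default).
UPDATE diagnosis dst
SET enc_type = src.enc_type
FROM (
    SELECT DISTINCT ON (patient_id, encounter_id)
           patient_id, encounter_id, enc_type
    FROM encounter
    ORDER BY patient_id, encounter_id, enc_type
) src
WHERE (dst.patient_id, dst.encounter_id) = (src.patient_id, src.encounter_id)
  AND dst.enc_type IS DISTINCT FROM src.enc_type; -- skip rows that already match
```

If the first query returns no rows on the real data, the key is already unique and the original UPDATE is deterministic as written. In that case a unique index on (patient_id, encounter_id) would document and enforce that.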
