PostgreSQL - A clean way to insert a record if it does not exist and update it if it does

Posted on 2024-12-05 17:44:33

Here's my situation. I have a table with a bunch of URLs and crawl
dates associated with them. When my program processes a URL, I want
to INSERT a new row with a crawl date. If the URL already exists, I
want to update the crawl date to the current datetime. With MS SQL or
Oracle I'd probably use a MERGE command for this. With MySQL I'd
probably use the ON DUPLICATE KEY UPDATE syntax.

I could do multiple queries in my program, which may or may not be
thread safe. I could write a SQL function which has various IF...ELSE
logic. However, for the sake of trying out Postgres features I've
never used before, I'm thinking about creating an INSERT rule -
something like this:

CREATE RULE Pages_Upsert AS ON INSERT TO Pages
  WHERE EXISTS (SELECT 1 from Pages P where NEW.Url = P.Url)
  DO INSTEAD
     UPDATE Pages SET LastCrawled = NOW(), Html = NEW.Html WHERE Url = NEW.Url;

This seems to actually work great. It probably loses some points from
a "code readability" standpoint, as someone looking at my code for
the first time would have to magically know about this rule, but I
guess that could be solved with good code commenting and
documentation.

Are there any other drawbacks to this idea, or maybe a "your idea
sucks, you should do it /this/ way instead" comment? I'm on PG 9.0 if
that matters.

UPDATE: Query plan since someone wanted it :)

"Insert  (cost=2.79..2.81 rows=1 width=0)"
"  InitPlan 1 (returns $0)"
"    ->  Seq Scan on pages p  (cost=0.00..2.79 rows=1 width=0)"
"          Filter: ('http://www.foo.com'::text = lower((url)::text))"
"  ->  Result  (cost=0.00..0.01 rows=1 width=0)"
"        One-Time Filter: ($0 IS NOT TRUE)"
""
"Update  (cost=2.79..5.46 rows=1 width=111)"
"  InitPlan 1 (returns $0)"
"    ->  Seq Scan on pages p  (cost=0.00..2.79 rows=1 width=0)"
"          Filter: ('http://www.foo.com'::text = lower((url)::text))"
"  ->  Result  (cost=0.00..2.67 rows=1 width=111)"
"        One-Time Filter: $0"
"        ->  Seq Scan on pages  (cost=0.00..2.66 rows=1 width=111)"
"              Filter: ((url)::text = 'http://www.foo.com'::text)"

Comments (5)

紙鸢 2024-12-12 17:44:33

OK, I managed to create a test case. The result is that the UPDATE part is always executed, even on a fresh insert. COPY seems to bypass the rule system.
[For clarity, I have put this into a separate reply.]

DROP TABLE pages CASCADE;
CREATE TABLE pages
    ( url VARCHAR NOT NULL  PRIMARY KEY
    , html VARCHAR
    , last TIMESTAMP
    );

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );

CREATE RULE Pages_Upsert AS ON INSERT TO pages
  WHERE EXISTS (SELECT 1 from pages P where NEW.url = P.url)
     DO INSTEAD (
     UPDATE pages SET html=new.html , last = NOW() WHERE url = NEW.url
    );

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );

INSERT INTO pages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM pages pp;

COPY pages(url,html,last) FROM STDIN;
www.example.com://pageX     stdin   2000-09-18 23:30:00
\.

SELECT * FROM pages;

The result:

              url              |    html    |            last            
-------------------------------+------------+----------------------------
 www.example.com://page1       | meuk1      | 2001-09-18 23:30:00
 www.example.com://page2       | meuk2      | 2011-09-18 23:48:30.775373
 www.example.com://page3       | meuk3      | 2011-09-18 23:48:30.783758
 www.example.com://page1/added | meuk1.html | 2011-09-18 23:48:30.792097
 www.example.com://page2/added | meuk2.html | 2011-09-18 23:48:30.792097
 www.example.com://page3/added | meuk3.html | 2011-09-18 23:48:30.792097
 www.example.com://pageX       | stdin      | 2000-09-18 23:30:00
 (7 rows)
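
One way to read this result is sketched below. For an ON INSERT rule with a qualification, the rewriter keeps the original INSERT where the qualification is not true and then runs the rule action where it is true. The original query runs first, so by the time the UPDATE runs the new row already exists and gets its timestamp overwritten. The table name pages_rewrite_demo is made up for illustration, and the two statements only approximate what the rewriter generates:

```sql
-- Hypothetical demo table (same columns as the test case, no rule attached).
CREATE TABLE pages_rewrite_demo
    ( url  VARCHAR NOT NULL PRIMARY KEY
    , html VARCHAR
    , last TIMESTAMP
    );

-- Step 1: the original INSERT, kept only where the rule qualification is not true.
INSERT INTO pages_rewrite_demo(url, html, last)
SELECT 'www.example.com://page2', 'meuk2', '2002-09-18 23:30:00'::timestamp
WHERE NOT EXISTS (SELECT 1 FROM pages_rewrite_demo p
                  WHERE p.url = 'www.example.com://page2');

-- Step 2: the rule action, kept only where the qualification is true.
-- The row from step 1 exists by now, so this UPDATE matches it
-- and replaces the supplied timestamp with NOW().
UPDATE pages_rewrite_demo
SET html = 'meuk2', last = NOW()
WHERE url = 'www.example.com://page2'
  AND EXISTS (SELECT 1 FROM pages_rewrite_demo p
              WHERE p.url = 'www.example.com://page2');

SELECT * FROM pages_rewrite_demo;
```

This would explain why page2 and page3 above carry the time of the test run rather than the 2002 and 2003 values that were supplied.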

UPDATE: Just to prove it can be done:

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE VIEW vpages AS (SELECT * from pages);

CREATE RULE Pages_Upsert AS ON INSERT TO vpages
  DO INSTEAD (
     UPDATE pages p0
     SET html=NEW.html , last = NOW() WHERE p0.url = NEW.url
    ;
     INSERT INTO pages (url,html,last)
    SELECT NEW.url, NEW.html, NEW.last
        WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = NEW.url)
    );

CREATE RULE Pages_Indate AS ON UPDATE TO vpages
  DO INSTEAD (
     INSERT INTO pages (url,html,last)
    SELECT NEW.url, NEW.html, NEW.last
        WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = OLD.url)
        ;
     UPDATE pages p0
     SET html=NEW.html , last = NEW.last WHERE p0.url = NEW.url
        ;
    );

INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );

INSERT INTO vpages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM vpages pp;
UPDATE vpages SET last = last + interval '-10 years' WHERE url = 'www.example.com://page1' ;

-- Copy does NOT work on views
-- COPY vpages(url,html,last) FROM STDIN;
-- www.example.com://pageX    stdin    2000-09-18 23:30:00
-- \.

SELECT * FROM vpages;

Result:

INSERT 0 1
INSERT 0 1
INSERT 0 3
UPDATE 1
              url              |    html    |        last         
-------------------------------+------------+---------------------
 www.example.com://page2       | meuk2      | 2002-09-18 23:30:00
 www.example.com://page3       | meuk3      | 2003-09-18 23:30:00
 www.example.com://page1/added | meuk1.html | 2021-09-18 23:30:00
 www.example.com://page2/added | meuk2.html | 2022-09-18 23:30:00
 www.example.com://page3/added | meuk3.html | 2023-09-18 23:30:00
 www.example.com://page1       | meuk1      | 1991-09-18 23:30:00
(6 rows)

The view is necessary to prevent the rewrite system from going into recursion.
Construction of a DELETE rule is left as an exercise for the reader.
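
For readers who want a starting point for that exercise, here is a minimal sketch of a DELETE rule on a view. It is not taken from the answer's test run. The names pages_del_demo, vpages_del_demo and Pages_Delete are assumptions chosen so the block can run on its own without touching the tables above:

```sql
-- Hypothetical stand-ins for pages and vpages.
CREATE TABLE pages_del_demo
    ( url  VARCHAR NOT NULL PRIMARY KEY
    , html VARCHAR
    , last TIMESTAMP
    );
CREATE VIEW vpages_del_demo AS (SELECT * FROM pages_del_demo);

-- Redirect deletes on the view to the underlying table.
-- OLD refers to the view row selected by the DELETE statement.
CREATE RULE Pages_Delete AS ON DELETE TO vpages_del_demo
  DO INSTEAD (
     DELETE FROM pages_del_demo WHERE url = OLD.url
    );

INSERT INTO pages_del_demo(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
INSERT INTO pages_del_demo(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00'::timestamp );

DELETE FROM vpages_del_demo WHERE url = 'www.example.com://page1';

SELECT * FROM vpages_del_demo;
```

After the DELETE, only the page2 row should remain in both the view and the table.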

云裳 2024-12-12 17:44:33

Some good points from someone who should know, or who is at least very close to someone who does ;-)

What are PostgreSQL RULEs good for?

Short story:

  • Do the rules work well with SERIAL and BIGSERIAL?
  • Do the rules work well with the RETURNING clauses of INSERT and UPDATE?
  • Do the rules work well with stuff like random()?

All these things boil down to the fact that the rule system is not row-driven but transforms your statements in ways you would never imagine.

Do yourself and your teammates a favour and stop using rules for things like that.

Edit: Your problem is well discussed in the PostgreSQL community. Search keywords are: MERGE, UPSERT.
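
For readers who land here on a newer server: PostgreSQL 9.5 added INSERT ... ON CONFLICT, which covers this use case without rules. It is not available on the 9.0 release the question mentions. A minimal sketch, using a made-up table name pages_conflict_demo:

```sql
-- Hypothetical demo table; requires PostgreSQL 9.5 or later.
CREATE TABLE pages_conflict_demo
    ( url  VARCHAR NOT NULL PRIMARY KEY
    , html VARCHAR
    , last TIMESTAMP
    );

-- First statement: no row for this URL yet, so it is inserted.
INSERT INTO pages_conflict_demo(url, html, last)
VALUES ('http://www.foo.com', 'first crawl', NOW())
ON CONFLICT (url) DO UPDATE
    SET html = EXCLUDED.html, last = EXCLUDED.last;

-- Second statement: the primary key conflicts, so the existing row is updated.
-- EXCLUDED refers to the row that was proposed for insertion.
INSERT INTO pages_conflict_demo(url, html, last)
VALUES ('http://www.foo.com', 'second crawl', NOW())
ON CONFLICT (url) DO UPDATE
    SET html = EXCLUDED.html, last = EXCLUDED.last;

SELECT * FROM pages_conflict_demo;
```

The conflict target (url) needs a unique index or constraint behind it, which the primary key provides here.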

北座城市 2024-12-12 17:44:33

I don't know if this gets too subjective, but here is what I think about your solution: it's all about semantics. When I do an insert, I expect an insert and not some fancy logic that maybe does an insert but maybe not. Indeed, that's what functions are for.

At first I'd try checking for the URL in your program and then choosing whether to insert or update. If that turned out to be too slow, I'd use a function. If you name it something like insert_or_update_url, you automatically get some documentation for free. The rewrite rule requires you to have some implicit knowledge, and I generally try to avoid that.

On the plus side: If someone copies the data but forgets rules and functions, your solution might break silently (but that may depend on other constraints), but a missing function goes down screaming. Don't get me wrong, I think your solution is very creative and smart. Just a bit too obscure for my taste.
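
A function along those lines could look like the sketch below. It uses the common update-then-insert retry loop in PL/pgSQL and relies only on features I believe are present in 9.0. The table name pages_fn_demo and the parameter names p_url and p_html are assumptions made for illustration:

```sql
-- Hypothetical demo table with the same shape as a pages table.
CREATE TABLE pages_fn_demo
    ( url  VARCHAR NOT NULL PRIMARY KEY
    , html VARCHAR
    , last TIMESTAMP
    );

CREATE FUNCTION insert_or_update_url(p_url VARCHAR, p_html VARCHAR)
RETURNS void AS $$
BEGIN
    LOOP
        -- Try the update first; FOUND tells us whether a row matched.
        UPDATE pages_fn_demo
        SET html = p_html, last = NOW()
        WHERE url = p_url;
        IF FOUND THEN
            RETURN;
        END IF;

        -- No row yet, so try to insert. If another session inserts the same
        -- URL in between, catch the unique-key error and retry the UPDATE.
        BEGIN
            INSERT INTO pages_fn_demo(url, html, last)
            VALUES (p_url, p_html, NOW());
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- do nothing here; the loop runs the UPDATE again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- The first call inserts, the second call updates the same row.
SELECT insert_or_update_url('http://www.foo.com', 'first crawl');
SELECT insert_or_update_url('http://www.foo.com', 'second crawl');
```

The caller sees an explicit function name instead of an INSERT that silently turns into an UPDATE, which is the readability point made above.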

秋凉 2024-12-12 17:44:33

There's an example of implementing upsert / merge using a simple function in the Postgres documentation.

Never use rules — they're evil.

小鸟爱天空丶 2024-12-12 17:44:33

You cannot refer to tables other than OLD and NEW in the rule qualification.
You should instead do this in the rule body.
This is all because a rule is just a way to inform the rewrite system about what transformations it should and should not perform. Rules are not triggers, executing for every row, but they give the query planner a fine massage and ask it nicely to rewrite the plan.
From the docs:

What is a rule qualification? It is a restriction that tells when the actions of the rule should be done and when not. This qualification can only reference the pseudorelations NEW and/or OLD, which basically represent the relation that was given as object (but with a special meaning).
