Tuning a subquery in postgres

Posted 2024-11-07 05:09:05

I have discovered some suspect data in a database. I am attempting to determine if a certain field, lastname, is correct. I have come up with the following query in postgres:

SELECT members."memberID", 
       members.lastname 
  FROM members 
 WHERE members."memberID" NOT IN (SELECT members."memberID" 
                                    FROM members 
                                   WHERE members.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*');

The subquery currently matches normal names and names with a hyphen. The parent query should display the members who don't match that pattern. Currently the query takes an incredible amount of time to run (I've never seen it complete). I am not sure why it takes so long or how to improve it.

Comments (2)

冬天旳寂寞 2024-11-14 05:09:05

NOT EXISTS

SELECT m."memberID", 
       m.lastname 
  FROM MEMBERS m 
 WHERE NOT EXISTS (SELECT NULL
                     FROM MEMBERS b
                    WHERE b.lastname ~* '[a-zA-Z]+([-][a-zA-Z]+)*'
                      AND b."memberID" = m."memberID");

LEFT JOIN / IS NULL

   SELECT m."memberID", 
          m.lastname 
     FROM MEMBERS m 
LEFT JOIN MEMBERS b ON b."memberID" = m."memberID"
                   AND b.lastname ~* '[a-zA-Z]+([-][a-zA-Z]+)*'
    WHERE b."memberID" IS NULL;
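
To see how a particular installation plans these forms compared with the original NOT IN, each can be run through EXPLAIN. The following is a minimal sketch: `members_demo` and its two rows are hypothetical stand-ins so that the statements run on their own, and in practice the real `members` table would be used instead. The plan depends on the PostgreSQL version, the table statistics, and memory settings such as `work_mem`, so no particular output is assumed here.

```sql
-- Hypothetical stand-in for the real members table.
CREATE TEMP TABLE members_demo ("memberID" integer PRIMARY KEY, lastname text);
INSERT INTO members_demo VALUES (1, 'Smith'), (2, '12345');

-- Plan for the NOT IN form: look for "SubPlan" or "hashed SubPlan" in the output.
EXPLAIN
SELECT m."memberID", m.lastname
  FROM members_demo m
 WHERE m."memberID" NOT IN (SELECT b."memberID"
                              FROM members_demo b
                             WHERE b.lastname ~* '[a-zA-Z]+([-][a-zA-Z]+)*');

-- Plan for the NOT EXISTS form: look for an anti join in the output.
EXPLAIN
SELECT m."memberID", m.lastname
  FROM members_demo m
 WHERE NOT EXISTS (SELECT NULL
                     FROM members_demo b
                    WHERE b.lastname ~* '[a-zA-Z]+([-][a-zA-Z]+)*'
                      AND b."memberID" = m."memberID");
```

A tiny table like this will not reproduce the slow plan; the point is only to show where to look once the statements are pointed at the real data.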

Summary

Quote:

PostgreSQL treats LEFT JOIN and NOT EXISTS equally, using same execution plan for both of them (namely a Hash Anti Join for the example above).

As for NOT IN, which is semantically different since its logic is trivalent and it can return NULL, PostgreSQL tries to take this into account and limits itself to using a filter against a subplan (a hashed subplan for a hashable resultset like in example above).

Since it need to search the hash table for each missing value twice (first time to find the value, second time to find a NULL), this method is a little less efficient.

A plain subplan, which the optimizer can resort to any time it decides the list will not fit into the memory, is very inefficient and the queries that have possibility of using it should be avoided like a plague.

That’s why in PostgreSQL 8.4 one should always use LEFT JOIN / IS NULL or NOT EXISTS rather than NOT IN to find the missing values.
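
The three-valued behaviour of NOT IN that the quote refers to can be seen with literal values. This is a small illustration, not tied to the members table:

```sql
-- NOT IN over a list that contains NULL can never be true for a missing value.
SELECT 1 NOT IN (2, 3)    AS without_null,  -- true
       1 NOT IN (2, NULL) AS with_null,     -- NULL (unknown), so a WHERE clause would drop the row
       2 NOT IN (2, NULL) AS found;         -- false
```

NOT EXISTS and LEFT JOIN / IS NULL do not have this extra NULL case to account for, which is why the planner is free to treat them as an anti join.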

Addendum

But as Andrew Lazarus points out, if there are no duplicates of memberid in the MEMBERS table, the query only needs to be:

SELECT m."memberID", 
       m.lastname 
  FROM MEMBERS m 
 WHERE m.lastname !~* '[a-zA-Z]+([-][a-zA-Z]+)*';

梅倚清风 2024-11-14 05:09:05

I like OMG Ponies' answer, but if memberID is unique (i.e., the PK), you can just drop the subquery altogether.

SELECT members."memberID", 
       members.lastname 
  FROM members 
 WHERE members.lastname !~ '[a-zA-Z]+([-][a-zA-Z]+)*';

(I deleted the case-insensitive operator since the regexp covers both cases.)
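
Two details of this simplified form are worth checking against the data. The sketch below uses a hypothetical temporary table, `members_demo`, and is not part of the original answers. First, the pattern is unanchored, so any lastname containing at least one letter matches it, and `!~` reports only names with no letters at all; adding `^` and `$` makes the pattern describe the whole value. Second, `!~` yields NULL for a NULL lastname, so those rows are not returned unless they are tested for explicitly.

```sql
-- Hypothetical sample data standing in for the real members table.
CREATE TEMP TABLE members_demo ("memberID" integer PRIMARY KEY, lastname text);
INSERT INTO members_demo VALUES
    (1, 'Smith'),
    (2, 'Smith-Jones'),
    (3, 'Sm1th'),
    (4, '12345'),
    (5, NULL);

-- Unanchored pattern: only rows whose lastname has no letters at all are reported.
SELECT "memberID", lastname
  FROM members_demo
 WHERE lastname !~ '[a-zA-Z]+([-][a-zA-Z]+)*';

-- Anchored pattern plus an explicit NULL test: reports every row whose whole
-- lastname is not a run of letters optionally joined by hyphens.
SELECT "memberID", lastname
  FROM members_demo
 WHERE lastname IS NULL
    OR lastname !~ '^[a-zA-Z]+(-[a-zA-Z]+)*$';
```

With this sample data the first query returns only member 4, while the second returns members 3, 4, and 5. Whether NULL lastnames should count as suspect is a decision about the data, so the `IS NULL` test is an assumption that can be dropped.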
