Enforcing two unique constraints for upserts in Postgres
I have a table where I save contact data:
Table "public.person"
Column | Type | Collation | Nullable | Default
-------------------------------+--------------------------+-----------+----------+------------------------------------
id | integer | | not null | nextval('person_id_seq'::regclass)
full_name | character varying | | |
role | character varying | | |
first_name | character varying | | |
last_name | character varying | | |
linkedin_slug | character varying | | |
email | character varying | | |
domain | character varying | | |
created_at | timestamp with time zone | | | now()
updated_at | timestamp with time zone | | | now()
Indexes:
"pk_person" PRIMARY KEY, btree (id)
"ix_person_domain" btree (domain)
"ix_person_email" btree (email)
"ix_person_updated_at" btree (updated_at)
"uq_person_full_name_domain" UNIQUE CONSTRAINT, btree (full_name, domain)
I add data to this table from several sources. Some sources have LinkedIn profile data about people, others have email data. Sometimes the full names are not equal even when they refer to the same person.
And I want to do upserts to avoid duplicated data. For now I'm using the constraint on (full_name, domain). I know it's an oversimplification, since there may be two different people with the same full name at the same company, but that's not a problem at the moment.
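For context, that upsert pattern looks roughly like the sketch below, targeting the uq_person_full_name_domain constraint from the schema above (the column list is abbreviated, and the COALESCE merge strategy is just one illustrative choice):

```sql
-- Sketch: upsert keyed on (full_name, domain), merging in any
-- new non-null values from the incoming row.
INSERT INTO person (full_name, domain, linkedin_slug, email)
VALUES ('Raffi Sarkissian', 'getlago.com', 'sarkissianraffi', NULL)
ON CONFLICT ON CONSTRAINT uq_person_full_name_domain
DO UPDATE SET
    linkedin_slug = COALESCE(EXCLUDED.linkedin_slug, person.linkedin_slug),
    email         = COALESCE(EXCLUDED.email, person.email),
    updated_at    = now();
```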
The problem comes when a person has different full names across the data sources I use but the same LinkedIn profile, so I know it's the same person. Or when they're associated with two domains belonging to the same company.
In those cases, I end up with duplicated rows for some people. For example:
full_name | domain | linkedin_slug |
---|---|---|
Raffi SARKISSIAN | getlago.com | sarkissianraffi |
Raffi Sarkissian | getlago.com | sarkissianraffi |
That one is trivial and could be solved by making the constraint on (lower(full_name), domain), but there are cases where the last name is not the same (in many countries people have more than one last name, and they may not always use all of them).
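The case-insensitive variant just mentioned can be implemented with a unique index on expressions, which ON CONFLICT can target by repeating the expression list (a sketch; the index name is illustrative):

```sql
-- Postgres allows unique indexes on expressions, even though
-- UNIQUE constraints proper only accept plain columns.
CREATE UNIQUE INDEX uq_person_lower_full_name_domain
    ON person (lower(full_name), domain);

-- The expression list makes this index the conflict arbiter.
INSERT INTO person (full_name, domain)
VALUES ('Raffi SARKISSIAN', 'getlago.com')
ON CONFLICT (lower(full_name), domain) DO NOTHING;
```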
Another example:
full_name | domain | linkedin_slug |
---|---|---|
Amir Manji | tenjin.com | amirmanji |
Amir Manji | tenjin.io | amirmanji |
Ideally I'd like to be able to enforce more than one constraint at the same time in Postgres, but I've seen that this isn't easy or available out of the box. I don't/can't create a unique constraint on (full_name, domain, linkedin_slug). And the solution from the accepted answer is not a good fit for my use case because I have many more columns than in that example, and I'd have to write a different upsert function for each data source (not all of them have the same attributes).
What I'm thinking of is writing a script that deduplicates the data 'manually' after inserting new rows, but I'm not sure whether there are better ways to address this.
How would you go about it?
Answers (2)
We cannot create unique constraints on functions of columns, but we can create unique constraints on generated columns that are themselves functions of columns, for example LOWER(full_name). For the domain I created a generated column holding the part before the first dot, and that column is then subject to a unique constraint. NB: stored generated columns are supported in Postgres 12 or higher.

By these means we detect that:

- Joe Bloggs is a duplicate (case-insensitive match on the full name),
- hello.UK is a duplicate of hello.com (same domain root),
- invalid is not an e-mail address.

Checking that an e-mail address is valid is complicated; a simple check that there is an @ followed by a dot at least weeds out telephone numbers etc. You will have to determine which constraints are enforceable and which may stop you entering data that should be acceptable.
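A sketch of this approach against the person table from the question (the column names l_fname and domainroot match the error messages below; note that making each generated column unique on its own is deliberately stricter than the original pair constraint):

```sql
-- Generated columns (Postgres 12+, STORED) can carry UNIQUE and
-- CHECK constraints, unlike arbitrary expressions.
ALTER TABLE person
    ADD COLUMN l_fname varchar
        GENERATED ALWAYS AS (lower(full_name)) STORED,
    ADD COLUMN domainroot varchar
        GENERATED ALWAYS AS (split_part(domain, '.', 1)) STORED;

ALTER TABLE person
    ADD CONSTRAINT uq_person_l_fname UNIQUE (l_fname),
    ADD CONSTRAINT uq_person_domainroot UNIQUE (domainroot),
    -- crude e-mail sanity check: an @ followed somewhere by a dot
    ADD CONSTRAINT ck_person_email CHECK (email ~ '@.+\.');
```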
Attempting to insert such duplicates then fails with errors like:

DETAIL: Key (l_fname)=(joe bloggs) already exists.
DETAIL: Failing row contains (1, Joe Bloggs, null, null, null, null, invalid, null, 2022-04-08 16:22:03.749316, 2022-04-08 16:22:03.751928, null, joe bloggs).
DETAIL: Key (domainroot)=(hello) already exists.
db<>fiddle here
Update: I ended up doing it by first running upserts that enforce (full_name, domain) as unique, then deduplicating on linkedin_slug with a script that basically groups by linkedin_slug, takes whatever values are not null, and then updates the original table with data from that subquery.
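A minimal sketch of that dedup step, assuming linkedin_slug identifies a person and using max() merely because it skips NULLs when picking a surviving value (the real script handles more columns):

```sql
-- Collapse duplicate rows sharing a linkedin_slug: keep the lowest id,
-- fold non-null values from the duplicates into it, delete the rest.
WITH merged AS (
    SELECT min(id)        AS keep_id,
           max(full_name) AS full_name,   -- max() ignores NULLs
           max(email)     AS email,
           max(domain)    AS domain,
           linkedin_slug
    FROM person
    WHERE linkedin_slug IS NOT NULL
    GROUP BY linkedin_slug
    HAVING count(*) > 1
), upd AS (
    UPDATE person p
    SET full_name  = m.full_name,
        email      = m.email,
        domain     = m.domain,
        updated_at = now()
    FROM merged m
    WHERE p.id = m.keep_id
)
DELETE FROM person p
USING merged m
WHERE p.linkedin_slug = m.linkedin_slug
  AND p.id <> m.keep_id;
```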
The full gist is here