Enforcing two unique constraints for upserts in Postgres
I have a table where I save contact data:
Table "public.person"
Column | Type | Collation | Nullable | Default
-------------------------------+--------------------------+-----------+----------+------------------------------------
id | integer | | not null | nextval('person_id_seq'::regclass)
full_name | character varying | | |
role | character varying | | |
first_name | character varying | | |
last_name | character varying | | |
linkedin_slug | character varying | | |
email | character varying | | |
domain | character varying | | |
created_at | timestamp with time zone | | | now()
updated_at | timestamp with time zone | | | now()
Indexes:
"pk_person" PRIMARY KEY, btree (id)
"ix_person_domain" btree (domain)
"ix_person_email" btree (email)
"ix_person_updated_at" btree (updated_at)
"uq_person_full_name_domain" UNIQUE CONSTRAINT, btree (full_name, domain)
I add data to this table from several sources. Some sources have LinkedIn profile data about people, others have email data. Sometimes the full names are not equal even when they refer to the same person.
And I want to do upserts to avoid duplicated data. For now I'm using the constraint on (full_name, domain). I know it's an oversimplification, since there may be two different people with the same full name at the same company, but that's not a problem at the moment.
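For context, that upsert pattern looks roughly like the sketch below, targeting the uq_person_full_name_domain constraint from the schema above (the column list is abbreviated, and the COALESCE merge strategy is just one illustrative choice):

```sql
-- Sketch: upsert keyed on (full_name, domain), merging in any
-- new non-null values from the incoming row.
INSERT INTO person (full_name, domain, linkedin_slug, email)
VALUES ('Raffi Sarkissian', 'getlago.com', 'sarkissianraffi', NULL)
ON CONFLICT ON CONSTRAINT uq_person_full_name_domain
DO UPDATE SET
    linkedin_slug = COALESCE(EXCLUDED.linkedin_slug, person.linkedin_slug),
    email         = COALESCE(EXCLUDED.email, person.email),
    updated_at    = now();
```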
The problem comes when a person has different full names across the data sources I use but the same LinkedIn profile, so I know it's the same person. Or when they're associated with two domains belonging to the same company.
In those cases, I end up with duplicated rows for some people. For example:
full_name | domain | linkedin_slug |
---|---|---|
Raffi SARKISSIAN | getlago.com | sarkissianraffi |
Raffi Sarkissian | getlago.com | sarkissianraffi |
That one is trivial and could be solved by making the constraint on (lower(full_name), domain), but there are cases where the last name is not the same (in many countries people have more than one last name, and they may not always use all of them).
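The case-insensitive variant just mentioned can be implemented with a unique index on expressions, which ON CONFLICT can target by repeating the expression list (a sketch; the index name is illustrative):

```sql
-- Postgres allows unique indexes on expressions, even though
-- UNIQUE constraints proper only accept plain columns.
CREATE UNIQUE INDEX uq_person_lower_full_name_domain
    ON person (lower(full_name), domain);

-- The expression list makes this index the conflict arbiter.
INSERT INTO person (full_name, domain)
VALUES ('Raffi SARKISSIAN', 'getlago.com')
ON CONFLICT (lower(full_name), domain) DO NOTHING;
```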
Another example:
full_name | domain | linkedin_slug |
---|---|---|
Amir Manji | tenjin.com | amirmanji |
Amir Manji | tenjin.io | amirmanji |
Ideally I'd like to be able to enforce more than one constraint at the same time in Postgres, but I've seen that this isn't easy or available out of the box. I don't/can't create a unique constraint on (full_name, domain, linkedin_slug). And the solution from the accepted answer is not a good fit for my use case because I have many more columns than in that example, and I'd have to write a different upsert function for each data source (not all of them have the same attributes).
What I'm thinking of is writing a script that deduplicates the data 'manually' after inserting new rows, but I'm not sure whether there are better ways to address this.
How would you go about it?
Answers (2)
We cannot create unique constraints on functions of columns, but we can create unique constraints on generated columns that are themselves functions of columns, for example LOWER(full_name). For the domain I created a generated column holding the part before the first dot, and that column is then subject to a unique constraint. NB: stored generated columns are supported in Postgres 12 or higher.

By these means we detect that:

- Joe Bloggs is a duplicate (case-insensitive match on the full name),
- hello.UK is a duplicate of hello.com (same domain root),
- invalid is not an e-mail address.

Checking that an e-mail address is valid is complicated; a simple check that there is an @ followed by a dot at least weeds out telephone numbers etc. You will have to determine which constraints are enforceable and which may stop you entering data that should be acceptable.
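A sketch of this approach against the person table from the question (the column names l_fname and domainroot match the error messages below; note that making each generated column unique on its own is deliberately stricter than the original pair constraint):

```sql
-- Generated columns (Postgres 12+, STORED) can carry UNIQUE and
-- CHECK constraints, unlike arbitrary expressions.
ALTER TABLE person
    ADD COLUMN l_fname varchar
        GENERATED ALWAYS AS (lower(full_name)) STORED,
    ADD COLUMN domainroot varchar
        GENERATED ALWAYS AS (split_part(domain, '.', 1)) STORED;

ALTER TABLE person
    ADD CONSTRAINT uq_person_l_fname UNIQUE (l_fname),
    ADD CONSTRAINT uq_person_domainroot UNIQUE (domainroot),
    -- crude e-mail sanity check: an @ followed somewhere by a dot
    ADD CONSTRAINT ck_person_email CHECK (email ~ '@.+\.');
```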
Attempting to insert such duplicates then fails with errors like:

DETAIL: Key (l_fname)=(joe bloggs) already exists.
DETAIL: Failing row contains (1, Joe Bloggs, null, null, null, null, invalid, null, 2022-04-08 16:22:03.749316, 2022-04-08 16:22:03.751928, null, joe bloggs).
DETAIL: Key (domainroot)=(hello) already exists.
db<>fiddle here
Update: I ended up doing it by first running upserts that enforce (full_name, domain) as unique, then deduplicating on linkedin_slug with a script that basically groups by linkedin_slug, takes whatever values are not null, and then updates the original table with data from that subquery.
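A minimal sketch of that dedup step, assuming linkedin_slug identifies a person and using max() merely because it skips NULLs when picking a surviving value (the real script handles more columns):

```sql
-- Collapse duplicate rows sharing a linkedin_slug: keep the lowest id,
-- fold non-null values from the duplicates into it, delete the rest.
WITH merged AS (
    SELECT min(id)        AS keep_id,
           max(full_name) AS full_name,   -- max() ignores NULLs
           max(email)     AS email,
           max(domain)    AS domain,
           linkedin_slug
    FROM person
    WHERE linkedin_slug IS NOT NULL
    GROUP BY linkedin_slug
    HAVING count(*) > 1
), upd AS (
    UPDATE person p
    SET full_name  = m.full_name,
        email      = m.email,
        domain     = m.domain,
        updated_at = now()
    FROM merged m
    WHERE p.id = m.keep_id
)
DELETE FROM person p
USING merged m
WHERE p.linkedin_slug = m.linkedin_slug
  AND p.id <> m.keep_id;
```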
The full gist is here