如何使用多个标准来识别 MySQL 表中的唯一人员

发布于 2024-11-30 00:53:26 字数 1683 浏览 1 评论 0原文

我有大约 20 万条记录表dat，其中包含人员的识别数据以及测试日期和结果，这是一个虚拟版本：

+----+---------+----------+------------+----------+------------+--------+
| id | surname | forename | dob        | SchoolID | testDate   | result |
+----+---------+----------+------------+----------+------------+--------+
|  1 | Smith   | Mary     | 1980-04-11 | NULL     | 2005-10-12 |  14.32 |
|  2 | Smith   | Mary     | 1980-04-11 | 1234     | 2007-03-02 |   18.1 |
|  3 | Jones   | Kim      | 1978-10-24 | 4657     | 2002-04-14 |  24.31 |
|  4 | Jones   | Kim      | NULL       | 4567     | 2002-10-08 |  19.02 |
|  5 | Roberts | Kim      | 1978-10-24 | 4567     | 2003-12-18 |  14.19 |
|  6 | Roberts | Kim      | 1978-10-24 | 4567     | 2005-02-11 |  18.26 |
+----+---------+----------+------------+----------+------------+--------+

我想识别进行过多次测试的人员，并且创建一个新列uniqueID，为独特的人提供新的 ID。遗憾的是，我的数据库不是很整洁，所以我需要使用几个标准将这些记录分成唯一的个体：

姓氏、名字和出生日期是相同的（因为 SchoolID 经常丢失或错误）
姓氏、名字和 SchoolID 是相同的（因为有时未输入出生日期，或者输入错误）
名字、出生日期和 SchoolID 相同（人们有时会结婚）
等。

在上面的示例中，表中只有两个唯一的人 Mary Smith 和 Kim Roberts nee Jones，所以这个 uniqueID 专栏最终应该是：

+----+----------+
| id | uniqueID |
+----+----------+
|  1 |        1 |
|  2 |        1 |
|  3 |        2 |
|  4 |        2 |
|  5 |        2 |
|  6 |        2 |
+----+----------+

虽然我已经涉足 MySQL 几年了，但我仍然是一个初学者。我已经搜索和尝试了几天，到目前为止我已经做到了这一点：

SELECT surname, forename, SchoolID 
   FROM dat 
   GROUP BY CONCAT(surname, forename, SchoolID);

例如，这是第二个标准（我计划在新表中提供自动递增键，然后加入回来）作为唯一 ID），但我对这些多重标准不太了解，所以非常感谢任何帮助！

谢谢缺口

原文

I have a ~200k record table dat of people with their identifying data as well as a test date and result, here's a dummy version:

+----+---------+----------+------------+----------+------------+--------+
| id | surname | forename | dob        | SchoolID | testDate   | result |
+----+---------+----------+------------+----------+------------+--------+
|  1 | Smith   | Mary     | 1980-04-11 | NULL     | 2005-10-12 |  14.32 |
|  2 | Smith   | Mary     | 1980-04-11 | 1234     | 2007-03-02 |   18.1 |
|  3 | Jones   | Kim      | 1978-10-24 | 4657     | 2002-04-14 |  24.31 |
|  4 | Jones   | Kim      | NULL       | 4567     | 2002-10-08 |  19.02 |
|  5 | Roberts | Kim      | 1978-10-24 | 4567     | 2003-12-18 |  14.19 |
|  6 | Roberts | Kim      | 1978-10-24 | 4567     | 2005-02-11 |  18.26 |
+----+---------+----------+------------+----------+------------+--------+

I would like to identify people who have had more than one test, and create a new column uniqueID that gives unique people a new id. Sadly, my database is not very tidy so I need to use several criteria to separate these records into unique individuals:

surname, forename and dob are the same (since SchoolID is often missing or wrong)
surname, forename and SchoolID are the same (since D.O.B. is sometimes not entered, or
entered wrong)
forename, dob and SchoolID are the same (people sometimes get married)
etc.

In the example above, there are just two unique people in the table Mary Smith and Kim Roberts nee Jones, so this uniqueID column should end up being:

+----+----------+
| id | uniqueID |
+----+----------+
|  1 |        1 |
|  2 |        1 |
|  3 |        2 |
|  4 |        2 |
|  5 |        2 |
|  6 |        2 |
+----+----------+

Although I've been dabbled a bit with MySQL for a few years I'm still a beginner. I've been searching and trying things for a few days, and so far I have managed this:

SELECT surname, forename, SchoolID 
   FROM dat 
   GROUP BY CONCAT(surname, forename, SchoolID);

which is the second criterion, for example, (that I was planning to give an auto incrementing key to in a new table, then join back in as the unique ID), but I'm not getting very with these multiple criteria, so any help much appreciated!

Thanks
Nick

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

淡忘如思 2024-12-07 00:53:27

您可以通过自连接来做到这一点。不过，我会分多个步骤进行——从最好到最弱的匹配。

因此，假设 schoolID 在存在时是可靠的，这就是查找重复项的方式：（

select *
from   dat parent, 
       dat child
where  parent.id      <> child.id
and    parent.schoolid =  child.schoolid

对古老的连接语法表示歉意）。

运行一下，看看是否有任何奇怪的现象；如果你喜欢它，请将其变成更新。在 MySQL 中使用自连接进行更新有点痛苦，最简单的方法是创建一个视图来连接。

create view dupes as

select   min(parent.id) as uniqueid, child.id
from     dat parent, 
         dat child
where    parent.schoolid =  child.schoolid
group by child.id

update  ignore dat, dupes
set     dat.uniqueid =   dupes.uniqueid
where    dat.id      = dupes.id

现在，您转到第二个最强的链接，例如：

    select *
    from   dat parent, 
           dat child
    where  parent.id      <> child.id
    and    parent.uniqueID is  null
    and    child.uniqueID  is  null
    and    parent.forename = child.forename
    and    parent.surname  = child.surname
    and    parent.dob      = child.dob

再次运行它，检查数据是否有异常，然后转换为更新语句。

接下来，放宽限制 - 如果我们之前已经将父记录与另一条记录进行了匹配，那么我们可以删除对 null 的检查：

select *
from   dat parent, 
       dat child
where  parent.id      <> child.id
and    child.uniqueID  is  null
and    parent.forename = child.forename
and    parent.surname  = child.surname
and    parent.dob      = child.dob

等等。
通过从最大可能性向下工作，您可以降低相当脆弱的相似性覆盖明确链接的风险（例如，两个名叫“Kim”的人碰巧出生于 1978 年 10 月 24 日，

通过首先运行选择，然后将它们转换为更新其次，避免人为错误。

You can do this through self joins. I'd do it in multiple steps, though - from best to weakest match.

So, assuming schoolID is reliable when present, this is how you find the duplicates:

select *
from   dat parent, 
       dat child
where  parent.id      <> child.id
and    parent.schoolid =  child.schoolid

(Apologies for antediluvian join syntax).

Run that, and see if you get any oddities; if you like it, turn it into an update. Updates with a self join are a little painful in MySQL, and the simplest way is to create a view to join.

create view dupes as

select   min(parent.id) as uniqueid, child.id
from     dat parent, 
         dat child
where    parent.schoolid =  child.schoolid
group by child.id

update  ignore dat, dupes
set     dat.uniqueid =   dupes.uniqueid
where    dat.id      = dupes.id

Now, you go to the second strongest link, something like:

    select *
    from   dat parent, 
           dat child
    where  parent.id      <> child.id
    and    parent.uniqueID is  null
    and    child.uniqueID  is  null
    and    parent.forename = child.forename
    and    parent.surname  = child.surname
    and    parent.dob      = child.dob

Again, run it, check the data for oddities, turn into an update statement.

Next, relax the restrictions - it's okay if we have previously matched the parent to another record, so we can remove the check for null:

select *
from   dat parent, 
       dat child
where  parent.id      <> child.id
and    child.uniqueID  is  null
and    parent.forename = child.forename
and    parent.surname  = child.surname
and    parent.dob      = child.dob

And so on, and so forth.
By working from strongest likelihood down, you reduce the risk of a fairly tenuous similarity overriding a clear link (e.g. two people called "Kim" who happened to be born on 1978-10-24,

By running selects first, and turning them into updates second, you avoid human error.

回复收藏 0 原文