如何使用多个标准来识别 MySQL 表中的唯一人员
我有大约 20 万条记录表dat
,其中包含人员的识别数据以及测试日期和结果,这是一个虚拟版本:
+----+---------+----------+------------+----------+------------+--------+
| id | surname | forename | dob | SchoolID | testDate | result |
+----+---------+----------+------------+----------+------------+--------+
| 1 | Smith | Mary | 1980-04-11 | NULL | 2005-10-12 | 14.32 |
| 2 | Smith | Mary | 1980-04-11 | 1234 | 2007-03-02 | 18.1 |
| 3 | Jones | Kim | 1978-10-24 | 4657 | 2002-04-14 | 24.31 |
| 4 | Jones | Kim | NULL | 4567 | 2002-10-08 | 19.02 |
| 5 | Roberts | Kim | 1978-10-24 | 4567 | 2003-12-18 | 14.19 |
| 6 | Roberts | Kim | 1978-10-24 | 4567 | 2005-02-11 | 18.26 |
+----+---------+----------+------------+----------+------------+--------+
我想识别进行过多次测试的人员,并且创建一个新列uniqueID
,为独特的人提供新的 ID。遗憾的是,我的数据库不是很整洁,所以我需要使用几个标准将这些记录分成唯一的个体:
- 姓氏、名字和出生日期是相同的(因为 SchoolID 经常丢失或错误)
- 姓氏、名字和 SchoolID 是相同的(因为有时未输入出生日期,或者 输入错误)
- 名字、出生日期和 SchoolID 相同(人们有时会结婚)
- 等。
在上面的示例中,表中只有两个唯一的人 Mary Smith 和 Kim Roberts nee Jones,所以这个 uniqueID
专栏最终应该是:
+----+----------+
| id | uniqueID |
+----+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 2 |
| 6 | 2 |
+----+----------+
虽然我已经涉足 MySQL 几年了,但我仍然是一个初学者。我已经搜索和尝试了几天,到目前为止我已经做到了这一点:
SELECT surname, forename, SchoolID
FROM dat
GROUP BY CONCAT(surname, forename, SchoolID);
例如,这是第二个标准(我计划在新表中提供自动递增键,然后加入回来)作为唯一 ID),但我对这些多重标准不太了解,所以非常感谢任何帮助!
谢谢 缺口
I have a ~200k record table dat
of people with their identifying data as well as a test date and result, here's a dummy version:
+----+---------+----------+------------+----------+------------+--------+
| id | surname | forename | dob | SchoolID | testDate | result |
+----+---------+----------+------------+----------+------------+--------+
| 1 | Smith | Mary | 1980-04-11 | NULL | 2005-10-12 | 14.32 |
| 2 | Smith | Mary | 1980-04-11 | 1234 | 2007-03-02 | 18.1 |
| 3 | Jones | Kim | 1978-10-24 | 4657 | 2002-04-14 | 24.31 |
| 4 | Jones | Kim | NULL | 4567 | 2002-10-08 | 19.02 |
| 5 | Roberts | Kim | 1978-10-24 | 4567 | 2003-12-18 | 14.19 |
| 6 | Roberts | Kim | 1978-10-24 | 4567 | 2005-02-11 | 18.26 |
+----+---------+----------+------------+----------+------------+--------+
I would like to identify people who have had more than one test, and create a new column uniqueID
that gives unique people a new id. Sadly, my database is not very tidy so I need to use several criteria to separate these records into unique individuals:
- surname, forename and dob are the same (since SchoolID is often missing or wrong)
- surname, forename and SchoolID are the same (since D.O.B. is sometimes not entered, or
entered wrong) - forename, dob and SchoolID are the same (people sometimes get married)
- etc.
In the example above, there are just two unique people in the table Mary Smith and Kim Roberts nee Jones, so this uniqueID
column should end up being:
+----+----------+
| id | uniqueID |
+----+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 2 |
| 6 | 2 |
+----+----------+
Although I've been dabbled a bit with MySQL for a few years I'm still a beginner. I've been searching and trying things for a few days, and so far I have managed this:
SELECT surname, forename, SchoolID
FROM dat
GROUP BY CONCAT(surname, forename, SchoolID);
which is the second criterion, for example, (that I was planning to give an auto incrementing key to in a new table, then join back in as the unique ID), but I'm not getting very with these multiple criteria, so any help much appreciated!
Thanks
Nick
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以通过自连接来做到这一点。不过,我会分多个步骤进行——从最好到最弱的匹配。
因此,假设 schoolID 在存在时是可靠的,这就是查找重复项的方式:(
对古老的连接语法表示歉意)。
运行一下,看看是否有任何奇怪的现象;如果你喜欢它,请将其变成更新。在 MySQL 中使用自连接进行更新有点痛苦,最简单的方法是创建一个视图来连接。
现在,您转到第二个最强的链接,例如:
再次运行它,检查数据是否有异常,然后转换为更新语句。
接下来,放宽限制 - 如果我们之前已经将父记录与另一条记录进行了匹配,那么我们可以删除对 null 的检查:
等等。
通过从最大可能性向下工作,您可以降低相当脆弱的相似性覆盖明确链接的风险(例如,两个名叫“Kim”的人碰巧出生于 1978 年 10 月 24 日,
通过首先运行选择,然后将它们转换为更新其次,避免人为错误。
You can do this through self joins. I'd do it in multiple steps, though - from best to weakest match.
So, assuming schoolID is reliable when present, this is how you find the duplicates:
(Apologies for antediluvian join syntax).
Run that, and see if you get any oddities; if you like it, turn it into an update. Updates with a self join are a little painful in MySQL, and the simplest way is to create a view to join.
Now, you go to the second strongest link, something like:
Again, run it, check the data for oddities, turn into an update statement.
Next, relax the restrictions - it's okay if we have previously matched the parent to another record, so we can remove the check for null:
And so on, and so forth.
By working from strongest likelihood down, you reduce the risk of a fairly tenuous similarity overriding a clear link (e.g. two people called "Kim" who happened to be born on 1978-10-24,
By running selects first, and turning them into updates second, you avoid human error.
您尝试过“选择不同的”吗?
have you tried "SELECT distinct"?