How do I build a matching algorithm?

Posted 2024-08-19 23:54:49

I've never built an algorithm for matching before and don't really know where to start. So here is my basic setup and why I'm doing it. Feel free to correct me if I'm not asking the right questions.

I have a database of names and unique identifiers for people. Several generated identifiers (internally generated and some third party), last name, first name, and birth date are the primary ones that I would be using.

Several times throughout the year I receive a list from a third party that needs to be imported and tied to the existing people in my database, but the data is never as clean as mine. IDs could change, birth dates could have typos, names could have typos, last names could change, etc.

Each import could have 20,000 records, so even at 99% accuracy that's still 200 records I'd have to go in and match manually. I think I'm looking for more like 99.9% accuracy when it comes to matching the incoming people to my users.

So, how do I go about making an algorithm that can figure this out?

PS Even if you don't have an exact answer, pointers to some reference materials would also be helpful.

PPS Some examples, similar to what m3rLinEz wrote:

ID: 9876234 Fname: Jose     LName: Guitierrez       Birthdate:01/20/84  '- Original'

ID: 9876234 Fname: Jose     LName: Guitierrez       Birthdate:10/20/84  '- Typo in birth date'
ID: 0876234 Fname: Jose     LName: Guitierrez       Birthdate:01/20/84  '- Wrong ID'
ID: 9876234 Fname: Jose     LName: Guitierrez-Brown Birthdate:01/20/84  '- Hyphenated last name'
ID: 9876234 Fname: Jose, A. LName: Guitierrez       Birthdate:01/20/84  '- Added middle initial'
ID: 3453555 Fname: Joseph   LName: Guitierrez       Birthdate:01/20/84  '- Probably someone else with same birthdate and same last name'

6 Answers

蹲在坟头点根烟 2024-08-26 23:54:49

You might be interested in Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.

It is possible to compare each of your fields and compute the total distance, and by trial and error you may discover an appropriate threshold for treating records as a match. I haven't implemented this myself, just thought of the idea :}

For example:

  • Record A - ID: 4831213321, Name: Jane
  • Record B - ID: 431213321, Name: Jann
  • Record C - ID: 4831211021, Name: John

The distance between A and B will be lower than between A and C or between B and C, which indicates a better match.
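
A minimal sketch of this idea in Python, under some assumptions: the compared fields, the simple summing of per-field distances, and the threshold are illustrative rather than tuned, and a library such as rapidfuzz could replace the hand-rolled helper:

    # Edit-distance scoring across fields. Fields and threshold here
    # are illustrative assumptions, not tuned values.

    def levenshtein(a: str, b: str) -> int:
        # Minimum number of single-character insertions, deletions,
        # or substitutions needed to turn a into b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def total_distance(rec_a: dict, rec_b: dict, fields=("id", "name")) -> int:
        # Sum the edit distances over the compared fields.
        return sum(levenshtein(rec_a[f], rec_b[f]) for f in fields)

    a = {"id": "4831213321", "name": "Jane"}
    b = {"id": "431213321",  "name": "Jann"}
    c = {"id": "4831211021", "name": "John"}

    print(total_distance(a, b))       # 2 (one dropped digit, one changed letter)
    print(total_distance(a, c))       # 5
    print(total_distance(a, b) <= 2)  # True -- within an example threshold of 2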

送君千里 2024-08-26 23:54:49

When it comes to something like this, do not reinvent the wheel. Levenshtein distance is probably your best bet if you HAVE to do this yourself, but otherwise, do some research on existing solutions that do database queries and fuzzy searches. They've been doing it longer than you, and they'll probably do it better, too.

Good luck!

薄荷→糖丶微凉 2024-08-26 23:54:49

If you're dealing with data sets of this size and imports from different sources, you may want to look into an Identity Management solution. I'm most familiar with Sun Identity Manager, but it may be overkill for what you're trying to do. It might still be worth looking into.

烂人 2024-08-26 23:54:49

If the data you are getting from third parties is consistent (same format each time), I'd probably create a table for each third party you get data from, then import each new set of data into that party's table every time. You can then join two tables on their common columns with an SQL statement, so queries can pull data from multiple tables while making it look like it came from one unified table. Similarly, imported records that have no match in your table can be found and then paired manually. This way you keep your 'clean' data separate from the junk you get from third parties. If you wanted a true import, you could then use that joined result to create a third table containing all your data.
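
A small sketch of that layout using SQLite from Python; the table names, columns, and join keys are made-up examples rather than a prescribed schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # One table for your own data, one per third-party feed (names assumed).
    conn.executescript("""
        CREATE TABLE people        (id TEXT, fname TEXT, lname TEXT, birthdate TEXT);
        CREATE TABLE vendor_import (id TEXT, fname TEXT, lname TEXT, birthdate TEXT);
    """)

    # Imported rows that match an existing person on the common columns.
    matched = conn.execute("""
        SELECT p.id, p.fname, p.lname, p.birthdate
        FROM people p
        JOIN vendor_import v
          ON p.id = v.id AND p.birthdate = v.birthdate
    """).fetchall()

    # Imported rows with no counterpart, left over for manual pairing.
    unmatched = conn.execute("""
        SELECT v.*
        FROM vendor_import v
        LEFT JOIN people p
          ON p.id = v.id AND p.birthdate = v.birthdate
        WHERE p.id IS NULL
    """).fetchall()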

不乱于心 2024-08-26 23:54:49

I would start with the easy, near-100%-certain matches and handle them first, so now you have a list of, say, 200 that need fixing.

For the remaining rows you can use a simplified version of Bayes' theorem.

For each unmatched row, calculate the likelihood that it is a match for each row in your data set, assuming that the data contains certain changes which occur with certain probabilities. For example, a person changes their surname with probability 0.1% (possibly also depending on gender), changes their first name with probability 0.01%, and has a single typo with probability 0.2% (use Levenshtein distance to count the number of typos). Other fields also change with certain probabilities. For each row, calculate the likelihood that it matches, considering all the fields that have changed, then pick the one with the highest probability of being a match.

For example, a row with only a small typo in one field but equal on all others would have a 0.2% chance of being a match, whereas a row that differs in many fields might have only a 0.0000001% chance. So you pick the row with the small typo.
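
A rough sketch of this scoring in Python. The probabilities are the illustrative figures from this answer, the field names and the independence assumption are simplifications, and the edit-distance helper is the same one sketched in the first answer:

    # Likelihood scoring per the description above. The probabilities
    # are the answer's illustrative figures; field names are assumptions.

    def levenshtein(a: str, b: str) -> int:
        # Same edit-distance helper as in the first answer's sketch.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    P_SURNAME_CHANGE   = 0.001    # 0.1%
    P_FIRSTNAME_CHANGE = 0.0001   # 0.01%
    P_TYPO             = 0.002    # 0.2% per single-character edit

    def match_likelihood(incoming: dict, candidate: dict) -> float:
        # Treat each observed difference as an independent event and
        # multiply the probabilities of the changes that explain it.
        p = 1.0
        if incoming["lname"] != candidate["lname"]:
            p *= P_SURNAME_CHANGE
        if incoming["fname"] != candidate["fname"]:
            # A one-edit difference is more plausibly a typo than a new name.
            if levenshtein(incoming["fname"], candidate["fname"]) == 1:
                p *= P_TYPO
            else:
                p *= P_FIRSTNAME_CHANGE
        if incoming["birthdate"] != candidate["birthdate"]:
            typos = levenshtein(incoming["birthdate"], candidate["birthdate"])
            p *= P_TYPO ** typos
        return p

    def best_match(incoming: dict, people: list) -> dict:
        # Pick the database row with the highest match likelihood.
        return max(people, key=lambda row: match_likelihood(incoming, row))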

数理化全能战士 2024-08-26 23:54:49

Regular expressions are what you need; why reinvent the wheel?
