Fuzzy data matching for personal demographic information

Posted 2024-09-10 08:28:45 · 1181 words · 5 views · 0 comments

Let's say I have a database filled with people with the following data elements:

PersonID (meaningless surrogate autonumber)
FirstName
MiddleInitial
LastName
NameSuffix
DateOfBirth
AlternateID (like an SSN, Military ID, etc.)

I receive many data feeds in all kinds of formats, with every reasonable variation on these pieces of information you could think of. Some examples are:

FullName, DOB
FullName, Last 4 SSN
First, Last, DOB
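
To make the feed shapes above comparable, each incoming row could first be reduced to one common record. The following is only a minimal sketch under my own assumptions: `IncomingPerson`, `parse_full_name`, and the `SUFFIXES` set are hypothetical names, and the parser only understands "First [Middle] Last [Suffix]" and "Last, First [Middle] [Suffix]".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical suffix list; extend it for whatever the real feeds contain.
SUFFIXES = {"JR", "SR", "II", "III", "IV"}


@dataclass
class IncomingPerson:
    """Common shape for a feed row; any field may be missing (None)."""
    first: Optional[str] = None
    middle_initial: Optional[str] = None
    last: Optional[str] = None
    suffix: Optional[str] = None
    dob: Optional[str] = None
    id_last4: Optional[str] = None


def _pop_suffix(tokens: List[str]) -> Tuple[List[str], Optional[str]]:
    """Remove a trailing generational suffix such as 'Jr.' if present."""
    if tokens and tokens[-1].rstrip(".").upper() in SUFFIXES:
        return tokens[:-1], tokens[-1].rstrip(".").upper()
    return tokens, None


def parse_full_name(full_name: str) -> IncomingPerson:
    """Split a FullName value into parts.

    Limitation: multi-word last names ("Van Der Berg") are not handled
    in the "First Last" form; only the final token is taken as the last name.
    """
    text = full_name.strip()
    if "," in text:
        # "Last, First [Middle] [Suffix]"
        last_part, _, rest = text.partition(",")
        given, suffix = _pop_suffix(rest.replace(",", " ").split())
        last_tokens, last_suffix = _pop_suffix(last_part.split())
        suffix = suffix or last_suffix
    else:
        # "First [Middle] Last [Suffix]"
        tokens, suffix = _pop_suffix(text.split())
        given, last_tokens = tokens[:-1], tokens[-1:]
    first = given[0] if given else None
    middle = given[1][0].upper() if len(given) > 1 else None
    last = " ".join(last_tokens) or None
    return IncomingPerson(first, middle, last, suffix)
```

A "First, Last, DOB" feed would skip the parser and fill the same record directly, so the matching step only ever sees one shape.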

When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.

Some of the complexities are:

Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
Name matching has some difficulties. John Doe Jr. is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one, because there's no way to determine which to pick.
First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
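
To make the requirements above concrete, here is a rough sketch of the kind of weighted scoring I have in mind. Everything in it is an assumption for illustration: the weights, the thresholds, the tiny `NICKNAMES` and `SUFFIX_GENERATION` tables, the dict keys, and the function names are hypothetical, and dates are assumed to be normalized to one string format already. Missing fields are left out of both the earned and the possible weight, and a tie between candidates goes to manual review instead of being auto-matched.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Illustrative weights only; real values would need tuning on reviewed matches.
WEIGHTS = {"alt_id": 0.50, "dob": 0.20, "last": 0.15, "first": 0.10, "suffix": 0.05}
# Tiny hypothetical lookup tables; a real nickname list would be far larger.
NICKNAMES = {"bob": "robert", "jon": "john", "tom": "thomas"}
SUFFIX_GENERATION = {"sr": "1", "jr": "2", "ii": "2", "iii": "3"}


def field_similarity(field: str, a: str, b: str) -> float:
    """Return a similarity in [0, 1] for one field."""
    a, b = a.strip().lower(), b.strip().lower()
    if field == "first":
        return float(NICKNAMES.get(a, a) == NICKNAMES.get(b, b))
    if field == "suffix":
        a, b = a.rstrip("."), b.rstrip(".")
        return float(SUFFIX_GENERATION.get(a, a) == SUFFIX_GENERATION.get(b, b))
    if field == "alt_id":
        a, b = re.sub(r"\D", "", a), re.sub(r"\D", "", b)
        if not a or not b:
            return 0.0
        if min(len(a), len(b)) == 4:
            # "Last 4 SSN" feeds: compare only the last four digits.
            # A partial ID is weaker evidence; a real system might discount it.
            return float(a[-4:] == b[-4:])
        return float(a == b)
    if field == "last":
        return SequenceMatcher(None, a, b).ratio()  # tolerate small typos
    return float(a == b)  # dob: exact match on the normalized string


def match_score(incoming: Dict[str, str], candidate: Dict[str, str]) -> Optional[float]:
    """Weighted score over fields present on both sides; None if none overlap."""
    possible = earned = 0.0
    for field, weight in WEIGHTS.items():
        a, b = incoming.get(field), candidate.get(field)
        if not a or not b:
            continue  # a missing field is excluded, not counted as a mismatch
        possible += weight
        earned += weight * field_similarity(field, a, b)
    return earned / possible if possible else None


def triage(incoming: Dict[str, str], candidates: List[dict],
           auto_threshold: float = 0.85, min_margin: float = 0.10) -> Tuple[str, list]:
    """Auto-match only when the best score is high AND clearly ahead of the rest."""
    scored = []
    for cand in candidates:
        score = match_score(incoming, cand)
        if score is not None:
            scored.append((score, cand["person_id"]))
    scored.sort(reverse=True)
    if scored and scored[0][0] >= auto_threshold:
        runner_up = scored[1][0] if len(scored) > 1 else 0.0
        if scored[0][0] - runner_up >= min_margin:
            return "auto", [scored[0][1]]
    return "review", [pid for _, pid in scored]
```

With these example weights, an exact SSN match with a different first name outscores an exact name match with a different SSN, and a bare "John Doe" that ties between a Jr. and a Sr. record is sent to review. Remembering manual matches would be a separate lookup keyed on the normalized incoming values, checked before any scoring.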

I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
