假设我有一个数据库,其中包含具有以下数据元素的人员:
- PersonID(无意义的代理自动编号)
- FirstName
- MiddleInitial
- LastName
- NameSuffix
- DateOfBirth
- AlternateID(如 SSN、Militarty ID 等)
我从各种格式中获取了大量数据源以及您能想到的这些信息的各种合理变化。一些示例是:
- FullName、DOB
- FullName、Last 4 SSN
- First、Last、DOB
当此数据传入时,我需要编写一些内容来匹配它。我不需要或期望获得超过 80% 的匹配率。自动匹配后,我会将不确定的匹配显示在网页上,供某人手动匹配。
其中一些复杂性是:
- 一些数据匹配比其他数据匹配更好,我想为这些数据分配权重。例如,如果 SSN 完全匹配,但由于有人使用中间名而导致姓名被关闭,那么我想为该匹配分配比名称完全匹配但 SSN 关闭时更高的置信度值。
- 名称匹配有一些困难。 John Doe Jr. 与 John Doe II 相同,但与 John Doe Sr. 不同,如果我得到 John Doe 而没有其他信息,我需要确保系统不会选择一个,因为无法确定选择谁。
- 名字匹配真的很难。您有 Bob/Robert、John/Jon/Jonathon、Tom/Thomas 等。
- 仅仅因为我有一个包含 FullName+DOB 的提要,并不意味着每条记录都会填充 DOB 字段。我不想仅仅因为不匹配的 DOB 破坏了匹配的分数而错过链接。如果缺少某个字段,我想将其从可用于匹配的元素中排除。
- 如果有人手动匹配,我希望他们的匹配影响所有未来的匹配。因此,如果我们再次获得相同的精确数据,下次没有理由不自动匹配它。
我已经看到SSIS有模糊匹配,但我们目前不使用SSIS,而且我发现它非常笨拙并且几乎不可能进行版本控制,所以它不是我的首选工具。但如果这是最好的,请告诉我。否则,是否有任何(最好是免费的,最好是基于 .NET 或 T-SQL 的)您曾经使用过解决此类问题的工具/库/实用程序/技术吗?
Let's say I have a database filled with people with the following data elements:
- PersonID (meaningless surrogate autonumber)
- FirstName
- MiddleInitial
- LastName
- NameSuffix
- DateOfBirth
- AlternateID (like an SSN, Militarty ID, etc.)
I get lots of data feeds in from all kinds of formats with every reasonable variation on these pieces of information you could think of. Some examples are:
- FullName, DOB
- FullName, Last 4 SSN
- First, Last, DOB
When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.
Some of the complexities are:
- Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
- The name matching has some difficulties. John Doe Jr is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one because there's no way to determine who to pick.
- First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
- Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
- If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
发布评论
评论(4)
您可以通过多种方法来解决此问题,但是在完成此类事情之前,我将继续在这里指出,在人与人之间进行“不正确”匹配时,您会面临很大的风险。
您的输入数据非常稀疏,并且考虑到您所拥有的数据,它并不是最独特的,如果并非所有值都在那里。
例如,对于您的名字、姓氏、出生日期情况,如果您拥有所有记录的所有三个部分,那么匹配对您来说会更容易使用。如果不是,那么您就会面临很多潜在的问题。
从更“粗略”的角度来看,您可能采取的一种方法是简单地使用一系列查询来创建一个流程,该流程可以简单地识别和分类匹配的条目。
例如,首先检查姓名和 SSN 是否完全匹配,如果存在则标记它,将其标记为 100%,然后继续进行下一组。然后,您可以明确定义模糊的位置,以便了解匹配的潜在后果。
最后,您将得到一个列表,其中包含指示匹配类型的标志(如果该记录有的话)。
There are a number of ways that you can go about this, but having done this type of thing before i will go ahead and put out here that you run a lot of risk in having "incorrect" matches between people.
Your input data is very sparse, and given what you have it isn't the most unique, IF not all values are there.
For example with your First Name, Last Name, DOB situation, if you have all three parts for ALL records, then the matching gets a LOT easier for you to work with. If not though you expose yourself to a lot of potential for issue.
One approach you might take, on the more "crude" side of things is to simply create a process using a series of queries that simply identifies and classifies matching entries.
For example first check on an exact match on name and SSN, if that is there flag it, note it as 100% and move on to the next set. Then you can explicitly define where you are fuzzy so you know the potential ramification of your matching.
In the end you would have a list with flags indicating the match type, if any for that record.
这是一个名为记录链接的问题。
虽然它适用于 python 库,但 dedupe 的文档给出了 关于如何全面解决问题的良好概述。
This is a problem called record linkage.
While it's for a python library, the documentation for dedupe gives a good overview of how to approach the problem comprehensively.
看一下 Levenshtein 算法,它允许您获取“两个字符串之间的距离”,然后将其除以字符串的长度以获得百分比匹配。
http://en.wikipedia.org/wiki/Levenshtein_distance
我之前已经实现了这个效果很好成功。这是一家医疗保健公司的提供商门户网站,提供商在该网站上注册。匹配是通过他们的门户注册并在主要医疗系统中找到相应的记录。参与此操作的处理者会看到最有可能的匹配项,按百分比降序排列,并且可以轻松选择正确的帐户。
Take a look at the Levenshtein Algoritm, which allows you to get 'the distance between two strings,' which can then be divided into the length of the string to get a percentage match.
http://en.wikipedia.org/wiki/Levenshtein_distance
I have previously implemented this to great success. It was a provider portal for a healthcare company, and providers registered themselves on the site. The matching was to take their portal registration and find the corresponding record in the main healthcare system. The processors who attended to this were presented with the most likely matches, ordered by percentage descending, and could easily choose the right account.
如果误报不会困扰您并且您的语言主要是英语,您可以尝试 Soundex。 SQL Server 将其作为内置函数。 Soundex 不是最好的,但它确实可以进行模糊匹配并且很受欢迎。另一种选择是变音位。
If the false positives don't bug you and your languages are primarily English, you can try algorithms like Soundex. SQL Server has it as a built-in function. Soundex isn't the best, but it does do a fuzzy matching and is popular. Another alternative is metaphone.