What logic should I use to roll up/merge multiple person entities? (Tight, but fuzzy enough to widen the match range)
I have multiple instances of person entities that are often the same person. Where First-Last is the same at the same address, it's a no-brainer to merge them or roll them up.
However, due to data entry inconsistencies, there must be a way to deviate a bit from exactness. I think the credit card industry does a little of this: zip plus street number, or street name? ...something of that nature.
In order to solidify my matching, I cleaned up the address strings, trying to make them as standard as possible ("Hwy" --> "Highway", etc.).
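A minimal sketch of that kind of cleanup, assuming a small hand-written abbreviation table. The `ABBREVIATIONS` dictionary and the `normalize_address` function are hypothetical names for illustration, not part of my actual code; a real table would be much longer (USPS Publication 28 lists the standard street suffix abbreviations).

```python
import re

# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {
    "hwy": "highway",
    "st": "street",
    "ave": "avenue",
    "rd": "road",
}

def normalize_address(address: str) -> str:
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    cleaned = re.sub(r"[^\w\s]", " ", address.lower())
    words = cleaned.split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

print(normalize_address("12 Old Hwy. 9"))
```

A word-by-word lookup like this is deliberately naive: it cannot tell "St" meaning "Street" from "St" meaning "Saint", so some entries would need position-aware rules.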
I need something that will still match records that would look obviously the same to a person glancing at them, but that fail to have exactly matching data.
Here is my initial thought: concatenate a string made up of the following:
First Initial
LEFT8 of the LastName (allows inconsistent endings, such as "Esq." or "CPA")
LEFT3 of Zip
Street Number
LEFT8 of the StreetName (not Addr1 -- "Oak" for "8 N Oak Street")
Did I miss something here? I think I made it loose enough to overcome normal data entry inconsistencies, but tight enough to avoid incorrect matches.
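The key described above could be sketched roughly as follows. The function name `match_key`, its parameter names, the `|` separator, and the uppercasing are assumptions added for illustration; only the five components come from the list.

```python
def match_key(first, last, zip_code, street_number, street_name):
    """Build a loose match key from the five components listed above."""
    parts = [
        first.strip()[:1],        # first initial
        last.strip()[:8],         # LEFT8 of the last name
        zip_code.strip()[:3],     # LEFT3 of the zip
        street_number.strip(),    # full street number
        street_name.strip()[:8],  # LEFT8 of the street name
    ]
    # Uppercase so that case differences do not break a match (an assumption).
    return "|".join(part.upper() for part in parts)

print(match_key("Robert", "Smithson Esq.", "90210", "8", "Oak"))
print(match_key("Rob", "Smithson", "90211-1234", " 8", "oak"))
```

Both calls produce the same key, which is the intended looseness. One thing this sketch makes visible: truncating to eight characters only hides a trailing "Esq." or "CPA" when the surname itself is at least eight characters long, so "Smith" and "Smith CPA" would still produce different keys.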
Comments (2)
I was involved in a project to clean up name and address data for a large financial institution. We achieved an automatic success rate of about 98.4%, but unfortunately this still left about 150,000 mismatches.
The way we attacked the problem was to build up (over time) a rule base of the types of errors that could occur, and to extend the fuzziness of the logic to cover each identified class of error.
A significant amount of data cleansing can indeed be done by reference to (UK) post codes and house number and/or name. In the UK fuzziness can be introduced by consideration of the first part of the post code - which determines a wide area. I'm not clear whether the same applies to zip codes.
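As a rough illustration of using the first part of a UK post code, here is a minimal sketch. It relies on the fact that the second (inward) part of a UK post code is always three characters, so the first (outward) part is whatever precedes it. The function name `outward_code` is hypothetical, and the sketch does not validate that the input is a real post code.

```python
def outward_code(postcode: str) -> str:
    """Return the first (outward) part of a UK post code."""
    compact = "".join(postcode.split()).upper()
    # Without the space, a UK post code has at least five characters.
    if len(compact) < 5:
        raise ValueError(f"too short to be a UK post code: {postcode!r}")
    return compact[:-3]

print(outward_code("SW1A 1AA"))
print(outward_code("m1 1ae"))
```

Matching on the outward part alone is very loose, so it would normally be combined with the house number and/or name rather than used by itself.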
However this approach does not deal well with addresses that are out of the normal run - my own address is an example; I live on a boat, and as a consequence have some additional pieces of address in order to ensure correct addressing.
Anomalies of this sort are always likely to need manual intervention.
Incidentally, your assertion that it's a no-brainer to merge or roll up people whose First-Last is the same at the same address needs to be challenged. The most difficult cases we had in data cleansing were precisely those where two people of the same name (e.g. father and son) lived at the same address. Equally, if somebody of the same name buys the property (which happens), then again there are problems of "re-duplication".
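One partial guard against the father-and-son case is to hold back from automatic merging whenever the two names carry different generational suffixes. This is only a sketch under that assumption: the suffix list, the function names, and the "review"/"merge" labels are hypothetical and were not part of the project described above.

```python
import re

# Hypothetical list of generational suffixes, matched at the end of a name.
SUFFIX_PATTERN = re.compile(r"\b(jr|sr|ii|iii|iv)\.?$", re.IGNORECASE)

def generational_suffix(name: str) -> str:
    """Return the lowercase suffix at the end of the name, or '' if none."""
    match = SUFFIX_PATTERN.search(name.strip())
    return match.group(1).lower() if match else ""

def merge_decision(name_a: str, name_b: str) -> str:
    """Decide what to do with two records that already share a match key."""
    if generational_suffix(name_a) != generational_suffix(name_b):
        return "review"  # possibly two different people: send to a human
    return "merge"

print(merge_decision("John Smith Jr.", "John Smith"))
print(merge_decision("John Smith", "John Smith"))
```

This catches only the cases where the suffix was actually recorded. A father and son both entered as plain "John Smith" still look identical, which is why such records are likely to need manual intervention or an extra distinguishing field.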
Chris A., have you considered employing official expert systems for this task? Remarkably, as you're finding, standardizing addresses so you can iterate through them effectively gets very difficult very fast. At SmartyStreets (where I work), that's our business core: the implementation of certain algorithms which do this task.
This may not be a direct answer to your exact question, but a vital step in developing a fuzzy search query is having good data to begin with. In other words, as Chris W. has shown in his answer, even after a fuzzy query, there's much left to be desired.
So I'd suggest first truly standardizing all the addresses (accounting for address "overloads", where two addresses look totally different but are in fact the same address). For US-based addresses, you could try a list processing service (like CASS-Certified Scrubbing; research for your own choice). A good one will flag duplicates for you, then let you take action. After the addresses have been normalized and flagged, you can much more quickly weed out the exact duplicates based on your business's definition (by family name, etc.). At that point you'd run your fuzzy search against everything except the trickiest addresses, and you'd already have a good idea of what may be a duplicate.
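The two-stage order described here (exact duplicates first, fuzzy search afterwards) might look roughly like the sketch below. It assumes the addresses have already been standardized by some external service. The record fields `family_name` and `address`, the function names, and the 0.85 threshold are all hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever fuzzy comparison you actually choose.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def group_exact(records):
    """Stage 1: group records sharing family name and standardized address."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["family_name"].lower(), rec["address"].lower())
        groups[key].append(rec)
    return groups

def fuzzy_candidates(keys, threshold=0.85):
    """Stage 2: pair up group keys with the same name and similar addresses."""
    pairs = []
    keys = sorted(keys)
    for i, (name_a, addr_a) in enumerate(keys):
        for name_b, addr_b in keys[i + 1:]:
            if name_a != name_b:
                continue
            score = SequenceMatcher(None, addr_a, addr_b).ratio()
            if score >= threshold:
                pairs.append(((name_a, addr_a), (name_b, addr_b)))
    return pairs

records = [
    {"family_name": "Smith", "address": "8 N Oak Street"},
    {"family_name": "Smith", "address": "8 N Oak Street"},
    {"family_name": "Smith", "address": "8 N Oak Stret"},
    {"family_name": "Jones", "address": "99 Elm Road"},
]
groups = group_exact(records)
pairs = fuzzy_candidates(groups.keys())
print(len(groups), pairs)
```

The pairwise comparison is quadratic in the number of groups, so on a large list it would need some form of blocking (for example, comparing only within the same zip prefix) before it is practical.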