使用 Soundex、Jaro Winkler 和编辑距离 (UTL_MATCH) 匹配 Oracle 重复列值
我正在尝试找到一种可靠的方法来匹配数据库中的重复人员记录。这些数据存在一些严重的数据质量问题,我也在尝试修复这些问题,但在我获得批准之前,我只能使用现有的数据。
我可用的表列有:
SURNAME VARCHAR2(43)
FORENAME VARCHAR2(38)
BIRTH_DATE DATE
ADDRESS_LINE1 VARCHAR2(60)
ADDRESS_LINE2 VARCHAR2(60)
ADDRESS_LINE3 VARCHAR2(60)
ADDRESS_LINE4 VARCHAR2(60)
ADDRESS_LINE5 VARCHAR2(60)
POSTCODE VARCHAR2(15)
SOUNDEX
函数对此用途相对有限,但 UTL_MATCH
包似乎使用 Jaro Winker 算法提供了更好的匹配级别。
有没有人实现了一种可靠的方法来匹配此类数据,而不是重新发明轮子?
需要解决的数据质量问题:
- 邮政编码虽然是强制性的,但并不总是完整输入。
- 地址数据的质量相对较差,地址输入的格式不固定(即,有些可能将 line1 设为“Flat 1”,而有些可能将 line1 设为“Flat1, 22 Acacia Ave”)。
- 名字列可以包含首字母、完整名字或有时包含多个名字。
例如,我正在考虑:
连接所有地址字段并将 Jaro Winkler 算法应用于完整地址,并结合对连接在一起的全名进行类似的测试。
可以直接比较出生日期进行匹配,但由于数据量很大,仅进行匹配是不够的。
Oracle 10g R2 企业版。
欢迎任何有用的建议。
I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues which I am also trying to fix but until I have the go-ahead to do so I am stuck with the data I have got.
The table columns available to me are:
SURNAME VARCHAR2(43)
FORENAME VARCHAR2(38)
BIRTH_DATE DATE
ADDRESS_LINE1 VARCHAR2(60)
ADDRESS_LINE2 VARCHAR2(60)
ADDRESS_LINE3 VARCHAR2(60)
ADDRESS_LINE4 VARCHAR2(60)
ADDRESS_LINE5 VARCHAR2(60)
POSTCODE VARCHAR2(15)
The SOUNDEX
function is relatively limited for this use but the UTL_MATCH
package seems to offer a better level of matching using the Jaro Winker algorithm.
Rather than re-inventing the wheel, has anyone implemented a reliable method for matching this type of data?
Data Quality issues to contend with:
- The postcode, though mandatory, isn't always fully entered.
- The address data is relatively poor quality with addresses entered in no fixed format (i.e. some may have line1 as "Flat 1" whereas some may have line1 as "Flat1, 22 Acacia Ave").
- The forename column can contain an initial, a full forename or sometimes more than one forename.
For example I was considering:
Concatenating all address fields and applying the Jaro Winkler algorithm to the full address combined with a similar test of the full name concatenated together.
The birth date can be compared directly for a match but due to the large volume of data just matching upon this isn't enough.
Oracle 10g R2 Enterprise Edition.
Any helpful suggestions welcome.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
没有这样的事情。您最多希望的是一个具有合理怀疑元素的系统。SOUNDEX
的一大优点是它对字符串进行标记。这意味着它给您一些东西可以被索引:当涉及到大量数据时,它非常有价值,但它是旧的和粗糙的,例如 Metaphone 和 Double Metaphone 你应该能够。查找 PL/SQL 实现通过 Google 进行
评分的优点是可以实现一定程度的模糊性,因此您可以找到 name_score >= 90% 的所有行,但最严重的缺点是分数是相对的。您无法对它们进行索引。这
意味着:
根据我的经验,连接标记(名字、姓氏)是一件喜忧参半的事情。它解决了某些问题(例如道路名称是否出现在地址行 1 或地址行 2 中),但会导致其他问题:考虑对 GRAHAM OLIVER 与 OLIVER GRAHAM 进行评分,与对 OLIVER 与 OLIVER、GRAHAM 与 GRAHAM、OLIVER 与 GRAHAM 以及 GRAHAM 与 OLIVER 进行评分。
无论您做什么,您仍然会出现误报和错过命中的情况。没有算法可以防止拼写错误(尽管 Jaro Winkler 在 MARX 与 AMRX 方面做得相当不错)。
Alas there is no such thing. The most you can hope for is a system with a reasonable element of doubt.
The big advantage of SOUNDEX is that it tokenizes the string. This means it gives you something which can be indexed: this is incredibly valuable when it comes to large amounts of data. On the other hand it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implemenations of them via Google.
The advantage of scoring is that they allow for a degree of fuzziness; so you can find all rows
where name_score >= 90%
. The crushing disadvantage is that the scores are relative and so you cannot index them. This sort of comparison kills you with large volumes.What this means is:
In my experience concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes other problems: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.
Whatever you do you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty good with MARX vs AMRX).