使用 Soundex、Jaro Winkler 和编辑距离 (UTL_MATCH) 匹配 Oracle 重复列值

发布于 2024-12-17 06:09:11 字数 872 浏览 6 评论 0原文

我正在尝试找到一种可靠的方法来匹配数据库中的重复人员记录。这些数据存在一些严重的数据质量问题，我也在尝试修复这些问题，但在我获得批准之前，我只能使用现有的数据。

我可用的表列有：

SURNAME       VARCHAR2(43)
FORENAME      VARCHAR2(38)
BIRTH_DATE    DATE
ADDRESS_LINE1 VARCHAR2(60)
ADDRESS_LINE2 VARCHAR2(60)
ADDRESS_LINE3 VARCHAR2(60)
ADDRESS_LINE4 VARCHAR2(60)
ADDRESS_LINE5 VARCHAR2(60)
POSTCODE      VARCHAR2(15)

SOUNDEX 函数对此用途相对有限，但 UTL_MATCH 包似乎使用 Jaro Winker 算法提供了更好的匹配级别。

有没有人实现了一种可靠的方法来匹配此类数据，而不是重新发明轮子？

需要解决的数据质量问题：

邮政编码虽然是强制性的，但并不总是完整输入。
地址数据的质量相对较差，地址输入的格式不固定（即，有些可能将 line1 设为“Flat 1”，而有些可能将 line1 设为“Flat1, 22 Acacia Ave”）。
名字列可以包含首字母、完整名字或有时包含多个名字。

例如，我正在考虑：

连接所有地址字段并将 Jaro Winkler 算法应用于完整地址，并结合对连接在一起的全名进行类似的测试。

可以直接比较出生日期进行匹配，但由于数据量很大，仅进行匹配是不够的。

Oracle 10g R2 企业版。

欢迎任何有用的建议。

原文

I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues which I am also trying to fix but until I have the go-ahead to do so I am stuck with the data I have got.

The table columns available to me are:

SURNAME       VARCHAR2(43)
FORENAME      VARCHAR2(38)
BIRTH_DATE    DATE
ADDRESS_LINE1 VARCHAR2(60)
ADDRESS_LINE2 VARCHAR2(60)
ADDRESS_LINE3 VARCHAR2(60)
ADDRESS_LINE4 VARCHAR2(60)
ADDRESS_LINE5 VARCHAR2(60)
POSTCODE      VARCHAR2(15)

The SOUNDEX function is relatively limited for this use but the UTL_MATCH package seems to offer a better level of matching using the Jaro Winker algorithm.

Rather than re-inventing the wheel, has anyone implemented a reliable method for matching this type of data?

Data Quality issues to contend with:

The postcode, though mandatory, isn't always fully entered.
The address data is relatively poor quality with addresses entered in no fixed format (i.e. some may have line1 as "Flat 1" whereas some may have line1 as "Flat1, 22 Acacia Ave").
The forename column can contain an initial, a full forename or sometimes more than one forename.

For example I was considering:

Concatenating all address fields and applying the Jaro Winkler algorithm to the full address combined with a similar test of the full name concatenated together.

The birth date can be compared directly for a match but due to the large volume of data just matching upon this isn't enough.

Oracle 10g R2 Enterprise Edition.

Any helpful suggestions welcome.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

深爱不及久伴 2024-12-24 06:09:11

“我正在尝试寻找一种可靠的方法来匹配重复的人
数据库中的记录。”

没有这样的事情。您最多希望的是一个具有合理怀疑元素的系统。SOUNDEX

SQL> select n1
       , n2
       , soundex(n1) as sdx_n1
       , soundex(n2) as sdx_n2
       , utl_match.edit_distance_similarity(n1, n2) as ed
       , utl_match.jaro_winkler_similarity(n1, n2) as jw   
from t94
order by n1, n2
/


  2    3    4    5    6    7    8    9  
N1                   N2                   SDX_ SDX_         ED         JW
-------------------- -------------------- ---- ---- ---------- ----------
MARK                 MARKIE               M620 M620         67         93
MARK                 MARKS                M620 M620         80         96
MARK                 MARKUS               M620 M622         67         93
MARKY                MARKIE               M620 M620         67         89
MARSK                MARKS                M620 M620         60         95
MARX                 AMRX                 M620 A562         50         91
MARX                 M4RX                 M620 M620         75         85
MARX                 MARKS                M620 M620         60         84
MARX                 MARSK                M620 M620         60         84
MARX                 MAX                  M620 M200         75         93
MARX                 MRX                  M620 M620         75         92

11 rows selected.

SQL> SQL> SQL>

的一大优点是它对字符串进行标记。这意味着它给您一些东西可以被索引：当涉及到大量数据时，它非常有价值，但它是旧的和粗糙的，例如 Metaphone 和 Double Metaphone 你应该能够。查找 PL/SQL 实现通过 Google 进行

评分的优点是可以实现一定程度的模糊性，因此您可以找到 name_score >= 90% 的所有行，但最严重的缺点是分数是相对的。您无法对它们进行索引。这

意味着：

您需要混合使用多种策略来解决您的问题。
比较 MARX 与 MRX 和 M4RX 的分数：剥离数字名字可以提高命中率。
你无法即时获得大量名字。如果可以的话，使用标记化和预评分。如果您的流失率不高，请使用缓存。如果可以的话，请使用分区。
使用 Oracle Text（或类似的）构建昵称和变体的同义词库。
Oracle 11g 向 Oracle Text 引入了特定的名称搜索功能。了解更多信息。
构建一个表评分的规范名称并将实际数据记录链接到该名称。
使用其他数据值，尤其是可索引的数据值（例如出生日期）来预先过滤大量姓名或增加对建议匹配的置信度。
请注意，其他数据值也有其自身的问题：出生于 2011 年 1 月 31 日的人是十一个月还是八十岁？
请记住，名字很棘手，尤其是当您必须考虑已罗马化的名字时：Moammar Khadaffi（罗马字母表）有四百多种不同的拼写方式 - 甚至谷歌也无法就哪种变体最规范达成一致。

根据我的经验，连接标记（名字、姓氏）是一件喜忧参半的事情。它解决了某些问题（例如道路名称是否出现在地址行 1 或地址行 2 中），但会导致其他问题：考虑对 GRAHAM OLIVER 与 OLIVER GRAHAM 进行评分，与对 OLIVER 与 OLIVER、GRAHAM 与 GRAHAM、OLIVER 与 GRAHAM 以及 GRAHAM 与 OLIVER 进行评分。

无论您做什么，您仍然会出现误报和错过命中的情况。没有算法可以防止拼写错误（尽管 Jaro Winkler 在 MARX 与 AMRX 方面做得相当不错）。

"I am trying to find a reliable method for matching duplicate person
records within the database."

Alas there is no such thing. The most you can hope for is a system with a reasonable element of doubt.

SQL> select n1
       , n2
       , soundex(n1) as sdx_n1
       , soundex(n2) as sdx_n2
       , utl_match.edit_distance_similarity(n1, n2) as ed
       , utl_match.jaro_winkler_similarity(n1, n2) as jw   
from t94
order by n1, n2
/


  2    3    4    5    6    7    8    9  
N1                   N2                   SDX_ SDX_         ED         JW
-------------------- -------------------- ---- ---- ---------- ----------
MARK                 MARKIE               M620 M620         67         93
MARK                 MARKS                M620 M620         80         96
MARK                 MARKUS               M620 M622         67         93
MARKY                MARKIE               M620 M620         67         89
MARSK                MARKS                M620 M620         60         95
MARX                 AMRX                 M620 A562         50         91
MARX                 M4RX                 M620 M620         75         85
MARX                 MARKS                M620 M620         60         84
MARX                 MARSK                M620 M620         60         84
MARX                 MAX                  M620 M200         75         93
MARX                 MRX                  M620 M620         75         92

11 rows selected.

SQL> SQL> SQL>

The big advantage of SOUNDEX is that it tokenizes the string. This means it gives you something which can be indexed: this is incredibly valuable when it comes to large amounts of data. On the other hand it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implemenations of them via Google.

The advantage of scoring is that they allow for a degree of fuzziness; so you can find all rows where name_score >= 90%. The crushing disadvantage is that the scores are relative and so you cannot index them. This sort of comparison kills you with large volumes.

What this means is:

You need a mix of strategies. No single algorithm will solve your problem.
Data cleansing is useful. Compare the scores for MARX vs MRX and M4RX: stripping numbers out of names improves the hit rate.
You cannot score big volumes of names on the fly. Use tokenizing and pre-scoring if you can. Use caching if you don't have a lot of churn. Use partitioning if you can afford it.
Use a Oracle Text (or similar) to build a thesaurus of nicknames and variants.
Oracle 11g introduced specific name search functionality to Oracle Text. Find out more.
Build a table of canonical names for scoring and link actual data records to that.
Use other data values, especially indexable ones like date of birth, to pre-filter large volumes of names or to increase confidence in proposed matches.
Be aware that other data values come with their own problems: is someone born on 31/01/11 eleven months old or eighty years old?
Remember that names are tricky, especially when you have to consider names which have been romanized: there are over four hundred different ways of spelling Moammar Khadaffi (in the roman alphabet) - and not even Google can agree on which variant is the most canonical.

In my experience concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes other problems: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.

Whatever you do you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty good with MARX vs AMRX).

回复收藏 0 原文

~没有更多了~