根据人名匹配记录

发布于 2024-07-24 09:28:44 字数 397 浏览 9 评论 0原文

是否有任何工具或方法可用于在两个不同数据源之间进行人名匹配?

这些系统没有其他通用信息,并且在许多情况下输入的名称都不同。

非精确匹配示例:

King Jr., Martin Luther = King, Martin(不包括后缀)
Erving, Dr. J. = Erving, J.(不包括前缀)
奥巴马,巴拉克·侯赛因 = 奥巴马,巴拉克(不包括中间名)
Pufnstuf, HR = Pufnstuf, Haibane Renmei (比赛缩写)
Tankengine, Thomas = Tankengine, Tom(匹配常见昵称)
Flair, Rick "the Natureboy" = Flair, Natureboy(昵称匹配)

Are there any tools or methods that can be used for matching by a person's name between two different data sources?

The systems have no other common information and the names have been entered differently in many cases.

Examples of non-exact matches:

King Jr., Martin Luther = King, Martin (exclude suffix)
Erving, Dr. J. = Erving, J. (exclude prefix)
Obama, Barak Hussein = Obama, Barak (exclude middle name)
Pufnstuf, H.R. = Pufnstuf, Haibane Renmei (match abbreviations)
Tankengine, Thomas = Tankengine, Tom (match common nicknames)
Flair, Rick "the Natureboy" = Flair, Natureboy (match on nickname)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

懒的傷心 2024-07-31 09:28:44

我必须使用建议的各种技术。 感谢您为我指明了正确的方向。 希望以下内容能够帮助其他人解决此类问题。

删除多余字符

CREATE FUNCTION [dbo].[fn_StripCharacters]
(
    @String NVARCHAR(MAX), 
    @MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SET @MatchExpression =  '%['+@MatchExpression+']%'

    WHILE PatIndex(@MatchExpression, @String) > 0
        SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

    RETURN @String

END

用法:

--remove all non-alphanumeric and non-white space  
dbo.fn_StripCharacters(@Value, , '^a-z^0-9 ')  

将名称拆分为多个部分

CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
RETURNS @ResultList TABLE 
    (
        [ID] VARCHAR(MAX),
        [Val] VARCHAR(MAX)
    )
AS
BEGIN

declare @OuterCursor cursor
declare @ID varchar(max)
declare @Val varchar(max)

set @OuterCursor = cursor fast_forward for (SELECT * FROM @sList) FOR READ ONLY

open @OuterCursor

fetch next from @OuterCursor into @ID, @Val

while (@@FETCH_STATUS=0)
begin

    INSERT INTO @ResultList (ID, Val)   
    select @ID, split.s from dbo.Split(@sep, @Val) as split 
           where len(split.s) > 0

    fetch next from @OuterCursor into @ID, @Val
end

close @OuterCursor
deallocate @OuterCursor 

CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
      SELECT 1, 1, CHARINDEX(@sep, @s)
      UNION ALL
      SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
      FROM Pieces
      WHERE stop > 0
    )
    SELECT pn,
      LTRIM(RTRIM(SUBSTRING(@s, start, 
             CASE WHEN stop > 0 
                  THEN stop-start 
                  ELSE 8000 
             END))) AS s
    FROM Pieces
  )

RETURN

用法:

--create split name list
DECLARE @NameList StringList 

INSERT INTO @NameList (ID, Val)
SELECT id, firstname FROM dbo.[User] u
WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0 

----remove split dups
select u.ID, COUNT(*)
from dbo.import_SplitTable(' ', @NameList) splitList
INNER JOIN dbo.[User] u
ON splitList.id = u.id

常用昵称:

I基于此列表创建了一个表,并使用它来连接通用名称等效项。

用法:

SELECT u.id
, u.FirstName
, u_nickname_maybe.Name AS MaybeNickname
, u.LastName
, c.ID AS ContactID from
FROM dbo.[User] u 
INNER JOIN nickname u_nickname_match
ON u.FirstName = u_nickname_match.Name
INNER JOIN nickname u_nickname_maybe
ON u_nickname_match.relatedid = u_nickname_maybe.id
LEFT OUTER JOIN
(
    SELECT c.id, c.LastName, c.FirstName, 
         c_nickname_maybe.Name AS MaybeFirstName
    FROM dbo.Contact c
    INNER JOIN nickname c_nickname_match
    ON c.FirstName = c_nickname_match.Name
    INNER JOIN nickname c_nickname_maybe
    ON c_nickname_match.relatedid = c_nickname_maybe.id
    WHERE c_nickname_match.Name <> c_nickname_maybe.Name
) as c
ON c.AccountHolderID = ah.ID 
       AND u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
WHERE u_nickname_match.Name <> u_nickname_maybe.Name

语音算法 (Jaro Winkler):

令人惊叹的文章,Beyond SoundEx - MS SQL Server 中的模糊搜索功能,展示如何安装和使用SimMetrics< /a> 库进入 SQL Server。 该库可让您找到字符串之间的相对相似性,并包含多种算法。 我最终主要使用 Jaro Winkler 来匹配名称。

用法:

SELECT
u.id AS UserID
,c.id AS ContactID
,u.FirstName
,c.FirstName 
,u.LastName
,c.LastName
,maxResult.CombinedScores
 from
(
    SELECT
      u.ID
    , 
        max(
            dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))  
            * dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
        ) AS CombinedScores
    FROM dbo.[User] u, dbo.[Contact] c
    WHERE u.ContactID IS NULL
    GROUP BY u.id
) AS maxResult
INNER JOIN dbo.[User] u
ON maxResult.id  = u.id
INNER JOIN dbo.[Contact] c
ON maxResult.CombinedScores = 
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName)) 
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))

I had to use a variety of techniques suggested. Thanks pointing me in the right direction(s). Hopefully, the following will help someone else out with this type of problem to solve.

Removing excess characters

CREATE FUNCTION [dbo].[fn_StripCharacters]
(
    @String NVARCHAR(MAX), 
    @MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SET @MatchExpression =  '%['+@MatchExpression+']%'

    WHILE PatIndex(@MatchExpression, @String) > 0
        SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

    RETURN @String

END

Usage:

--remove all non-alphanumeric and non-white space  
dbo.fn_StripCharacters(@Value, , '^a-z^0-9 ')  

Split name into parts

CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
RETURNS @ResultList TABLE 
    (
        [ID] VARCHAR(MAX),
        [Val] VARCHAR(MAX)
    )
AS
BEGIN

declare @OuterCursor cursor
declare @ID varchar(max)
declare @Val varchar(max)

set @OuterCursor = cursor fast_forward for (SELECT * FROM @sList) FOR READ ONLY

open @OuterCursor

fetch next from @OuterCursor into @ID, @Val

while (@@FETCH_STATUS=0)
begin

    INSERT INTO @ResultList (ID, Val)   
    select @ID, split.s from dbo.Split(@sep, @Val) as split 
           where len(split.s) > 0

    fetch next from @OuterCursor into @ID, @Val
end

close @OuterCursor
deallocate @OuterCursor 

CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
      SELECT 1, 1, CHARINDEX(@sep, @s)
      UNION ALL
      SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
      FROM Pieces
      WHERE stop > 0
    )
    SELECT pn,
      LTRIM(RTRIM(SUBSTRING(@s, start, 
             CASE WHEN stop > 0 
                  THEN stop-start 
                  ELSE 8000 
             END))) AS s
    FROM Pieces
  )

RETURN

Usage:

--create split name list
DECLARE @NameList StringList 

INSERT INTO @NameList (ID, Val)
SELECT id, firstname FROM dbo.[User] u
WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0 

----remove split dups
select u.ID, COUNT(*)
from dbo.import_SplitTable(' ', @NameList) splitList
INNER JOIN dbo.[User] u
ON splitList.id = u.id

Common nicknames:

I created a table based on this list and used it to join on common name equivalents.

Usage:

SELECT u.id
, u.FirstName
, u_nickname_maybe.Name AS MaybeNickname
, u.LastName
, c.ID AS ContactID from
FROM dbo.[User] u 
INNER JOIN nickname u_nickname_match
ON u.FirstName = u_nickname_match.Name
INNER JOIN nickname u_nickname_maybe
ON u_nickname_match.relatedid = u_nickname_maybe.id
LEFT OUTER JOIN
(
    SELECT c.id, c.LastName, c.FirstName, 
         c_nickname_maybe.Name AS MaybeFirstName
    FROM dbo.Contact c
    INNER JOIN nickname c_nickname_match
    ON c.FirstName = c_nickname_match.Name
    INNER JOIN nickname c_nickname_maybe
    ON c_nickname_match.relatedid = c_nickname_maybe.id
    WHERE c_nickname_match.Name <> c_nickname_maybe.Name
) as c
ON c.AccountHolderID = ah.ID 
       AND u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
WHERE u_nickname_match.Name <> u_nickname_maybe.Name

Phonetic algorithms (Jaro Winkler):

The amazing article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms. I ended up mostly using Jaro Winkler to match the names.

Usage:

SELECT
u.id AS UserID
,c.id AS ContactID
,u.FirstName
,c.FirstName 
,u.LastName
,c.LastName
,maxResult.CombinedScores
 from
(
    SELECT
      u.ID
    , 
        max(
            dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))  
            * dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
        ) AS CombinedScores
    FROM dbo.[User] u, dbo.[Contact] c
    WHERE u.ContactID IS NULL
    GROUP BY u.id
) AS maxResult
INNER JOIN dbo.[User] u
ON maxResult.id  = u.id
INNER JOIN dbo.[Contact] c
ON maxResult.CombinedScores = 
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName)) 
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
波浪屿的海角声 2024-07-31 09:28:44

这是一个非常复杂的问题 - 有很多昂贵的工具可以正确地解决它。
如果您想知道为什么不能以汤姆、迪克或哈利(或比尔)的身份办理航班登机手续
或者为什么禁飞名单和恐怖分子观察名单不起作用 - 考虑一下:

(1) 穆阿迈尔·卡扎菲
(2) 穆阿迈尔·卡扎菲
(3) 穆阿迈尔·卡扎菲
(4) 穆阿迈尔·卡扎菲
(5) 穆阿迈尔·卡扎菲
(6) 穆阿迈尔·卡扎菲
(7) 穆阿迈尔·卡扎菲
(8) 莫阿梅尔·埃尔·卡扎菲
(9) 穆阿马尔·卡扎菲
(10) 穆阿迈尔·卡萨菲
(11) 穆阿迈尔·卡塔菲
(12) 穆阿迈尔·卡扎菲
(13) 莫阿马尔·卡扎菲
(14) 穆阿迈尔·卡扎菲
(15) 穆阿迈尔·卡扎菲
(16) 穆阿迈尔·卡扎菲
(17) 穆阿马尔·卡扎菲
(18) 穆阿迈尔·卡扎菲
(19)穆阿迈尔·卡扎菲
(20) 穆阿迈尔·卡扎菲
(21) 穆阿马尔·卡扎菲
(22) 穆阿迈尔·卡扎菲
(23)穆阿迈尔·卡扎菲
(24)穆阿迈尔·卡扎菲
(25) 穆阿迈尔·卡扎菲
(26) 穆阿迈尔·夸塔菲
(27)穆阿迈尔·格达菲
(28) 穆阿马尔·卡扎菲
(29) 穆阿迈尔·卡扎菲
(30) 莫阿迈尔·库扎菲
(31) 穆阿迈尔·卡扎菲
(32) Mulazim Awwal Mu'ammar Muhammad Abu Minyar al-Qadhafi

这只是官方拼写 - 不包括拼写错误!

It's a very complex problem - and there are a lot of expensive tools to do it correctly.
If you ever wondered why you can't check in on a flight as Tom, Dick or Harry (or Bill)
Or why no-fly lists and terrorists watch lists don't work -consider:

(1) Muammar Qaddafi
(2) Mo'ammar Gadhafi
(3) Muammar Kaddafi
(4) Muammar Qadhafi
(5) Moammar El Kadhafi
(6) Muammar Gadafi
(7) Mu'ammar al-Qadafi
(8) Moamer El Kazzafi
(9) Moamar al-Gaddafi
(10) Mu'ammar Al Qathafi
(11) Muammar Al Qathafi
(12) Mo'ammar el-Gadhafi
(13) Moamar El Kadhafi
(14) Muammar al-Qadhafi
(15) Mu'ammar al-Qadhdhafi
(16) Mu'ammar Qadafi
(17) Moamar Gaddafi
(18) Mu'ammar Qadhdhafi
(19) Muammar Khaddafi
(20) Muammar al-Khaddafi
(21) Mu'amar al-Kadafi
(22) Muammar Ghaddafy
(23) Muammar Ghadafi
(24) Muammar Ghaddafi
(25) Muamar Kaddafi
(26) Muammar Quathafi
(27) Muammar Gheddafi
(28) Muamar Al-Kaddafi
(29) Moammar Khadafy
(30) Moammar Qudhafi
(31) Mu'ammar al-Qaddafi
(32) Mulazim Awwal Mu'ammar Muhammad Abu Minyar al-Qadhafi

And that's just official spellings - it doesn't include typos!

你不是我要的菜∠ 2024-07-31 09:28:44

对于这种情况,我经常使用 soundex 类型的算法。 尝试使用 Double Metaphone 算法。 如果您使用 SQL Server,则有一些源代码可用于创建用户定义函数。

因为您已经转置了数据,所以您可能需要对其进行一些规范化,例如,删除所有逗号并按第一个字母对结果单词进行排序。 这会给你一些更好的匹配潜力。 如果中间加上了文字,那就有点难了。 您可以考虑将名称分解为单词,使用 Double Metaphone 检查另一列中是否有匹配的单词,然后收集匹配与单词的总数,这将告诉您两列的接近程度。

在进行比较之前,我还会过滤掉 Dr.、Mr.、Ms.、Mrs. 等常用词。

I often employ soundex-type algorithms for this type of situation. Try the Double Metaphone algorithm. If you are using SQL Server, there is some source code to create a user defined function.

Because you have transposed data, you may want to normalize it a bit, e.g., remove all commas and the sort resulting words by first letter. That will give you some better matching potential. In the case where words have been added in the middle, it gets a bit tougher. You could consider breaking a name into words, checking with Double Metaphone whether there is a word in the other column that matches, and then collect the overall count of matches vs. words, which will tell you how close the two columns are.

I would also filter out common words like Dr., Mr., Ms., Mrs., etc., before doing the comparisons.

相思故 2024-07-31 09:28:44

以下是一些选项:

语音算法...

Soundex (http://en.wikipedia.org/wiki/ Soundex)

双变音位 (http://en.wikipedia.org/wiki/Double_Metaphone)

编辑距离 (http://en.wikipedia.org/wiki/Levenshtein_distance)

Jaro-Winkler 距离 (http://en.wikipedia.org/wiki/Jaro-Winkler_distance)

您可以尝试的另一件事是将每个单词(按空格或连字符分割)与另一个名称中的每个单词进行比较,看看有多少单词匹配。 也许将其与语音算法结合起来以获得更模糊的匹配。 对于庞大的数据集,您可能需要对每个单词建立索引并将其与名称 id 进行匹配。 对于缩写匹配,您可以仅比较第一个字母。 当您比较单词时,您可能也想忽略除字母之外的任何内容。

许多语音算法都有开源/在线示例。

Here are some options:

Phonetic algorithms...

Soundex (http://en.wikipedia.org/wiki/Soundex)

Double Metaphone (http://en.wikipedia.org/wiki/Double_Metaphone)

Edit Distance (http://en.wikipedia.org/wiki/Levenshtein_distance)

Jaro-Winkler Distance (http://en.wikipedia.org/wiki/Jaro-Winkler_distance)

Another thing you could try would be to compare each word (splitting on space and maybe hyphen) with each word in the other name and see how many words match up. Maybe combine this with phonetic algorithms for more fuzzy matching. For a huge data set, you would want to index each word and match it with a name id. For abbreviation matching you could compare just the first letter. You probably want to ignore anything but letters when you compare words as well.

Many of the phonetic algorithms have open source / samples online.

幸福还没到 2024-07-31 09:28:44

Metaphone 3 是第三代 Metaphone 算法。
它提高了拼音编码的准确性,比 Double 的 89%
变音位 98%,针对最常见的数据库进行测试
北方人熟悉的英语单词、名字和非英语单词
美国。 这产生了极其可靠的语音编码
美式发音。

Metaphone 3 由 Lawrence Philips 设计和开发,他
设计并开发了原始 Metaphone 和 Double Metaphone
算法。

Metaphone 3 is the third generation of the Metaphone algorithm.
It increases the accuracy of phonetic encoding from the 89% of Double
Metaphone to 98%, as tested against a database of the most common
English words, and names and non-English words familiar in North
America. This produces an extremely reliable phonetic encoding for
American pronunciations.

Metaphone 3 was designed and developed by Lawrence Philips, who
designed and developed the original Metaphone and Double Metaphone
algorithms.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文