Algorithm for comparing people's names to detect whether they are the same

Posted on 2024-09-01 22:46:34

I am working on an address book synchronization algorithm. I would like to reuse existing code if there is any, but I haven't found any yet.

Does anyone know of an algorithm that tells me, as a number (a float or a percentage), how similar two names are? Levenshtein distance does not work well here, because the names in our address books tend to match at the beginning of each name part.

John Smith should match
Smith Jon, Jonathan Smith, Johnny Smith

Comments (4)

心房敞 2024-09-08 22:46:34

Have a look at the Jaro-Winkler algorithm too. It works well for names.
http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

If you have first-name/last-name ordering issues, you could simply sort the name parts so that Smith John is stored as John Smith.
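
To make that concrete, here is a minimal sketch of Jaro-Winkler combined with the sorting trick, in plain Python. The function names, the default prefix scale of 0.1, the 4-character prefix cap, and the choice to lowercase and split on whitespace and commas are my assumptions, not something taken from a particular library.

```python
import re


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def name_similarity(name1, name2):
    """Compare two names after lowercasing and sorting their parts."""
    def normalize(name):
        parts = [p for p in re.split(r"[\s,]+", name.lower()) if p]
        return " ".join(sorted(parts))
    return jaro_winkler(normalize(name1), normalize(name2))
```

With this, name_similarity("John Smith", "Smith John") is exactly 1.0 because both normalize to the same string, and "Smith Jon" scores far higher against "John Smith" than an unrelated name does. Where to put the cutoff for "same person" is something you would have to tune on your own address books.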

如梦初醒的夏天 2024-09-08 22:46:34

You should be looking at string comparison algorithms such as Levenshtein or Smith-Waterman. Here is a great library to get you started
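
For reference, a minimal sketch of Levenshtein distance turned into a 0-to-1 similarity score, in plain Python. This is not the library mentioned above; the function names and the normalization by the longer string's length are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Note that on whole strings this penalizes reordered parts heavily: "smith john" scores much lower against "john smith" than "jon smith" does, so it probably makes sense to compare name parts individually or to sort them first.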

一页 2024-09-08 22:46:34

To really handle those kinds of cases you may need an alias table (for example, one that maps Jon and Johnny to John), but I think Soundex will get you close.

http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Soundex.html
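
To show how the two ideas fit together, here is a rough sketch in plain Python rather than the Apache Commons class linked above. The soundex function follows the standard American Soundex rules as I understand them; the ALIASES table and the same_given_name helper are hypothetical and exist only for illustration.

```python
# Consonant groups used by American Soundex; vowels, H, W and Y get no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Four-character Soundex code, e.g. a letter followed by three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


# Hypothetical alias table: nickname or variant -> canonical given name.
ALIASES = {"jon": "john", "johnny": "john", "jonathan": "john"}


def same_given_name(a, b):
    """True if the names share a canonical alias or sound alike."""
    canon_a = ALIASES.get(a.lower(), a.lower())
    canon_b = ALIASES.get(b.lower(), b.lower())
    return canon_a == canon_b or soundex(canon_a) == soundex(canon_b)
```

In this sketch John, Jon and Johnny all produce the same Soundex code, but Jonathan does not, which is exactly the kind of case where the alias table is needed.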

长途伴 2024-09-08 22:46:34

For names, I came up with an algorithm similar to metaphone.

You also need some logic to break up the string into surname, given names, title etc. It can get complicated.

There are edge cases. If someone has the title "Professor", you don't want that interpreted as a first name. And if the name starts with "Lord", that could be either their first name (plenty of people are called Lord) or their title. And so on. It's best if the name is already in a standard form where you know which parts are the surname, the given names, and the title.

I've written some PHP code to do this: see name (see similarityto() function), textfuzzy, probability.
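
As a rough illustration of the splitting step only, here is a small Python sketch. It is not the PHP code referred to above. The title lists, the ParsedName type, and the rule that "Lord" counts as a title only when at least two more words follow are all assumptions chosen for the example, and it assumes the surname comes last.

```python
from typing import NamedTuple, Tuple

# Hypothetical title lists; a real address book would need a longer set.
UNAMBIGUOUS_TITLES = {"mr", "mrs", "ms", "dr", "prof", "professor"}
AMBIGUOUS_TITLES = {"lord"}  # may be a title or a given name


class ParsedName(NamedTuple):
    title: str
    given_names: Tuple[str, ...]
    surname: str


def parse_name(full_name):
    """Split 'Title Given ... Surname' into parts (surname assumed last)."""
    tokens = full_name.replace(".", " ").split()
    title = ""
    if tokens:
        first = tokens[0].lower()
        if first in UNAMBIGUOUS_TITLES and len(tokens) > 1:
            title = tokens.pop(0)
        elif first in AMBIGUOUS_TITLES and len(tokens) > 2:
            # Arbitrary heuristic: "Lord John Smith" -> title,
            # but "Lord Smith" -> given name "Lord".
            title = tokens.pop(0)
    surname = tokens[-1] if tokens else ""
    return ParsedName(title, tuple(tokens[:-1]), surname)
```

Once the parts are separated, surnames and given names can be compared independently, and the title can be ignored or used only as a tie-breaker.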
