Algorithm for comparing people's names to detect matches
I am working on an address book synchronization algorithm. I would like to reuse existing code if there is any, but I haven't found any yet.
Does anyone know of an algorithm that tells me, as a number (float or percentage), how similar two names are? Levenshtein distance does not work well here, because names in our address books are matched on the beginning of each name part.
John Smith
should match Smith Jon, Jonathan Smith, Johnny Smith
Comments (4)
Have a look at the Jaro-Winkler algorithm too. It works well for names.
http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
If first and last names can appear in either order, you could sort the name parts before comparing, so that Smith John is stored as John Smith.
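A minimal sketch of that idea in Python, assuming a hand-written Jaro-Winkler with the usual defaults (prefix scale 0.1, prefix capped at 4 characters). The helper names `jaro`, `jaro_winkler` and `normalize` are made up for this example and do not come from any library:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def normalize(name: str) -> str:
    """Lowercase and sort the name parts so word order no longer matters."""
    return " ".join(sorted(name.lower().split()))


if __name__ == "__main__":
    target = normalize("John Smith")
    for candidate in ("Smith Jon", "Jonathan Smith", "Johnny Smith"):
        print(candidate, round(jaro_winkler(target, normalize(candidate)), 3))
```

Sorting alphabetically is only one way to get a canonical order. If you already know which part is the surname, comparing surname to surname and given name to given name is likely more reliable.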
You should be looking at string comparison algorithms such as Levenshtein or Smith-Waterman. Here is a great library to get you started
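For reference, here is a rough sketch of how Levenshtein distance can be turned into the kind of 0 to 1 score the question asks for. It is plain Python, not the library mentioned above, and the function names are invented for this example:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance by the longer length to get a score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(similarity("jon", "john"))       # one insertion over 4 characters
    print(similarity("john", "jonathan"))
```

Comparing each name part separately, rather than the whole string, keeps a long surname from hiding a difference in the first name.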
To really get those kinds of cases you may need an aliases table, but I think Soundex will get you close.
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Soundex.html
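To show what Soundex gives you and where an aliases table comes in, here is a small Python sketch of American Soundex. It is a hand-written approximation for illustration, not the Apache Commons Codec class linked above, and the `ALIASES` table is a made-up example:

```python
# Letter groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex code, or "" if the name has no ASCII letters."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return "".join(result)[:4].ljust(4, "0")


# Hypothetical alias table for pairs that Soundex does not bring together.
ALIASES = {"jonathan": "john"}


def name_key(given_name: str) -> str:
    """Map known aliases to a canonical name, then take its Soundex code."""
    lowered = given_name.lower()
    return soundex(ALIASES.get(lowered, lowered))


if __name__ == "__main__":
    for n in ("John", "Jon", "Johnny", "Jonathan"):
        print(n, soundex(n), name_key(n))
```

Jon, John and Johnny end up with the same code, while Jonathan does not, which is the kind of case the aliases table is meant to cover.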
For names, I came up with an algorithm similar to metaphone.
You also need some logic to break up the string into surname, given names, title, etc. It can get complicated.
There are edge cases. If someone has the title "Professor" you don't want that interpreted as a first name. And if they have "Lord" at the start, that could be either their first name (plenty of people are called Lord) or their title. And so on. It's best if you already have the name in a standard form where you know which parts are the surname, the given names and the title.
I've written some PHP code to do this: see name (see similarityto() function), textfuzzy, probability.
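This is not the PHP code mentioned above, just a rough Python sketch of the splitting step to show where the ambiguity comes from. The `TITLES` set, the `ParsedName` type and the "at least three tokens" rule are all assumptions chosen for illustration:

```python
from dataclasses import dataclass

# Hypothetical, deliberately short list of titles.
TITLES = {"mr", "mrs", "ms", "dr", "prof", "professor", "sir", "lord", "lady"}


@dataclass
class ParsedName:
    title: str
    given: list
    surname: str


def parse_name(raw: str) -> ParsedName:
    """Split "Title Given... Surname" into its parts with a crude heuristic."""
    tokens = raw.replace(",", " ").split()
    title = ""
    # Only treat the first token as a title if a given name and a surname
    # still follow it. "Lord Smith" therefore keeps "Lord" as a given name,
    # which is a guess and will be wrong for some people.
    if len(tokens) >= 3 and tokens[0].rstrip(".").lower() in TITLES:
        title = tokens.pop(0)
    surname = tokens.pop() if tokens else ""
    return ParsedName(title=title, given=tokens, surname=surname)


if __name__ == "__main__":
    print(parse_name("Professor John Smith"))
    print(parse_name("Lord Smith"))
```

No rule like this can settle the "Lord" case from the string alone, which is why having the name already stored in separate fields is the safer starting point.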