Variations in the spelling of first names
As part of a contact management system I have a large database of names. People frequently edit it, and as a result we run into the same person existing under different forms (John Smith and Jonathan Smith). I looked into word similarity, but it's easy to think of name variations that are not similar at all (Richard vs. Dick). Is there a list of common English first name variations that I could use to detect and correct such errors?
Comments (2)
I would crawl all Wikipedia pages on people's names (a dump of Wikipedia data is available), e.g., http://en.wikipedia.org/wiki/Teresa (from http://en.wikipedia.org/wiki/Category:English_given_names), and create an index that you can use to suggest correct forms to people (you would rank the suggestions by the number of first name variants in your database). Unfortunately, I do not know of an existing database like this.
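The index idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the variant groups are a tiny hand-written sample rather than data actually extracted from Wikipedia, the function names `build_index` and `suggest` are hypothetical, and "ranking" is interpreted here as ordering candidate forms by how often each already appears in your own database.

```python
from collections import Counter, defaultdict

# Hypothetical sample of variant groups (canonical form -> known variants).
# In practice these would be extracted from the given-name pages.
VARIANT_GROUPS = {
    "Teresa": {"Theresa", "Tess", "Terry"},
    "Terence": {"Terry", "Terrence"},
}


def build_index(groups):
    """Map every lowercased name (canonical or variant) to its canonical forms."""
    index = defaultdict(set)
    for canonical, variants in groups.items():
        index[canonical.lower()].add(canonical)
        for variant in variants:
            index[variant.lower()].add(canonical)
    return index


def suggest(name, index, db_first_names):
    """Return candidate canonical forms for `name`, most frequent in the database first."""
    counts = Counter(n.lower() for n in db_first_names)
    candidates = index.get(name.lower(), set())
    # Sort by descending database frequency, then alphabetically for stable output.
    return sorted(candidates, key=lambda c: (-counts[c.lower()], c))


index = build_index(VARIANT_GROUPS)
print(suggest("Terry", index, ["Teresa", "Teresa", "Terence"]))
```

Note that an ambiguous variant such as "Terry" yields more than one candidate, so the result is better treated as a suggestion for a human to confirm than as an automatic correction.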
This thread points to a list of nickname/first name mappings from the census:
http://deron.meranda.us/data/nicknames.txt
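Once a nickname mapping is loaded, flagging likely duplicates might look like the minimal sketch below. The `NICKNAME_TO_FORMAL` dictionary is a hand-written, hypothetical excerpt and does not reflect the actual layout or contents of the linked file; the helper names are also illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical excerpt of a nickname -> formal name mapping (all lowercase).
# The real list would be loaded from a file; its format is not assumed here.
NICKNAME_TO_FORMAL = {
    "dick": "richard",
    "rick": "richard",
    "bill": "william",
}


def canonical_first_name(first):
    """Return the formal form of a first name, or the name itself if unknown."""
    key = first.strip().lower()
    return NICKNAME_TO_FORMAL.get(key, key)


def find_possible_duplicates(contacts):
    """Group (first, last) pairs that share a canonical first name and a last name."""
    groups = defaultdict(list)
    for first, last in contacts:
        key = (canonical_first_name(first), last.strip().lower())
        groups[key].append((first, last))
    # Only groups with more than one entry are duplicate candidates.
    return [group for group in groups.values() if len(group) > 1]


contacts = [("Richard", "Smith"), ("Dick", "Smith"), ("Bill", "Jones")]
print(find_possible_duplicates(contacts))
```

This sketch maps each nickname to a single formal name, which is a simplification: some nicknames belong to several formal names, so the groups it returns are candidates for review rather than confirmed duplicates.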