How to intelligently parse a last name
Assuming the Western naming convention of FirstName MiddleName(s) LastName, what would be the best way to correctly parse the last name out of a full name?
For example:
John Smith --> 'Smith'
John Maxwell Smith --> 'Smith'
John Smith Jr --> 'Smith Jr'
John van Damme --> 'van Damme'
John Smith, IV --> 'Smith, IV'
John Mark Del La Hoya --> 'Del La Hoya'
...and the countless other permutations from this.
Comments (3)
Probably the best answer here is not to try. Names are individual and idiosyncratic and, even limiting yourself to the Western tradition, you can never be sure that you'll have thought of all the edge cases. A friend of mine legally changed his name to be a single word, and he's had a hell of a time dealing with various institutions whose procedures can't deal with this. You're in the unique position of being the one creating the software that implements a procedure, so you have an opportunity to design something that isn't going to annoy the crap out of people with unconventional names. Think about why you need to parse out the last name to begin with, and see if there's something else you could do.
That being said, as a purely technical matter, the best way would probably be to trim off specifically the strings " Jr", ", Jr", ", Jr.", "III", ", III", etc. from the end of the string containing the name, and then take everything from the last space in the string to the (new, after having removed Jr, etc.) end. This wouldn't get, say, "Del La Hoya" from your example, but you can't even really count on a human to get that. I'm making an educated guess that John Mark Del La Hoya's last name is "Del La Hoya" and not "Mark Del La Hoya" because I'm a native English speaker and I have some intuition about what Spanish last names look like. If the name were, say, "Gauthip Yeidze Ka Illunyepsi", I would have absolutely no idea whether to count that Ka as part of the last name or not, because I have no idea what language that's from.
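The trim-then-split heuristic described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the `SUFFIXES` list and the function name `naive_last_name` are made up for this example, and the suffix list is nowhere near exhaustive.

```python
import re

# Hypothetical, deliberately incomplete list of generational suffixes.
SUFFIXES = ("jr", "sr", "ii", "iii", "iv")

# Matches an optional comma, whitespace, a known suffix, and an optional
# trailing period, anchored at the end of the string.
_SUFFIX_RE = re.compile(r",?\s+(?:%s)\.?$" % "|".join(SUFFIXES), re.IGNORECASE)


def naive_last_name(full_name: str) -> str:
    """Strip a trailing suffix, then return the text after the last space."""
    name = _SUFFIX_RE.sub("", full_name.strip()).strip()
    # rsplit returns the whole string when there is no space (single-word names).
    return name.rsplit(" ", 1)[-1]


if __name__ == "__main__":
    for example in ["John Smith", "John Smith Jr", "John Smith, IV",
                    "John van Damme", "John Mark Del La Hoya"]:
        print(example, "->", naive_last_name(example))
```

Note that this drops the suffix rather than keeping it (so "John Smith Jr" yields "Smith", not "Smith Jr"), and it returns only "Damme" and "Hoya" for the multi-word surnames, which is exactly the limitation acknowledged above.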
I came across a library called "nameparser" at https://pypi.python.org/pypi/nameparser. It handles four of the six cases above.
I'm seconding Tnekutippa here, but you should check out named entity recognition. It might help automate some of the process. This is, however, as noted, quite difficult. I'm not quite sure whether the Stanford NER can extract first and last names out of the box, but a machine learning approach could prove very useful for this task. The Stanford NER could be a nice starting point, or you could try to make your own classifiers and training corpora.