How can I guess a person's nationality from their surname?
What approach can I use to predict a person's nationality from their surname?
I have a huge list of texts along with the surnames of their authors. I would like to identify which texts were written by Latin-language speakers and which were written by native English speakers, in order to study whether certain writing-style patterns differ between the two groups.
I have searched Google and PubMed for a database of surnames, but I could not find one that is freely accessible. Another approach is to use regular expressions, for example ".*ez" to identify some Hispanic surnames such as 'rodriguez', but that doesn't get me very far.
Do you have any suggestions? Since I will manually revise all the associations after making the prediction, I don't need great accuracy, but any help or ideas are welcome.
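For reference, the regex idea mentioned above can be sketched roughly as follows. This is only a minimal sketch: the suffix tuple contains just the "ez" ending from the question, and the names `LATIN_SUFFIXES`, `build_pattern`, `guess_group` and the two labels are made up for illustration. Anything it flags is a candidate for manual review, not a prediction of native language.

```python
import re

# Assumption: only "ez" comes from the question; extend this tuple by hand.
LATIN_SUFFIXES = ("ez",)

def build_pattern(suffixes):
    """Compile one case-insensitive regex matching any of the given suffixes."""
    alternatives = "|".join(re.escape(s) for s in suffixes)
    return re.compile(rf".*(?:{alternatives})$", re.IGNORECASE)

def guess_group(surname, pattern=build_pattern(LATIN_SUFFIXES)):
    """Return a provisional label that is meant to be checked by hand."""
    if pattern.fullmatch(surname.strip()):
        return "latin_candidate"
    return "unreviewed"

if __name__ == "__main__":
    # Hypothetical sample input, not real data from the corpus.
    for name in ["Rodriguez", "Smith", "Fernandez", "O'Brien"]:
        print(name, guess_group(name))
```

Keeping the suffixes in one tuple makes it easy to add endings as they turn up during the manual pass, without touching the matching logic.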
Comments (4)
I don't think you can do this with any degree of reliability. Someone called Rodriguez may well have a name of Spanish origin, but could have been born and brought up anywhere. They could be second-generation British, never have had Spanish spoken around them, and so fall into the category of native English speaker.
If they are actual authors, then maybe you can spider Amazon and check their 'Author information' details?
I don't think you can guess. For example, take Irish last names: there are an estimated 80,000,000 people with Irish heritage, but only 4.5 million of them (roughly 6%) live in Ireland or went through Irish education.
There is no meaningful way to do this. There is no reason why people with Hispanic names cannot be native English speakers.
If you are going to revise it anyway, why not use the data you have?
Assuming you intend to do a programmatic comparison of the texts, you have to categorize the texts manually. Incorrect guesses would likely lead you to build a broken algorithm for textual analysis. This will be especially problematic with machine learning, such as artificial neural networks.