I have a list of names, some of which are fake, and I need to use NLP and Python 3.1 to keep the real names and throw out the fakes

Posted 2024-08-24 05:08:15

I have no clue where to start on this. I've never done any NLP, and I've only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com and I have to gather all of the public profiles. Some of them have very fake names, like 'aaaaaa k dudujjek', and I've been told I can use NLP to find the real names. Where would I even start?

Comments (3)

逆流佳人身旁 2024-08-31 05:08:15

This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.

How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.

Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.

When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.
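
Here is a minimal sketch of that list-based filter, with the precision vs. recall trade-off exposed as a single threshold. The file names and the `min_hits` knob are assumptions for illustration, not something prescribed by the answer:

```python
# A minimal sketch of list-based name filtering, assuming you have
# plain-text files with one valid name per line (e.g. compiled from
# Wikipedia's given-name and surname lists). File names are hypothetical.

def load_names(path):
    """Read one lowercased name per line into a set."""
    with open(path, encoding="utf-8") as f:
        return set(line.strip().lower() for line in f if line.strip())

given_names = load_names("given_names.txt")   # hypothetical file
surnames = load_names("surnames.txt")         # hypothetical file

def looks_real(full_name, min_hits=1):
    """Count how many tokens of the name appear in the valid-name lists.
    Raising min_hits is the precision/recall knob: you reject more
    fakes, but you also throw away more real (merely unlisted) names."""
    tokens = full_name.lower().split()
    hits = sum(1 for t in tokens if t in given_names or t in surnames)
    return hits >= min_hits

# 'aaaaaa k dudujjek' scores 0 hits and is rejected; a name like
# 'John K Smith' scores 2 (assuming 'john' and 'smith' are listed)
# and passes even with min_hits=2.
```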

鸩远一方 2024-08-31 05:08:15

Several possibilities here, but the most obvious seems to be HMMs, i.e. Hidden Markov Models. NLTK includes [at least] one module for HMMs, although I must admit I have never used it.

Another possible snag is that, AFAIK, NLTK is not yet ported to Python 3.0.

That said, and while I'm quite keen on using NLP techniques where applicable, I think that a process using several paradigms, including some NLP tricks, may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer a more reliable and more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on less obvious cases.
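
As a rough stand-in for the HMM idea that also sidesteps the NLTK porting issue, here is a plain-Python character-bigram scorer trained on known-real names. This is a simple Markov chain, not a true HMM, and the toy corpus and thresholding are assumptions for the sketch:

```python
# A character-bigram model trained on known-real names, flagging names
# whose letter transitions are improbable. Plain Python only, so it runs
# on Python 3.1 without NLTK.
import math
from collections import defaultdict

def train_bigrams(real_names):
    """Estimate log-probabilities of character bigrams from real names."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in real_names:
        padded = "^" + name.lower() + "$"   # word-boundary markers
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    model = {}
    for a, following in counts.items():
        total = float(sum(following.values()))
        model[a] = dict((b, math.log(c / total))
                        for b, c in following.items())
    return model

def avg_log_likelihood(name, model, floor=math.log(1e-6)):
    """Average per-bigram log-likelihood; very low means gibberish-like.
    Unseen bigrams get a fixed floor penalty (crude smoothing)."""
    padded = "^" + name.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    scores = [model.get(a, {}).get(b, floor) for a, b in pairs]
    return sum(scores) / len(scores)

# Train on a real-name corpus (toy example here), then tune a rejection
# threshold on a hand-labelled sample: strings like 'aaaaaa' or
# 'dudujjek' contain bigrams rarely seen in real names and score low.
model = train_bigrams(["john", "mary", "smith", "garcia"])
print(avg_log_likelihood("aaaaaa", model))
```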

↘紸啶 2024-08-31 05:08:15

I am afraid this problem is not solvable if your list is even only minimally 'open': if the names are, e.g., customers from a small, traditionally-acting population, you might end up with a few hundred names for thousands of people. But generally you can hardly predict what is a real name and what is not, however unusual an Arabic, Chinese, or Bantu name may look in a sample of, say, South English rural neighborhood names. I mean, 'Ng' is a common Cantonese surname, and 'O' is common in Korea, so assumptions may fail. There is a place in Austria called 'Fucking', so even looking out for four-letter words is no guarantee of success.

What you could do is work through a sufficiently big sample of such names and sort them out manually. Then, use all kinds of text-processing tools and collect metrics. Maybe you can derive a certain likelihood of a name being recognized as fake; maybe it will not be viable. You will never get beyond likelihoods here, though.
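
A hedged sketch of that workflow: compute a few cheap metrics per name and inspect how well each one separates the manually sorted real and fake piles. The features and the tiny sample here are illustrative assumptions, not recommendations:

```python
# Collect simple text metrics over a hand-labelled sample of names.
# True = judged real by a human; in practice this would be the big
# sample you sorted manually.

def longest_repeat(s):
    """Length of the longest run of the same character."""
    best = cur = 1 if s else 0
    for a, b in zip(s, s[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

def features(name):
    """A few cheap, illustrative features; 'aaaaaa k dudujjek' stands
    out on longest_repeat, for instance."""
    letters = "".join(c for c in name.lower() if c.isalpha())
    return {
        "vowel_ratio": sum(1 for c in letters if c in "aeiou")
                       / max(len(letters), 1),
        "longest_repeat": longest_repeat(letters),
        "token_count": len(name.split()),
    }

labelled = [("John Smith", True), ("aaaaaa k dudujjek", False)]
for name, is_real in labelled:
    print(name, is_real, features(name))
```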

As an aside, we used to use Google Maps and the telephone directory to validate customer data, years ago. If Google Maps could find the place, we called the address validated. Clearly, under stricter requirements, true validation must go much further. Let's not forget that the validation of such data is much more a social question than a linguistic one.
