How to search for a person's name in a text? (heuristic)

Posted 2024-07-09 14:06:20

I have a huge list of people's full names that I must search for in a huge text.

Only part of a name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text is not tokenized or marked up, so I don't know where a person's name starts. I also don't know whether a given name will appear in the text at all.

Example:

I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:

  • ...The candidate Barack Obama was elected the president of the United States... (incomplete)
  • ...The candidate Barack Hussein was elected the president of the United States... (incomplete)
  • ...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
  • ...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
  • ...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)
  • ...The candidate John McCain lost the election... (no occurrences of Obama's name)

Certainly there isn't a deterministic solution for it, but...

What is a good heuristic for this kind of search?

If you had to, how would you do it?

Comments (8)

梦亿 2024-07-16 14:06:20

You said it's about 200 pages.

Divide it into 200 one-page PDFs.

Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.

愚人国度 2024-07-16 14:06:20

Split the text on spaces and strip special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. Or, if you need to search a lot of documents, you could go with something like Lucene.
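
A minimal sketch of the Soundex idea, assuming Python and the standard American Soundex rules. The helper names `soundex`, `tokens`, and `sounds_like` are made up for illustration and are not from any library. It only compares single words, so incomplete or abbreviated names would need extra handling.

```python
import re

# Standard American Soundex digit groups, expanded to a letter -> digit map.
_CODES = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
    "4": "L", "5": "MN", "6": "R"}.items() for c in letters}

def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]

def tokens(text):
    """Split text into alphabetic words, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def sounds_like(name_part, text):
    """Return the words in text whose Soundex code equals that of name_part."""
    target = soundex(name_part)
    return [t for t in tokens(text) if soundex(t) == target]
```

For example, `sounds_like("Obama", "The candidate Barack OVama was elected")` picks out `OVama`, because "Obama", "ObaNa", and "OVama" all share one Soundex code.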

单挑你×的.吻 2024-07-16 14:06:20

What you want is a Natural Language Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, it will be easy; if a decent number of other proper nouns are mixed in, it will be more difficult. If you are writing in Java, look at OpenNLP; for C#, SharpNLP. After extracting all the proper nouns, you could probably use WordNet to remove most non-name proper nouns. You may be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to pick up the other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features you can take advantage of to help narrow the problem.

Using an NLP solution is the only really robust technique I have seen for similar problems. You may still have issues, since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non-names.
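
A rough sketch of the "find a known name part, then absorb the neighboring tokens" step. It uses only the Python standard library and treats capitalization as a crude stand-in for real proper-noun tagging, which is an assumption of this example and not what OpenNLP or SharpNLP do. The function name `candidate_names` is hypothetical.

```python
import re

def candidate_names(text, known_parts):
    """Return runs of capitalized tokens that contain at least one known name part."""
    known = {p.lower() for p in known_parts}
    # Words, optionally followed by a period so initials like "H." stay intact.
    tokens = re.findall(r"[A-Za-z]+\.?", text)
    runs, current = [], []
    for tok in tokens + [""]:  # the empty sentinel flushes the last run
        if tok[:1].isupper():
            current.append(tok)
        else:
            if any(t.rstrip(".").lower() in known for t in current):
                runs.append(" ".join(current))
            current = []
    return runs
```

Called with the parts of "Barack Hussein Obama", this picks out `Barack H. O.` from the abbreviated example in the question. It also shows the caveat above: given the parts "John" and "Smith", the text "shares of John Smith Industries fell" yields `John Smith Industries`, so a later filtering step would still be needed.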

执妄 2024-07-16 14:06:20

At first blush I'd go for an indexing server: Lucene, FAST, or Microsoft Indexing Server.

小情绪 2024-07-16 14:06:20

I would use C# and LINQ. I'd tokenize the text on spaces and then use LINQ to sort the tokens (and possibly use the Distinct() function) to isolate the text I'm interested in. While manipulating the text I'd keep track of each token's index (which you can do with LINQ) so that I could locate the text in the original document again, if that's a requirement.
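
The index-tracking idea can be sketched as follows. This is written in Python rather than C#/LINQ, purely as an illustration; the helper names `tokens_with_offsets` and `find_tokens` are invented for this example, and only exact (case-insensitive) matches are handled.

```python
import re

def tokens_with_offsets(text):
    """Split on whitespace, keeping each token's start offset in the original text."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

def find_tokens(text, wanted):
    """Return (token, offset) pairs whose token, minus punctuation, is in wanted."""
    wanted = {w.lower() for w in wanted}
    return [(tok, pos) for tok, pos in tokens_with_offsets(text)
            if tok.strip(".,;:!?\"'()").lower() in wanted]
```

Each returned offset points back into the original string, so `text[pos:pos + len(tok)]` recovers the matched token in place.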

病女 2024-07-16 14:06:20

The best way I can think of would be to define grammars in Python's NLTK. However, it can get quite complicated for what you want.

I'd personally go for regular expressions, generating the list of permutations programmatically.
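
One possible reading of "generate the variants, then match with a regex", sketched in Python. It assumes that name parts keep their original order, that at least two parts must appear, and that every part after the first may be shortened to an initial; these rules and the names `name_variants` and `build_pattern` are illustrative choices, not something the answer specifies. Misspellings are not covered by this sketch.

```python
import itertools
import re

def name_variants(full_name):
    """Generate ordered sub-sequences of the name parts, with optional initials."""
    parts = full_name.split()
    variants = set()
    for size in range(2, len(parts) + 1):
        for combo in itertools.combinations(parts, size):
            # The first part is kept in full; later parts may be full or an initial.
            options = [[combo[0]]] + [[p, p[0] + "."] for p in combo[1:]]
            for choice in itertools.product(*options):
                variants.add(" ".join(choice))
    return variants

def build_pattern(full_name):
    """Compile one alternation regex, longest variants first."""
    alts = sorted(name_variants(full_name), key=len, reverse=True)
    return re.compile("|".join(re.escape(v) for v in alts))
```

With `build_pattern("Barack Hussein Obama")`, a search over the abbreviated example from the question matches `Barack H. O.`, while the John McCain sentence produces no match. For a huge name list, one pattern per name would likely be slow, so the variants could instead be fed into an index or a multi-pattern matcher.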

人海汹涌 2024-07-16 14:06:20

Both SQL Server and Oracle have built-in SOUNDEX functions.

Additionally, SQL Server has a built-in function called DIFFERENCE that can be used.

许仙没带伞 2024-07-16 14:06:20

Plain old regular expression scripting will do the job.

Use Ruby; it's quite fast. Read the lines and match the words.

Cheers
