What software and techniques are there for extracting proper names from text?

Posted on 2024-10-13 07:04:07

I have a large corpus of text-based documents (100,000+) from which I want to extract proper names (e.g. a person's name).

Could anyone recommend techniques and/or software that would be useful in accomplishing this goal? I'm not particularly interested in low-level text parsing, so much as in higher-level tasks such as recognizing and/or ranking names.

Comments (4)

埋情葬爱 2024-10-20 07:04:07

Are you looking for Named Entity Recognition? Take a look at the Wikipedia article.

The Stanford NLP group has a decent ready-to-use package here, with both GPL and commercial licenses available.

温柔戏命师 2024-10-20 07:04:07

Something like this cannot be done reliably without some form of Natural Language Processing. A few common issues:

  • Names that are also common words: John Black

  • Multiple languages and various forms of the same word.

  • Names that refer to different things. Lily could be a name for a person, a place, a cat or just the flower.

NLP can use surrounding grammar constructs to tell some of these cases apart.

That said, a simple (and naive) technique that you could try would be to use the capitalisation of the words. If you see a capital starting letter in the middle of a sentence, it is usually a name of some sort.

You might be able to reasonably assume that any such word refers to the same thing within the same document. Two such words in a sequence are probably a name/surname combination etc.

If capitalisation in the documents cannot be trusted, you could instead rely on the capitalisation in a proper wordlist to obtain a list of proper names for the applicable languages.
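
As a rough illustration of the capitalisation heuristic described above, here is a minimal Python sketch. The function name `candidate_names`, the naive sentence splitting, and the special-casing of the pronoun "I" are all assumptions made for this example, not part of any library. It skips the first word of each sentence, groups consecutive capitalised words into one candidate, and counts occurrences so that candidates can be ranked by frequency.

```python
import re
from collections import Counter


def candidate_names(text):
    """Count runs of capitalised words that do not start a sentence.

    Hypothetical helper for illustration only; a naive heuristic, not NER.
    """
    counts = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        run = []
        # Skip the first word: it is capitalised whether or not it is a name.
        for word in words[1:]:
            # The pronoun "I" is always capitalised, so it is excluded.
            if word[0].isupper() and word != "I":
                run.append(word)
            else:
                if run:
                    counts[" ".join(run)] += 1
                run = []
        if run:
            counts[" ".join(run)] += 1
    return counts


text = ("Yesterday we met John Black at the station. "
        "He said Lily was late. Then John Black left.")
for name, count in candidate_names(text).most_common():
    print(name, count)
```

Note that this sketch misses names that begin a sentence and will also pick up capitalised non-names (months, acronyms, titles), which is exactly the kind of case where proper NLP does better.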

寄与心 2024-10-20 07:04:07

Probably your best bet is to compare each word against a dictionary of proper names.
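
A minimal sketch of that dictionary lookup might look like the following. The tiny `KNOWN_NAMES` set and the function `find_known_names` are hypothetical and exist only for this example; in practice the set would be loaded from a real list of names.

```python
import re

# Hypothetical, tiny name list for illustration; a real one would be
# loaded from a file of known proper names.
KNOWN_NAMES = {"john", "lily", "maria"}


def find_known_names(text, known_names=KNOWN_NAMES):
    """Return the words of `text` that appear in the name dictionary, in order."""
    words = re.findall(r"[A-Za-z]+", text)
    # Compare case-insensitively so capitalisation in the text is not required.
    return [w for w in words if w.lower() in known_names]


print(find_known_names("John gave Lily a lily."))
```

The lowercase "lily" in the example sentence is matched too, which shows the ambiguity problem mentioned earlier in the thread: a plain lookup cannot tell the person from the flower.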

若言繁花未落 2024-10-20 07:04:07

What if you made a list of all of the unique words, then removed all of the words that are in a dictionary?
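
That idea can be sketched as a set difference. The `COMMON_WORDS` set and the function `unknown_words` below are assumptions for illustration; a real run would use a full dictionary wordlist for the language in question.

```python
import re

# Hypothetical stand-in for a full dictionary wordlist.
COMMON_WORDS = {"the", "cat", "sat", "on", "a", "mat", "with", "and"}


def unknown_words(text, dictionary=COMMON_WORDS):
    """Return the unique words of `text` that are not in the dictionary, sorted."""
    unique = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return sorted(unique - dictionary)


print(unknown_words("The cat sat on the mat with Lily and Zoltan"))
```

Anything left over is only a candidate: typos and rare words survive the filter, while names that are also dictionary words (such as "Black") are removed.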
