Strategies for recognizing proper nouns in NLP

Posted on 2024-07-15 00:57:08

I'm interested in learning more about Natural Language Processing (NLP) and am curious whether there are currently any strategies for recognizing proper nouns in a text that aren't based on dictionary recognition. Also, could anyone explain, or link to resources that explain, the current dictionary-based methods? Who are the authoritative experts on NLP, and what are the definitive resources on the subject?


Comments (8)

柒七 2024-07-22 00:57:09

Besides the dictionary-based approach, two others come to my mind:

  • Pattern-based approaches (in a simple form: anything that is capitalized is a proper noun)
  • Machine learning approaches (mark proper nouns in a training corpus and train a classifier)

The field is usually called named-entity extraction and is often considered a subfield of information extraction. A good starting point for the different fields of NLP is usually the corresponding chapter in the Oxford Handbook of Computational Linguistics:

Oxford Handbook of Computational Linguistics
(source: oup.com)
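
The simple pattern-based idea mentioned above (treat capitalized words as proper nouns) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `capitalized_candidates` is made up for this example, sentence splitting is deliberately naive, and the first word of each sentence is skipped because it is capitalized regardless of its word class.

```python
import re

def capitalized_candidates(text):
    """Return capitalized tokens that do not start a sentence.

    Hypothetical helper for illustration; not taken from any library.
    """
    candidates = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        # Skip the first token, which is capitalized in any case.
        for token in tokens[1:]:
            if token[0].isupper():
                candidates.append(token)
    return candidates

print(capitalized_candidates("Yesterday Alice flew to Paris. She met Bob there."))
# ['Alice', 'Paris', 'Bob']
```

A heuristic like this misses sentence-initial names and wrongly flags words such as the pronoun "I" or capitalized titles, so it is probably best treated as a candidate generator rather than a final answer.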

赠佳期 2024-07-22 00:57:09

It depends on what you mean by dictionary-based.

For example, one strategy would be to take things that aren't in a dictionary and try to proceed on the assumption that they're proper nouns. If this leads to a sensible parse, consider the assumption provisionally validated and keep going, otherwise conclude that they aren't.

Other ideas:

  • In subject position, any simple subject without a determiner is a good candidate.
  • Ditto in prepositional phrases
  • In any position, the basis of a possessive determiner (e.g. Bob in "Bob's sister") is a good candidate

-- MarkusQ
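
Two of the ideas above, treating out-of-dictionary words as candidates and picking up the base of a possessive, might look something like this in code. It is a rough sketch, not the answerer's implementation: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real lexicon, the function names are invented for this example, and the step of checking whether the assumption leads to a sensible parse is not implemented here.

```python
import re

# Hypothetical miniature lexicon; a real system would load a full word list.
KNOWN_WORDS = {"sister", "lives", "near", "the", "a", "in", "of"}

def tokenize(text):
    # Keep a trailing 's attached so possessives can be recognised later.
    return re.findall(r"[A-Za-z]+(?:'s)?", text)

def unknown_word_candidates(text, lexicon=KNOWN_WORDS):
    """Words missing from the lexicon are provisionally assumed to be proper nouns."""
    candidates = []
    for token in tokenize(text):
        base = token[:-2] if token.endswith("'s") else token
        if base.lower() not in lexicon:
            candidates.append(base)
    return candidates

def possessive_bases(text):
    """Return the word in front of each 's, e.g. Bob in "Bob's sister"."""
    return [token[:-2] for token in tokenize(text) if token.endswith("'s")]

sentence = "Bob's sister lives near Paris"
print(unknown_word_candidates(sentence))  # ['Bob', 'Paris']
print(possessive_bases(sentence))         # ['Bob']
```

Note that the possessive check would also fire on contractions such as "it's", so in practice it would need a pronoun filter or a part-of-speech tagger in front of it.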

无声静候 2024-07-22 00:57:09

Try searching for "named entity recognition"; that's the term used in the NLP literature for this sort of thing.

昔梦 2024-07-22 00:57:09

If you have a sentence such as "who is bill gates?" and apply a part-of-speech tagger to it, the output is:

"who/WP is/VBZ bill/NN gates/NNS ?/."

You can try this online at
http://cst.dk/online/pos_tagger/uk/

This gives you all the nouns in the sentence, which you can then extract with a simple algorithm. If you are doing natural language processing, I suggest using Python: it has NLTK (Natural Language Toolkit) to work with.
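
As a minimal sketch of the extraction step, the tagged string above can be parsed with the standard library alone. The helper names `parse_tagged` and `extract_nouns` are made up for this example, and the tags are assumed to follow the Penn Treebank convention, where noun tags start with `NN` and proper nouns are `NNP` or `NNPS`.

```python
def parse_tagged(tagged):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # rpartition splits on the last '/', so a word containing '/' stays intact.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

def extract_nouns(pairs):
    """Keep every token whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in pairs if tag.startswith("NN")]

tagged = "who/WP is/VBZ bill/NN gates/NNS ?/."
pairs = parse_tagged(tagged)
print(extract_nouns(pairs))  # ['bill', 'gates']

# Proper-noun tags only; empty here because the tagger labelled the
# lowercase input as common nouns.
print([word for word, tag in pairs if tag in ("NNP", "NNPS")])  # []
```

In this particular output the tagger labelled "bill" and "gates" as common nouns, so filtering on proper-noun tags alone would find nothing; the noun filter only narrows down the candidates.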

酷炫老祖宗 2024-07-22 00:57:09

Some suggested toolkits:
1. OpenNLP: has a Named Entity Recognition component for your task
2. LingPipe: also has an NER component
3. Stanford NLP package: an excellent package for academic use, though maybe not commercially friendly
4. NLTK: a Python NLP package

塔塔猫 2024-07-22 00:57:09

If you're interested in implementing natural language processing and Python is your programming language, then this can be a very informative resource: http://www.youtube.com/watch?v=kKe4M4iSclc

自在安然 2024-07-22 00:57:09

Although this paper is about Bengali, it outlines a general procedure for identifying proper nouns, so I hope it will be helpful to you.
Please check the following link:
http://www.mecs-press.org/ijmecs/ijmecs-v6-n8/v6n8-1.html

糖果控 2024-07-22 00:57:08

The task of determining the correct part of speech for a word in a text is called part-of-speech tagging. The Brill tagger, for example, uses a mixture of dictionary (vocabulary) words and contextual rules. I believe that some of the important initial dictionary words for this task are the stop words.
Once you have (mostly correct) parts of speech for your words, you can start building larger structures. This industry-oriented book differentiates between recognizing noun phrases (NPs) and recognizing named entities.
About textbooks: Allen's Natural Language Understanding is a good, if somewhat dated, book. Foundations of Statistical Natural Language Processing is a nice introduction to statistical NLP. Speech and Language Processing is a bit more rigorous and maybe more authoritative. The Association for Computational Linguistics is a leading scientific society for computational linguistics.
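
To make the "dictionary words plus contextual rules" idea concrete, here is a toy sketch in the spirit of a Brill-style tagger. It is not the Brill tagger itself: the lexicon and the single rule are hand-written for this example, whereas the real tagger learns its transformation rules from a tagged corpus. The names `LEXICON`, `RULES`, and `tag` are assumptions made for illustration.

```python
# Hypothetical miniature lexicon: each known word gets its most likely tag.
LEXICON = {"the": "DT", "can": "MD", "open": "VB"}

# Contextual rules of the form (from_tag, to_tag, previous_tag):
# "change from_tag to to_tag when the previous tag is previous_tag".
RULES = [("MD", "NN", "DT")]

def initial_tags(words):
    """Step 1: dictionary lookup, with a capitalization guess for unknown words."""
    tags = []
    for word in words:
        if word.lower() in LEXICON:
            tags.append(LEXICON[word.lower()])
        elif word[0].isupper():
            tags.append("NNP")  # unknown and capitalized: guess proper noun
        else:
            tags.append("NN")   # unknown and lowercase: guess common noun
    return tags

def tag(words):
    """Step 2: patch the initial tags with the contextual rules, in order."""
    tags = initial_tags(words)
    for from_tag, to_tag, previous_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return tags

words = "Alice can open the can".split()
print(list(zip(words, tag(words))))
# [('Alice', 'NNP'), ('can', 'MD'), ('open', 'VB'), ('the', 'DT'), ('can', 'NN')]
```

The second "can" starts out as a modal from the lexicon and is corrected to a noun because it follows a determiner, which is the kind of repair contextual rules are meant to make.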
