Stanford NER toolkit - lowercase entity recognition

Posted on 2024-10-03 13:02:16

I am new to NLP and am trying to understand how a named entity recognizer annotates named entities. I am experimenting with the Stanford NER toolkit. When I run it on standard, more formal datasets that follow the usual naming conventions, such as newswire or news blogs, it annotates the entities correctly. However, when I run it on informal data such as Twitter, where named entities are often not capitalized as they should be, it does not annotate them. The classifier I am using is a 3-CRF serialized classifier. Can anybody tell me how to make the NER recognize lowercase entities too? Any suggestions on how to hack the NER, and on where such an improvement would need to be made, are greatly appreciated. Thanks in advance for all your help.

Comments (5)

明媚如初 2024-10-10 13:02:16

I know this is an old thread, but I hope this helps someone. As Christopher Manning replied, the way to get lowercase entities detected is to replace english.muc.7class.distsim.crf.ser.gz with english.muc.7class.caseless.distsim.crf.ser.gz, which you get when you unzip the CoreNLP caseless jar file.

For example, in my Python file I kept everything the same except for pointing to the new model file, and it works perfectly (well, most of the time):

st = NERTagger('/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz', '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')

爱,才寂寞 2024-10-10 13:02:16

I'm afraid there isn't an easy way to get the trained models we distribute to ignore case information at runtime. So, yes, they'll usually only label capitalized names. It would be possible to train a caseless model, which would work reasonably well, but not as well as a cased model does on cased text, since case is a big clue in English (though not in German, Chinese, Arabic, etc.).

情深已缘浅 2024-10-10 13:02:16

In addition to other people's suggestions: if you're using a feature-based classifier, I would definitely add the 100-200 most common 3-4 letter substrings found in people's names as features, or build a gazetteer and expose it under one recognized feature. Certain patterns are bound to show up quite a bit in personal names but not very often in other types of words, like "eli".
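The substring idea above can be sketched roughly as follows. This is only an illustration in plain Python, not part of Stanford NER: the helper names (`char_ngrams`, `top_name_ngrams`, `name_ngram_features`) and the tiny name list are made up for the example, and a real system would count substrings over a large list of personal names and pass the resulting features to whatever classifier it uses.

```python
from collections import Counter


def char_ngrams(word, sizes=(3, 4)):
    """Return all lowercase character substrings of the given lengths."""
    w = word.lower()
    return [w[i:i + n] for n in sizes for i in range(len(w) - n + 1)]


def top_name_ngrams(names, k=200):
    """Pick the k substrings that occur in the largest number of names."""
    counts = Counter()
    for name in names:
        # set() so each name contributes at most once per substring
        counts.update(set(char_ngrams(name)))
    return {gram for gram, _ in counts.most_common(k)}


def name_ngram_features(token, name_ngrams):
    """Binary features for a token: one per name-like substring it contains."""
    return {"has_name_ngram=" + g: 1 for g in char_ngrams(token) if g in name_ngrams}


# Toy name list (hypothetical); use a large list of personal names in practice.
names = ["Elijah", "Elizabeth", "Eli", "Elise", "Felix"]
top = top_name_ngrams(names, k=5)
features = name_ngram_features("elisa", top)
print(features)
```

Because the substrings are lowercased before counting, the features fire the same way on "Elisa" and "elisa", which is the point when capitalization is unreliable.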

朮生 2024-10-10 13:02:16

I think Twitter is going to be very difficult for this application. Capital letters are a big clue and, as you say, they are often missing on Twitter. A dictionary check to remove valid English words is of limited use, because Twitter texts include a huge number of abbreviations and they're often unique.

Perhaps part-of-speech tagging and frequency analysis could both be used to help improve the detection of proper nouns?
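One possible reading of "frequency analysis" here, offered only as a sketch: count how often each word is capitalized in a properly cased background corpus, then treat lowercase tokens that are almost always capitalized there as proper-noun candidates. The function names, the 0.8 threshold, and the toy corpus below are assumptions for illustration, not something the answer specifies.

```python
from collections import Counter


def capitalization_rates(sentences):
    """Fraction of non-sentence-initial occurrences of each word that are capitalized."""
    total, capped = Counter(), Counter()
    for sent in sentences:
        for tok in sent[1:]:  # skip the first token: it is capitalized regardless
            if not tok.isalpha():
                continue
            key = tok.lower()
            total[key] += 1
            if tok[0].isupper():
                capped[key] += 1
    return {w: capped[w] / total[w] for w in total}


def likely_proper_nouns(tokens, rates, threshold=0.8):
    """Tokens whose background capitalization rate meets the threshold."""
    return [t for t in tokens if rates.get(t.lower(), 0.0) >= threshold]


# Toy cased background corpus (hypothetical); use a large news corpus in practice.
background = [
    ["She", "moved", "to", "London", "last", "year"],
    ["He", "met", "Obama", "in", "London"],
    ["They", "moved", "last", "week"],
]
rates = capitalization_rates(background)

tweet = ["just", "landed", "in", "london", "to", "see", "obama"]
candidates = likely_proper_nouns(tweet, rates)
print(candidates)
```

Words never seen in the background corpus get a rate of 0.0 here, so the many unique abbreviations mentioned above would simply be ignored rather than flagged.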

挽你眉间 2024-10-10 13:02:16

The question is a bit old, but somebody else may be able to benefit from this idea.

One way to train a classifier for lowercase text would be to run the cased classifier you already have against a large dataset of properly written English, then process the tagged text to remove case. You then have a tagged corpus that you can use to train a new classifier. This new classifier won't be perfect on Twitter because of the peculiarities of tweets, but it's a quick way to bootstrap one.
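The case-stripping step could look roughly like this. It is a minimal sketch: the `tagged` list is a hand-written stand-in for the output of the existing cased classifier, the helper names are hypothetical, and the one-token-per-line, tab-separated token/label layout with a blank line between sentences is my understanding of the column format Stanford's CRF trainer reads, so check it against the Stanford NER documentation before relying on it.

```python
def strip_case(tagged_sentences):
    """Lowercase every token while keeping the label the cased model assigned."""
    return [[(tok.lower(), label) for tok, label in sent] for sent in tagged_sentences]


def to_tsv(tagged_sentences):
    """Serialize as one 'token<TAB>label' line per token, blank line between sentences."""
    blocks = [
        "\n".join(f"{tok}\t{label}" for tok, label in sent)
        for sent in tagged_sentences
    ]
    return "\n\n".join(blocks) + "\n"


# Stand-in for output of the cased classifier on well-formed English text.
tagged = [
    [("Barack", "PERSON"), ("Obama", "PERSON"), ("visited", "O"), ("Paris", "LOCATION")],
    [("Stanford", "ORGANIZATION"), ("is", "O"), ("nearby", "O")],
]

caseless = strip_case(tagged)
training_text = to_tsv(caseless)
print(training_text)
```

The labels in the resulting corpus are only as good as the cased classifier's predictions, so any errors it makes are baked into the new training data.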
