Stanford NER Toolkit - Lowercase Entity Recognition

Posted 2024-10-03 13:02:16

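I'm new to NLP and trying to understand how a named entity recognizer annotates named entities. I'm experimenting with the Stanford NER toolkit. When I run it on standard, more formal datasets that follow the usual conventions for writing named entities (newswire or news blogs, for example), it annotates the entities correctly. But when I run it on informal data such as Twitter, where named entities are often not capitalized the way they should be, it does not annotate them. The classifier I'm using is the 3-class CRF serialized classifier. Can anyone tell me how to make the NER recognize lowercase entities as well? Any useful suggestions on how to hack the NER and where to make improvements would be greatly appreciated. Thanks in advance for all your help.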


明媚如初 2024-10-10 13:02:16

I know this is an old thread, but I hope it helps someone. As Christopher Manning replied, the way to get lowercase entities detected is to replace english.muc.7class.distsim.crf.ser.gz with english.muc.7class.caseless.distsim.crf.ser.gz, which you get when you unzip the CoreNLP caseless jar file.

For example, in my Python file I kept everything the same except for pointing to the new file, and it works perfectly (well, most of the time):

st = NERTagger('/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz', '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')
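To make the file swap a bit less error-prone, here is a minimal sketch of it as a small helper. The helper names `caseless_variant` and `make_tagger` are hypothetical, not part of any library. It assumes the caseless model follows the naming pattern quoted above (`.distsim.` becomes `.caseless.distsim.`), and that the `NERTagger` call above is NLTK's Stanford wrapper, which newer NLTK releases expose as `StanfordNERTagger`. Whether your install matches is worth checking before relying on it.

```python
from pathlib import PurePosixPath


def caseless_variant(model_path: str) -> str:
    """Return the caseless counterpart of a Stanford NER model path.

    Assumption: the caseless file differs from the cased one only by an
    extra '.caseless' segment before '.distsim', as in the answer above.
    """
    path = PurePosixPath(model_path)
    name = path.name
    if ".caseless." in name:
        return str(path)  # already points at a caseless model
    marker = ".distsim."
    if marker not in name:
        raise ValueError(f"unexpected model file name: {name}")
    # Insert '.caseless' once, right before the '.distsim' segment.
    return str(path.with_name(name.replace(marker, ".caseless" + marker, 1)))


def make_tagger(model_path: str, jar_path: str):
    """Build a tagger on the caseless model (needs NLTK, Java and the jar).

    Assumption: NLTK exposes the wrapper as StanfordNERTagger; older
    releases used the name NERTagger, as in the answer above.
    """
    from nltk.tag.stanford import StanfordNERTagger  # imported lazily
    return StanfordNERTagger(caseless_variant(model_path), jar_path)
```

With the caseless model unzipped into the same classifiers directory, something like `make_tagger(model, jar).tag("barack obama visited paris".split())` should return a list of (token, tag) pairs. As noted above, expect it to work most of the time rather than always.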
