Building a part-of-speech tagger (POS tagger)
I need to build a POS tagger in Java and need to know how to get started. Are there code examples or other resources that help illustrate how POS taggers work?
Comments (3)
Try Apache OpenNLP. It includes a POS tagger tool. You can download ready-to-use English models from the OpenNLP project site.
The documentation provides details about how to use it from a Java application. Basically, you need to do the following:
Load the POS model
Instantiate the POS tagger
Execute it
Note that the POS tagger expects a tokenized sentence. Apache OpenNLP also provides tools and models to help with these tasks.
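To make "a tokenized sentence" concrete, here is a minimal sketch of the input shape a tagger expects: an array of tokens rather than a raw string. The class name `SimpleTokenizer` and its regular expression are assumptions made up for this illustration; it is not OpenNLP's tokenizer. Pretrained models generally expect the tokenization they were trained with, so with OpenNLP you would normally use its own tokenizer tools.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Apache OpenNLP.
class SimpleTokenizer {
    // A token is either a run of letters/digits (optionally joined by an
    // internal apostrophe or hyphen) or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*|[^\\s\\p{L}\\p{N}]");

    /** Splits a raw sentence into the token array a tagger would consume. */
    public static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("The quick brown fox doesn't jump, does it?");
        // Punctuation becomes its own token; "doesn't" stays in one piece here.
        System.out.println(String.join(" | ", tokens));
    }
}
```

A tagger then returns one tag per element of that array, so the token and tag arrays always have the same length.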
If you have to train your own model, refer to the OpenNLP documentation on tagger training.
You can examine existing tagger implementations.
For example, see the Stanford University POS tagger in Java (by Kristina Toutanova). It is available under the GNU General Public License (v2 or later), and its source code is well written and clearly documented:
http://nlp.stanford.edu/software/tagger.shtml
A good book on tagging is:
Speech and Language Processing (2nd Edition) by Daniel Jurafsky, James H. Martin
There are a few widely used POS/NER taggers.
OpenNLP Maxent POS tagger: using Apache OpenNLP.
OpenNLP is a powerful Java NLP library from Apache. It provides various NLP tools, one of which is a part-of-speech (POS) tagger. POS taggers are usually used to find the grammatical structure of a text. You start from a tagged dataset in which each word (part of a phrase) is labeled with a tag, build an NLP model from that dataset, and then use the model to generate a tag for each word in new text.
A detailed blog post with the full sample code on how to use it:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php?s=so
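The train-then-tag workflow described above can be illustrated without any library. The sketch below is a most-frequent-tag baseline, not OpenNLP's maximum-entropy model: it counts how often each word carries each tag in a tiny tagged dataset, then labels new tokens with the tag seen most often for that word. The class name `BaselineTagger`, the `word_TAG` input format, and the training sentences are all assumptions made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical baseline for illustration; real taggers also use context.
class BaselineTagger {
    // word -> (tag -> count), collected from the tagged training data
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // tag -> total count, used to pick a fallback tag for unseen words
    private final Map<String, Integer> tagTotals = new HashMap<>();

    /** Trains on sentences written as "word_TAG word_TAG ...". */
    public void train(String... taggedSentences) {
        for (String sentence : taggedSentences) {
            for (String pair : sentence.trim().split("\\s+")) {
                int sep = pair.lastIndexOf('_');
                if (sep <= 0 || sep == pair.length() - 1) {
                    throw new IllegalArgumentException("Expected word_TAG, got: " + pair);
                }
                String word = pair.substring(0, sep).toLowerCase();
                String tag = pair.substring(sep + 1);
                counts.computeIfAbsent(word, k -> new HashMap<>()).merge(tag, 1, Integer::sum);
                tagTotals.merge(tag, 1, Integer::sum);
            }
        }
    }

    // Returns the tag with the highest count; ties go to the alphabetically
    // smaller tag so the result is deterministic. Null if the map is empty.
    private static String mostFrequent(Map<String, Integer> tagCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            int c = e.getValue();
            if (c > bestCount || (c == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = c;
            }
        }
        return best;
    }

    /** Tags each token with its most frequent training tag. */
    public String[] tag(String[] tokens) {
        String fallback = mostFrequent(tagTotals); // used for unseen words
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Map<String, Integer> seen = counts.get(tokens[i].toLowerCase());
            tags[i] = (seen != null) ? mostFrequent(seen) : fallback;
        }
        return tags;
    }

    public static void main(String[] args) {
        BaselineTagger tagger = new BaselineTagger();
        tagger.train(
                "the_DT dog_NN chases_VBZ the_DT cat_NN",
                "a_DT bird_NN sings_VBZ",
                "time_NN flies_VBZ");
        String[] tokens = {"the", "bird", "chases", "zebra"};
        // "zebra" is unseen, so it gets the overall most frequent tag (NN).
        System.out.println(String.join(" ", tagger.tag(tokens))); // prints: DT NN VBZ NN
    }
}
```

A baseline like this ignores context, which is the gap that statistical taggers such as the maximum-entropy models mentioned above are meant to close by also looking at surrounding words and tags.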
Stanford CoreNLP-based NER tagger:
Stanford CoreNLP is arguably the most battle-tested NLP library out there and, in a way, the gold standard of NLP performance today. Among its other features, the library supports named entity recognition (NER), which tags important entities in a piece of text, such as the names of people and places.
A detailed blog post with the full sample code on how to use it:
https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so