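How do I use OpenNLP with Java?
I want to POS-tag an English sentence and do some processing on the result. I would like to use OpenNLP, and I have it installed.
When I execute the command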
I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt
it gives the following output, POS-tagging the input in Text.txt:
Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.
Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s
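I hope this means it is installed properly?
Now, how do I do this POS tagging from inside a Java application? I have added the opennlp-tools, jwnl, and maxent jars to the project, but how do I invoke the POS tagger?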
Comments (3)
Here's some (old) sample code I threw together, with modernized code to follow:
The output is:
This is basically working from the POSTaggerTool class included as part of OpenNLP. The sample.getTags() is a String array that has the tag types themselves. This requires direct file access to the training data, which is really, really lame.
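Once the tokens and the tag array are in hand, the "some processing" from the question is plain Java. The following is a minimal sketch, assuming the tag array is index-aligned with the token array. NounFilter and nouns are hypothetical names, and the arrays are hard-coded from the command-line output in the question rather than produced by OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not an OpenNLP class.
class NounFilter {

    /**
     * Returns the tokens whose tag starts with "NN"
     * (NN, NNS, NNP, NNPS in the Penn Treebank tag set).
     * Assumption: tags[i] is the tag for tokens[i].
     */
    static List<String> nouns(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags differ in length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith("NN")) {
                result.add(tokens[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens and tags copied from the tagged sentence in the question.
        String[] tokens = {"My", "name", "is", "Shabab", "i", "am", "22", "years", "old."};
        String[] tags = {"PRP$", "NN", "VBZ", "NNP", "FW", "VBP", "CD", "NNS", "."};
        System.out.println(nouns(tokens, tags)); // [name, Shabab, years]
    }
}
```

In a real program the two arrays would come from the tokenizer and the tagger instead of literals.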
An updated codebase for this is a little different (and probably more useful).
First, a Maven POM:
And here's the code, written as a test, therefore located in ./src/test/java/org/javachannel/opennlp/example:
This code doesn't actually test anything - it's a smoke test, if anything - but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you don't have it downloaded already.
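The download-if-missing idea can be sketched with the standard library alone. This is not the code from the test described above, only an illustration of the pattern. ModelFetcher and fetchIfMissing are hypothetical names, and the model URL is deliberately left as a parameter rather than guessed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of OpenNLP.
class ModelFetcher {

    /**
     * Copies the resource at 'source' to 'target' unless 'target' already exists.
     * Returns true if a download happened, false if the file was already there.
     */
    static boolean fetchIfMissing(URL source, Path target) throws IOException {
        if (Files.exists(target)) {
            return false;
        }
        Path parent = target.toAbsolutePath().getParent();
        Files.createDirectories(parent);
        // Download to a temporary file first, so an interrupted download
        // never leaves a truncated model under the final name.
        Path tmp = Files.createTempFile(parent, "model-", ".part");
        try (InputStream in = source.openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ModelFetcher <model-url> <target-file>");
            return;
        }
        boolean downloaded = fetchIfMissing(URI.create(args[0]).toURL(), Paths.get(args[1]));
        System.out.println(downloaded ? "downloaded" : "already present");
    }
}
```

Calling fetchIfMissing before loading the model keeps repeated runs from fetching the same file again.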
The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html does not work anymore. I found the below on the 14th slide at http://www.slideshare.net/gagan1667/opennlp-demo
The above answer does provide a way to use the existing models from OpenNLP, but if you need to train your own model, maybe the below can help:
Here is a detailed tutorial with full code:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Depending on your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful; tools like a POS tagger can make the process much easier.
Training data format
Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in a format like "word_LABEL", where the word and the label name are separated by an underscore '_'.
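To make the format concrete, here is a small sketch that writes and reads such lines using only the standard library. WordTagFormat and its methods are hypothetical helpers, not OpenNLP classes, and the rule that the label follows the last underscore is an assumption made so that words containing underscores still round-trip.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not an OpenNLP class.
class WordTagFormat {

    /** Joins parallel word/label arrays into one training line. */
    static String format(String[] words, String[] labels) {
        if (words.length != labels.length) {
            throw new IllegalArgumentException("words and labels differ in length");
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                line.append(' ');
            }
            line.append(words[i]).append('_').append(labels[i]);
        }
        return line.toString();
    }

    /**
     * Splits one training line back into {word, label} pairs.
     * Assumption: the label is whatever follows the LAST underscore.
     */
    static String[][] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0][];
        }
        String[] tokens = trimmed.split("\\s+");
        String[][] pairs = new String[tokens.length][];
        for (int i = 0; i < tokens.length; i++) {
            int sep = tokens[i].lastIndexOf('_');
            // Reject tokens with no underscore, an empty word, or an empty label.
            if (sep <= 0 || sep == tokens[i].length() - 1) {
                throw new IllegalArgumentException("Expected word_LABEL but got: " + tokens[i]);
            }
            pairs[i] = new String[] {tokens[i].substring(0, sep), tokens[i].substring(sep + 1)};
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = format(new String[] {"I", "like", "tea"}, new String[] {"PRP", "VBP", "NN"});
        System.out.println(line);                             // I_PRP like_VBP tea_NN
        System.out.println(Arrays.deepToString(parse(line))); // [[I, PRP], [like, VBP], [tea, NN]]
    }
}
```

Running parse over every line of a training file before training is a cheap way to catch tokens that are missing a label.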
Train model
The important class here is POSModel, which holds the actual model. We use the POSTaggerME class to do the model building. Below is the code to build a model from a training data file:
Use the model to do tagging
Finally, we can see how the model can be used to tag unseen queries: