What tag set does OpenNLP's German maxent model use?
Currently I am using the OpenNLP tools to PoS-tag German sentences, with the maxent model listed on their download site:
de POS Tagger Maxent model trained on tiger corpus. de-pos-maxent.bin
This works very well, and I get results such as:
Diese, Community, bietet, Teilnehmern, der, Veranstaltungen, die, Möglichkeit ... PDAT, FM, VVFIN, NN, ART, NN, ART, NN ...
I want to do some further processing on the tagged sentences, for which I need to know what the individual tags mean. Unfortunately, searching the OpenNLP wiki for the tag sets isn't very helpful, as it only says:
TODO: Add more tag sets, also for non-english languages
Does anyone know where I can find the tag set used in the German maxent model?
Comments (3)
I created an enum containing the German tags (reverse lookup is possible):
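A minimal sketch of what such an enum could look like, not the answer's original code. It assumes the tagger emits STTS tags (as the other answers suggest), covers only a subset of the tag set, and uses hypothetical names (`SttsTag`, `fromTag`, `tag`, `description`). The punctuation tags (`$,`, `$.`, `$(`) are not valid Java identifiers, so each constant stores its tag string separately and the reverse lookup goes through a map rather than `valueOf`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Subset of the STTS tag set; extend with the remaining tags as needed.
enum SttsTag {
    ADJA("ADJA", "attributive adjective"),
    ADJD("ADJD", "adverbial or predicative adjective"),
    ADV("ADV", "adverb"),
    APPR("APPR", "preposition"),
    ART("ART", "definite or indefinite article"),
    CARD("CARD", "cardinal number"),
    FM("FM", "foreign-language material"),
    KON("KON", "coordinating conjunction"),
    NN("NN", "common noun"),
    NE("NE", "proper noun"),
    PDAT("PDAT", "attributive demonstrative pronoun"),
    PPER("PPER", "non-reflexive personal pronoun"),
    VVFIN("VVFIN", "finite full verb"),
    VAFIN("VAFIN", "finite auxiliary verb"),
    VMFIN("VMFIN", "finite modal verb"),
    COMMA("$,", "comma"),
    SENTENCE_END("$.", "sentence-final punctuation"),
    OTHER_PUNCT("$(", "other sentence-internal punctuation");

    // Reverse-lookup table from the raw tag string to the enum constant.
    private static final Map<String, SttsTag> BY_TAG = new HashMap<>();

    static {
        for (SttsTag t : values()) {
            BY_TAG.put(t.tag, t);
        }
    }

    private final String tag;
    private final String description;

    SttsTag(String tag, String description) {
        this.tag = tag;
        this.description = description;
    }

    /** The tag string as it appears in the tagger output. */
    String tag() {
        return tag;
    }

    /** A short English description of the tag. */
    String description() {
        return description;
    }

    /** Reverse lookup; empty if the tag is not covered by this subset. */
    static Optional<SttsTag> fromTag(String tag) {
        return Optional.ofNullable(BY_TAG.get(tag));
    }
}
```

With this sketch, `SttsTag.fromTag("VVFIN")` yields the `VVFIN` constant, whose `description()` is "finite full verb". Returning an `Optional` makes the caller handle tags that are missing from the subset or that the model adds beyond plain STTS.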
It seems very likely that the STTS tag set (Stuttgart-Tübingen-TagSet) is used. It is said to be the most common tag set for German, e.g. in this question or in this Wikipedia entry.
It is my understanding that the OpenNLP POS tagger for German was trained on the Tiger corpus. This corpus does indeed use the STTS tag set, with minor modifications. I found the following helpful: A Brief Introduction to the Tiger Sample Corpus