How do I create my own training corpus for the Stanford tagger?
I have to analyze informal English text with lots of shorthand and local lingo. Hence I was thinking of creating a model for the Stanford tagger.
How do I create my own set of labelled corpora for the Stanford tagger to train on?
What is the syntax of the corpus, and how large should my corpus be in order to achieve desirable performance?
To train the PoS tagger, see this mailing list post, which is also included in the JavaDocs for the MaxentTagger class.
The JavaDocs for the edu.stanford.nlp.tagger.maxent.Train class specify the training format:
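In short: for the tagger itself, the training data is by default one sentence per line, with each word joined to its tag by the tagSeparator character, and training is driven by a properties file. A minimal sketch (the file names and the arch value below are illustrative placeholders, not a recommended configuration):

```
# train.tagged -- one sentence per line, word_tag tokens
The_DT team_NN uses_VBZ informal_JJ English_NNP ._.

# myTagger.props -- illustrative values
model = my-informal-english.tagger
arch = generic
trainFile = train.tagged
tagSeparator = _

# then train with something like:
# java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props myTagger.props
```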
Essentially, the texts that you format for the training process should have one token on each line, followed by a tab, followed by an identifier. The identifier may be something like "LOC" for location, "COR" for corporation, or "0" for non-entity tokens. E.g.
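A fragment of such a training file might look like this (the tokens are hypothetical; the two columns are tab-separated):

```
He	0
visited	0
Paris	LOC
and	0
works	0
for	0
Acme	COR
.	0
```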
When our team trained a series of classifiers, we fed each a training file formatted like this with roughly 180,000 tokens, and we saw a net improvement in precision but a net decrease in recall. (It bears noting that the increase in precision was not statistically significant.) In case it might be useful to others, I described the process we used to train the classifier as well as the p, r, and f1 values of both trained and default classifiers here.
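For reference, training such a classifier is typically driven by a properties file passed to CRFClassifier; here is a minimal sketch (the file names and feature flags are illustrative, not the exact configuration we used):

```
# ner.props -- illustrative values
trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
usePrev = true
useNext = true

# then train with something like:
# java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.props
```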
For the Stanford Parser, you use Penn Treebank format; see Stanford's FAQ for the exact commands to use. The JavaDocs for the LexicalizedParser class also give the appropriate commands, in particular:
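Penn Treebank training data consists of bracketed parse trees; a hand-made illustration of the format (not from any real treebank):

```
( (S
    (NP (DT The) (NN tagger))
    (VP (VBZ works))
    (. .)) )
```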
I tried:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \
  -train trainFilesPath fileRange \
  -saveToSerializedFile serializedGrammarFilename
But I got this error:
Error: Could not find or load main class edu.stanford.nlp.parser.lexparser.LexicalizedParser
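That error means the JVM cannot find the class, which almost always means the parser jar is missing from the classpath. A hedged sketch of the fix, assuming the jar from the standard distribution sits in the current directory (the jar name varies by version), with the optional [-v] flag dropped since the brackets only mark it as optional:

```
java -mx1500m -cp "stanford-parser.jar" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -train trainFilesPath fileRange \
  -saveToSerializedFile serializedGrammarFilename
```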