How do I create my own training corpus for the Stanford tagger?
I have to analyze informal English text with lots of shorthand and local lingo. Hence I was thinking of creating a model for the Stanford tagger.
How do I create my own set of labelled corpora for the Stanford tagger to train on?
What is the syntax of the corpus, and how large should my corpus be in order to achieve desirable performance?
To train the PoS tagger, see this mailing list post, which is also included in the JavaDocs for the MaxentTagger class.
The JavaDocs for the edu.stanford.nlp.tagger.maxent.Train class specify the training format:
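In short: for the tagger itself, the training data is by default one sentence per line, with each word joined to its tag by the tagSeparator character, and training is driven by a properties file. A minimal sketch (the file names and the arch value below are illustrative placeholders, not a recommended configuration):

```
# train.tagged -- one sentence per line, word_tag tokens
The_DT team_NN uses_VBZ informal_JJ English_NNP ._.

# myTagger.props -- illustrative values
model = my-informal-english.tagger
arch = generic
trainFile = train.tagged
tagSeparator = _

# then train with something like:
# java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props myTagger.props
```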
Essentially, the texts that you format for the training process should have one token on each line, followed by a tab, followed by an identifier. The identifier may be something like "LOC" for location, "COR" for corporation, or "0" for non-entity tokens. E.g.
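A fragment of such a training file might look like this (the tokens are hypothetical; the two columns are tab-separated):

```
He	0
visited	0
Paris	LOC
and	0
works	0
for	0
Acme	COR
.	0
```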
When our team trained a series of classifiers, we fed each a training file formatted like this with roughly 180,000 tokens, and we saw a net improvement in precision but a net decrease in recall. (It bears noting that the increase in precision was not statistically significant.) In case it might be useful to others, I described the process we used to train the classifier as well as the p, r, and f1 values of both trained and default classifiers here.
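For reference, training such a classifier is typically driven by a properties file passed to CRFClassifier; here is a minimal sketch (the file names and feature flags are illustrative, not the exact configuration we used):

```
# ner.props -- illustrative values
trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
usePrev = true
useNext = true

# then train with something like:
# java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.props
```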
For the Stanford Parser, you use Penn Treebank format; see Stanford's FAQ for the exact commands to use. The JavaDocs for the LexicalizedParser class also give the appropriate commands, in particular:
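Penn Treebank training data consists of bracketed parse trees; a hand-made illustration of the format (not from any real treebank):

```
( (S
    (NP (DT The) (NN tagger))
    (VP (VBZ works))
    (. .)) )
```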
I tried:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \
  -train trainFilesPath fileRange \
  -saveToSerializedFile serializedGrammarFilename
But I got this error:
Error: Could not find or load main class edu.stanford.nlp.parser.lexparser.LexicalizedParser
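That error means the JVM cannot find the class, which almost always means the parser jar is missing from the classpath. A hedged sketch of the fix, assuming the jar from the standard distribution sits in the current directory (the jar name varies by version), with the optional [-v] flag dropped since the brackets only mark it as optional:

```
java -mx1500m -cp "stanford-parser.jar" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -train trainFilesPath fileRange \
  -saveToSerializedFile serializedGrammarFilename
```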