How can I get a set of grammar rules from the Penn Treebank using Python and NLTK?
I'm fairly new to NLTK and Python. I've been creating sentence parses using the toy grammars given in the examples but I would like to know if it's possible to use a grammar learned from a portion of the Penn Treebank, say, as opposed to just writing my own or using the toy grammars? (I'm using Python 2.7 on Mac)
Many thanks
Answers (2)
It is possible to train a Chunker on the treebank_chunk or conll2000 corpora. You don't get a grammar out of it, but you do get a pickle-able object that can parse phrase chunks. See How to Train a NLTK Chunker, Chunk Extraction with NLTK, and NLTK Classified Based Chunker Accuracy.
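As a rough illustration of what training a chunker involves, here is a minimal sketch modeled on the unigram chunker pattern from the NLTK book. It assumes NLTK 3 on Python 3. The class name `UnigramChunker` and the two hand-written training sentences are stand-ins for illustration only; in practice you would train on something like `conll2000.chunked_sents('train.txt')` after downloading that corpus.

```python
from nltk.chunk.api import ChunkParserI
from nltk.chunk.util import conlltags2tree, tagstr2tree, tree2conlltags
from nltk.tag import UnigramTagger


class UnigramChunker(ChunkParserI):
    """Illustrative chunker: predicts an IOB chunk tag from each POS tag."""

    def __init__(self, train_sents):
        # Turn each chunk tree into a sequence of (pos_tag, chunk_tag) pairs.
        train_data = [
            [(pos, chunk) for _word, pos, chunk in tree2conlltags(sent)]
            for sent in train_sents
        ]
        self.tagger = UnigramTagger(train_data)

    def parse(self, tagged_sentence):
        # tagged_sentence is a list of (word, pos_tag) pairs.
        pos_tags = [pos for _word, pos in tagged_sentence]
        chunk_tags = [chunk for _pos, chunk in self.tagger.tag(pos_tags)]
        conlltags = [
            (word, pos, chunk)
            for (word, pos), chunk in zip(tagged_sentence, chunk_tags)
        ]
        return conlltags2tree(conlltags)


# Hand-written stand-ins for a chunked corpus (square brackets mark NP chunks).
train_sents = [
    tagstr2tree("[ the/DT dog/NN ] barked/VBD"),
    tagstr2tree("[ a/DT bird/NN ] sang/VBD"),
]

chunker = UnigramChunker(train_sents)
result = chunker.parse([("a", "DT"), ("cat", "NN"), ("slept", "VBD")])
print(result)
```

The result is a chunk tree rather than a full parse, which matches the point above: you get a reusable chunker object, not a grammar.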
If you want a grammar that precisely captures the Penn Treebank sample that comes with NLTK, you can do this, assuming you've downloaded the Treebank data for NLTK (see comment below):
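The code for this step did not survive on this page, so what follows is a sketch of the idea rather than the answer's original code. It assumes NLTK 3 on Python 3, where the grammar class is called `CFG` (NLTK 2 called it `ContextFreeGrammar`). The helper name `grammar_from_trees` is hypothetical, and two hand-written Treebank-style trees stand in for the corpus so the snippet runs without any downloads.

```python
from nltk.grammar import CFG, Nonterminal
from nltk.tree import Tree


def grammar_from_trees(trees, start="S"):
    """Collect the distinct productions of the given trees into one CFG."""
    productions = []
    seen = set()
    for tree in trees:
        for production in tree.productions():
            if production not in seen:
                seen.add(production)
                productions.append(production)
    return CFG(Nonterminal(start), productions)


# Hand-written stand-ins for parsed Treebank sentences.
sample_trees = [
    Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))"),
    Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD slept)))"),
]
sample_grammar = grammar_from_trees(sample_trees)

for production in sample_grammar.productions():
    print(production)

# With the Treebank sample installed (nltk.download('treebank')), the same
# helper could be applied to the corpus trees:
#     from nltk.corpus import treebank
#     tbank_grammar = grammar_from_trees(treebank.parsed_sents())
```

Deduplicating the productions keeps the grammar from growing with every repeated rule; the start symbol `S` is an assumption that suits these sample trees.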
This will probably not, however, give you something useful. Since NLTK only supports parsing with grammars with all the terminals specified, you will only be able to parse sentences containing words in the Treebank sample.
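To make the vocabulary limitation concrete, here is a small sketch, assuming NLTK 3, that uses `CFG.check_coverage` to test whether every token of a sentence appears as a terminal in the grammar. The toy grammar and the helper name `is_covered` are illustrative and not part of the original answer.

```python
from nltk.grammar import CFG

# A tiny grammar standing in for one read off Treebank trees.
toy_grammar = CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD
DT -> 'the'
NN -> 'dog'
VBD -> 'barked'
""")


def is_covered(grammar, tokens):
    """Return True if every token is a terminal of the grammar."""
    try:
        grammar.check_coverage(tokens)
        return True
    except ValueError:
        # check_coverage raises ValueError when some word is not covered.
        return False


known = is_covered(toy_grammar, "the dog barked".split())
unknown = is_covered(toy_grammar, "the wolf barked".split())
print(known, unknown)
```

NLTK's chart parsers run the same coverage check before parsing, so a sentence with a single unseen word is rejected outright.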
Also, because many phrases in the Treebank have a flat structure, this grammar will generalize very poorly to sentences that weren't included in training. This is why NLP applications that have tried to parse the Treebank have not simply read CFG rules off the Treebank trees. The closest technique to that would be Rens Bod's Data-Oriented Parsing approach, but it is much more sophisticated.
Finally, this will be so unbelievably slow it's useless. So if you want to see this approach in action on the grammar from a single sentence just to prove that it works, try the following code (after the imports above):
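The original snippet is missing from this page, so here is a self-contained sketch of the single-sentence experiment instead. It assumes NLTK 3 on Python 3 and uses a hand-written Treebank-style tree so that no corpus download is needed; with the corpus installed, `treebank.parsed_sents()[0]` could take the place of `sentence_tree`.

```python
from nltk.grammar import CFG, Nonterminal
from nltk.parse import EarleyChartParser
from nltk.tree import Tree

# A hand-written stand-in for one parsed Treebank sentence.
sentence_tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat))))"
)

# Build a grammar from this one tree, dropping repeated productions
# (NP -> DT NN occurs twice) while keeping their order.
productions = list(dict.fromkeys(sentence_tree.productions()))
mini_grammar = CFG(Nonterminal("S"), productions)

# Parse the sentence's own words with the grammar derived from it.
parser = EarleyChartParser(mini_grammar)
parses = list(parser.parse(sentence_tree.leaves()))

for parse in parses:
    print(parse)
```

Because the grammar comes from the very sentence being parsed, the parser is guaranteed to find an analysis, which shows that the approach works even though it does not scale.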