How do I get a set of grammar rules from the Penn Treebank using Python and NLTK?

Posted on 2024-11-29 18:45:37

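I'm fairly new to NLTK and Python. I've been creating sentence parses using the toy grammars given in the examples, but I'd like to know whether it's possible to use a grammar learned from a portion of the Penn Treebank, rather than writing my own or using the toy grammars. (I'm using Python 2.7 on a Mac.) Many thanks.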

泅人 2024-12-06 18:45:38

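You can train a chunker on the treebank_chunk or conll2000 corpora. You won't get a grammar out of it, but you do get a pickle-able object that can parse phrase chunks. See "How to Train a NLTK Chunker", "Chunk Extraction with NLTK", and "NLTK Classifier Based Chunker Accuracy".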
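As a rough illustration of what "training a chunker" can look like, here is a minimal sketch in the style of the unigram chunker from the NLTK book. It is not necessarily the approach taken in the articles named above. The class name `UnigramChunker` and the two hand-made training trees are assumptions made for this example so that it runs without any corpus download; with the data installed, something like `nltk.corpus.conll2000.chunked_sents('train.txt', chunk_types=['NP'])` could presumably be passed in instead.

```python
from nltk.chunk import ChunkParserI, conlltags2tree, tree2conlltags
from nltk.tag import UnigramTagger
from nltk.tree import Tree


class UnigramChunker(ChunkParserI):
    """Hypothetical chunker that predicts IOB chunk tags from POS tags alone."""

    def __init__(self, train_sents):
        # Turn each chunk tree into (pos_tag, chunk_tag) training pairs.
        train_data = [
            [(pos, chunk) for _word, pos, chunk in tree2conlltags(sent)]
            for sent in train_sents
        ]
        self.tagger = UnigramTagger(train_data)

    def parse(self, tagged_sent):
        # Tag the POS sequence with chunk tags, then rebuild a chunk tree.
        pos_tags = [pos for _word, pos in tagged_sent]
        chunk_tags = [chunk for _pos, chunk in self.tagger.tag(pos_tags)]
        conll = [
            (word, pos, chunk)
            for (word, pos), chunk in zip(tagged_sent, chunk_tags)
        ]
        return conlltags2tree(conll)


# Two hand-made chunk trees stand in for a real corpus (illustration only).
toy_train = [
    Tree("S", [Tree("NP", [("the", "DT"), ("dog", "NN")]), ("barked", "VBD")]),
    Tree("S", [Tree("NP", [("a", "DT"), ("cat", "NN")]), ("slept", "VBD")]),
]

chunker = UnigramChunker(toy_train)
chunked = chunker.parse([("the", "DT"), ("bird", "NN"), ("sang", "VBD")])
print(chunked)
```

Because this toy chunker only looks at POS tags, it can chunk words it has never seen, which is the practical difference from a grammar extracted verbatim from the Treebank.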

耶耶耶 2024-12-06 18:45:37

If you want a grammar that precisely captures the Penn Treebank sample that comes with NLTK, you can build one like this, assuming you've downloaded the Treebank data for NLTK (for example with nltk.download('treebank')):

import nltk
from nltk.corpus import treebank
# Note: NLTK 3 renamed ContextFreeGrammar to CFG.
from nltk.grammar import CFG, Nonterminal

# Collect every distinct production used in the parsed Treebank sample.
tbank_productions = set(production for sent in treebank.parsed_sents()
                        for production in sent.productions())
tbank_grammar = CFG(Nonterminal('S'), list(tbank_productions))

However, this probably won't give you anything useful. NLTK only supports parsing with grammars that specify every terminal, so you will only be able to parse sentences made up entirely of words that appear in the Treebank sample.
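To make the terminal restriction concrete, here is a small sketch that does not need the Treebank data. The single hand-written tree is an assumption standing in for a Treebank tree, and `try_parse` is a helper name invented for this example.

```python
from nltk.grammar import CFG, Nonterminal
from nltk.parse.chart import ChartParser
from nltk.tree import Tree

# A hand-written stand-in for one Treebank tree (illustration only).
tree = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
grammar = CFG(Nonterminal("S"), tree.productions())
parser = ChartParser(grammar)


def try_parse(tokens):
    """Return the list of parses, or the coverage error message as a string."""
    try:
        return list(parser.parse(tokens))
    except ValueError as err:
        # Raised when a token is not a terminal of the grammar.
        return str(err)


seen = try_parse(["the", "dog", "barked"])    # every word is in the grammar
unseen = try_parse(["the", "cat", "barked"])  # 'cat' never appeared in the tree

print(seen)
print(unseen)
```

The first call should yield a parse, while the second is rejected before parsing starts because 'cat' is not a terminal in the extracted grammar.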

Also, because many phrases in the Treebank have a flat structure, this grammar will generalize very poorly to sentences that weren't included in training. This is why NLP applications that try to parse the Treebank have not learned CFG rules directly from it. The closest technique would be Rens Bod's Data-Oriented Parsing (DOP) approach, but it is much more sophisticated.
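The effect of flat structure can be sketched with two hand-written toy trees (assumptions for illustration, not real Treebank trees). The extracted grammar only knows the flat rule NP -> DT JJ NN, so a noun phrase with two adjectives gets no parse even though every word is known. A hand-written recursive grammar, also hypothetical, is included for contrast.

```python
from nltk.grammar import CFG, Nonterminal
from nltk.parse.chart import ChartParser
from nltk.tree import Tree

# Toy trees with a flat NP, mimicking the Treebank's annotation style.
toy_trees = [
    Tree.fromstring("(S (NP (DT the) (JJ big) (NN dog)) (VP (VBD barked)))"),
    Tree.fromstring("(S (NP (DT the) (JJ red) (NN dog)) (VP (VBD barked)))"),
]
productions = list({p for t in toy_trees for p in t.productions()})
flat_parser = ChartParser(CFG(Nonterminal("S"), productions))

# A hand-written recursive alternative, for contrast only.
recursive_parser = ChartParser(CFG.fromstring("""
    S -> NP VP
    NP -> DT NOM
    NOM -> JJ NOM | NN
    VP -> VBD
    DT -> 'the'
    JJ -> 'big' | 'red'
    NN -> 'dog'
    VBD -> 'barked'
"""))

seen_sentence = "the big dog barked".split()
new_sentence = "the big red dog barked".split()  # known words, unseen NP shape

n_flat_seen = len(list(flat_parser.parse(seen_sentence)))
n_flat_new = len(list(flat_parser.parse(new_sentence)))
n_recursive_new = len(list(recursive_parser.parse(new_sentence)))

print(n_flat_seen, n_flat_new, n_recursive_new)
```

Every distinct sequence of children seen in the Treebank becomes its own rule, so the extracted grammar ends up with a very large number of flat rules and still misses shapes it never observed.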

Finally, parsing with the full grammar is so slow that it is useless in practice. So if you want to see this approach in action on the grammar from a single sentence, just to prove that it works, try the following code (after the imports above):

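# Build a grammar from the productions of the first Treebank sentence only.
# (nltk.CFG is the NLTK 3 name for the old ContextFreeGrammar class.)
mini_grammar = nltk.CFG(Nonterminal('S'),
                        treebank.parsed_sents()[0].productions())
parser = nltk.parse.EarleyChartParser(mini_grammar)
# In NLTK 3, parse() returns an iterator over trees.
for tree in parser.parse(treebank.sents()[0]):
    print(tree)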
