Combining tokenizers into grammars and parsers with NLTK



I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar.

My goal is to build a grammar for a particular text corpus.

(Initial question: Should I even try to start a grammar from scratch or should I start with a predefined grammar? If I should start with another grammar, which is a good one to start with for English?)

Suppose I have the following simple grammar:

import nltk

simple_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
Det -> 'a' | 'A'
N -> 'car' | 'door'
V -> 'has'
P -> 'in' | 'for'
""")

This grammar can parse a very simple sentence, such as:

parser = nltk.ChartParser(simple_grammar)
trees = list(parser.parse("A car has a door".split()))

Now I want to extend this grammar to handle sentences with other nouns and verbs. How do I add those nouns and verbs to my grammar without manually defining them in the grammar?

For example, suppose I want to be able to parse the sentence "A car has wheels". I know that the supplied tokenizers can magically figure out which words are verbs/nouns, etc. How can I use the output of the tokenizer to tell the grammar that "wheels" is a noun?

3 Answers

短暂陪伴 2024-10-22 03:04:03


You could run a POS tagger over your text and then adapt your grammar to work on POS tags instead of words.

> import nltk
> text = nltk.word_tokenize("A car has a door")
['A', 'car', 'has', 'a', 'door']

> tagged_text = nltk.pos_tag(text)
[('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('door', 'NN')]

> pos_tags = [pos for (token, pos) in tagged_text]
['DT', 'NN', 'VBZ', 'DT', 'NN']

> simple_grammar = nltk.CFG.fromstring("""
  S -> NP VP
  PP -> P NP
  NP -> Det N | Det N PP
  VP -> V NP | VP PP
  Det -> 'DT'
  N -> 'NN'
  V -> 'VBZ'
  P -> 'IN'
  """)

> parser = nltk.ChartParser(simple_grammar)
> trees = list(parser.parse(pos_tags))
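
(Note that the Penn Treebank tag for prepositions is 'IN', not 'PP'.) One caveat with this approach: the leaves of the resulting trees are POS tags, not the original words. A minimal sketch of putting the words back afterwards, assuming the tags and tokens line up one-to-one as produced above:

# Replace each POS-tag leaf with the corresponding original word
# from tagged_text, then print the re-lexicalized tree.
for tree in trees:
    for i, leaf_pos in enumerate(tree.treepositions('leaves')):
        tree[leaf_pos] = tagged_text[i][0]
    print(tree)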
哆啦不做梦 2024-10-22 03:04:03


I know this is a year later, but I wanted to add some thoughts.

For a project I'm working on, I took a lot of different sentences and tagged them with parts of speech. From there I did as StompChicken suggested, pulling the tags from the (word, tag) tuples and using those tags as the "terminals" (the bottom nodes of the tree when we create a completely tagged sentence).

Ultimately this didn't suit my desire to mark head nouns in noun phrases, since I couldn't pull the head-noun "word" into the grammar; the grammar only had the tags.

So what I did instead was use the set of (word, tag) tuples to create a dictionary of tags, with all the words that carry a given tag as the values for that tag. Then I print this dictionary to the screen / to a grammar.cfg (context-free grammar) file.

The format I use to print it works perfectly with setting up a parser by loading a grammar file (parser = nltk.load_parser('grammar.cfg')). One of the lines it generates looks like this:

VBG -> "fencing" | "bonging" | "amounting" | "living" ... over 30 more words...

So now my grammar has the actual words as terminals and assigns the same tags that nltk.pos_tag does.

Hope this helps anyone else wanting to automate tagging a large corpus and still have the actual words as terminals in their grammar.

import nltk
from collections import defaultdict

tag_dict = defaultdict(list)

# ... (looping through sentences; `tokens` holds one tokenized sentence) ...

    # Tag the tokenized sentence
    tagged_sent = nltk.pos_tag(tokens)

    # Put tags and words into the dictionary, skipping duplicates
    for word, tag in tagged_sent:
        if word not in tag_dict[tag]:
            tag_dict[tag].append(word)

# Print one lexical production per tag, e.g.  NN -> "car" | "door"
for tag, words in tag_dict.items():
    print(tag, "->", " | ".join('"%s"' % word for word in words))
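
To tie this together without going through a file, here is a minimal sketch (the syntactic rules and the test sentence are placeholders, not part of the original answer) that combines hand-written syntactic productions with the lexical productions generated above:

import nltk

# Hand-written syntactic rules; the pre-terminals are Penn Treebank
# tags, matching the left-hand sides produced by the tag_dict loop.
syntactic_rules = """
S -> NP VP
NP -> DT NN
VP -> VBZ NP
"""

# Assemble the lexical rules in memory, in the same form the loop
# prints. (Tags containing characters like '$', e.g. PRP$, would need
# renaming before this works, since they aren't valid nonterminal names.)
lexical_rules = "\n".join(
    '%s -> %s' % (tag, " | ".join('"%s"' % w for w in words))
    for tag, words in tag_dict.items()
)

grammar = nltk.CFG.fromstring(syntactic_rules + lexical_rules)
parser = nltk.ChartParser(grammar)

# Works only if every word below was seen (and tagged DT/NN/VBZ)
# somewhere in the tagged corpus.
for tree in parser.parse(nltk.word_tokenize("A car has a door")):
    print(tree)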
挖鼻大婶 2024-10-22 03:04:03


Parsing is a tricky problem, and a lot of things can go wrong!

You want (at least) three components here: a tokenizer, a tagger, and finally the parser.

First you need to tokenize the running text into a list of tokens. This can be as easy as splitting the input string around whitespace, but if you are parsing more general text you will also need to handle numbers and punctuation, which is non-trivial. For instance, sentence-ending periods are often not regarded as part of the word they are attached to, while periods marking an abbreviation often are.
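
For example, NLTK's word_tokenize handles this distinction reasonably well (the expected outputs below assume the standard 'punkt' models are installed):

import nltk

# Naive whitespace splitting leaves punctuation glued to words:
print("It was loud.".split())
# ['It', 'was', 'loud.']

# word_tokenize splits off the sentence-final period but keeps
# abbreviation periods attached to their word:
print(nltk.word_tokenize("Mr. Smith arrived."))
# expected: ['Mr.', 'Smith', 'arrived', '.']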

When you have a list of input tokens, you can use a tagger to try to determine the POS of each word, and use it to disambiguate the input token sequence. This has two main advantages: first, it speeds up parsing, since we no longer have to consider the alternative hypotheses licensed by ambiguous words; the POS tagger has already done this. Second, it improves unknown-word handling, i.e. words not in your grammar, by assigning those words a tag as well (hopefully the right one). Combining a parser and a tagger in this way is commonplace.
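
A quick illustration with NLTK's default tagger (the output shown is what I'd expect from the standard models; requires the 'punkt' and 'averaged_perceptron_tagger' data packages):

import nltk

# The same surface form receives different tags depending on context,
# so the parser never has to entertain "saw" as a noun in verb position.
print(nltk.pos_tag(nltk.word_tokenize("I saw the saw")))
# expected: [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('saw', 'NN')]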

The POS tags will then make up the pre-terminals in your grammar. The pre-terminals are the left-hand sides of productions with only terminals as their right-hand side; i.e. in N -> "house" and V -> "jump", N and V are pre-terminals. It is fairly common for a grammar to contain only syntactic productions (non-terminals on both sides) and lexical productions (one non-terminal going to one terminal). This makes linguistic sense most of the time, and most CFG parsers require the grammar to be in this form. However, one could represent any CFG this way by creating "dummy productions" for any terminals that appear in right-hand sides alongside non-terminals.
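
As a concrete illustration of that last point, a production that mixes a terminal with non-terminals, such as VP -> 'give' NP NP, can be rewritten using a made-up dummy pre-terminal:

import nltk

# Mixed production: a terminal alongside non-terminals in one RHS.
mixed = nltk.CFG.fromstring("""
S -> VP
VP -> 'give' NP NP
NP -> 'him' | 'it'
""")

# Equivalent grammar where the dummy pre-terminal V_give absorbs the
# terminal, so every production is purely syntactic or purely lexical.
normalized = nltk.CFG.fromstring("""
S -> VP
VP -> V_give NP NP
V_give -> 'give'
NP -> 'him' | 'it'
""")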

It could be necessary to have some sort of mapping between POS tags and pre-terminals if you want to make more (or less) fine-grained tag distinctions in your grammar than what your tagger outputs. You can then initialize the chart with the results from the tagger, i.e. passive items of the appropriate category spanning each input token. Sadly I do not know NLTK, but I'm sure there is a simple way to do this. When the chart is seeded, parsing can continue as normal, and any parse trees can be extracted (also including the words) in the regular fashion.
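
In NLTK one can approximate the chart seeding by simply rewriting the tag sequence before it reaches the parser; a sketch with a made-up coarsening map:

import nltk

# Hypothetical mapping from fine-grained Penn Treebank tags to the
# coarser pre-terminals a hand-written grammar might use.
TAG_TO_PRETERMINAL = {
    'NN': 'N', 'NNS': 'N', 'NNP': 'N',   # collapse all nouns to N
    'VB': 'V', 'VBZ': 'V', 'VBD': 'V',   # collapse these verbs to V
    'DT': 'Det',
    'IN': 'P',
}

def to_preterminals(sentence):
    """Tag a sentence, then map each POS tag onto a grammar pre-terminal."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [TAG_TO_PRETERMINAL.get(tag, tag) for _, tag in tagged]

# ['Det', 'N', 'V', 'Det', 'N'] -- ready for a grammar whose terminals
# are these coarse category names.
print(to_preterminals("A car has a door"))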

However, in most practical applications you will find that the parser can return several different analyses, since natural language is highly ambiguous. I don't know what kind of text corpus you are trying to parse, but if it's anything like natural language you will probably have to construct some sort of parse-selection model. This requires a treebank: a collection of parse trees, ranging in size from a couple of hundred to several thousand parses, all depending on your grammar and how accurate you need the results to be. Given this treebank one can automagically infer a PCFG corresponding to it. The PCFG can then be used as a simple model for ranking the parse trees.
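
NLTK does ship this inference as nltk.induce_pcfg; a rough sketch using the bundled Penn Treebank sample (you may need nltk.download('treebank') first, and the parse itself can take a little while on a grammar this size):

import nltk
from nltk import Nonterminal, induce_pcfg
from nltk.corpus import treebank

# Collect productions from a few hundred treebank parse trees.
productions = []
for tree in treebank.parsed_sents()[:500]:
    productions += tree.productions()

# Infer a PCFG whose rule probabilities come from treebank counts.
grammar = induce_pcfg(Nonterminal('S'), productions)

# ViterbiParser uses those probabilities to return the single best
# parse. Note: every input word must occur in the sampled trees, or
# the parser raises a coverage error, so we reparse a corpus sentence.
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse(treebank.sents()[0]):
    print(tree)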

All of this is a lot of work to do yourself. What are you using the parse results for? Have you looked at other resources in NLTK, or at other packages such as the StanfordParser or the BerkeleyParser?
