Python NLTK: How can I tag sentences with the simplified set of part-of-speech tags?

Posted on 2024-11-03 22:51:05

Chapter 5 of the Python NLTK book gives this example of tagging the words in a sentence:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

nltk.pos_tag calls the default tagger, which uses a full set of tags. Later in the chapter a simplified set of tags is introduced.

How can I tag sentences with this simplified set of part-of-speech tags?

Also, have I understood the tagger correctly? That is, can I change the tag set that the tagger uses, should I map the tags it returns onto the simplified set, or should I train a new tagger on a new, simply-tagged corpus?

假装爱人 2024-11-10 22:51:05

Updated, in case anyone runs into the same problem: NLTK has since moved to a "universal" tagset (source here). Once you've tagged your text, use map_tag to map each tag onto the simplified set.

import nltk
from nltk.tag import pos_tag, map_tag

# Requires the NLTK tokenizer, tagger and universal_tagset data (see nltk.download()).
text = nltk.word_tokenize("And now for something completely different")
pos_tagged = pos_tag(text)  # Penn Treebank tags, e.g. ('And', 'CC')

# Map each Penn Treebank tag onto the coarser universal tagset.
simplified_tags = [(word, map_tag('en-ptb', 'universal', tag)) for word, tag in pos_tagged]
print(simplified_tags)
# [('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]
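
To make the "tag first, then map" idea concrete without needing NLTK or its data files, here is a minimal sketch of a per-tag lookup. The table `PTB_TO_UNIVERSAL` and the helper `simplify_tags` are hypothetical names written for this illustration, and the table is a small hand-picked subset rather than NLTK's own mapping, which map_tag loads from its data files. The `"X"` fallback (the universal tagset's "other" category) is likewise a choice made for this sketch.

```python
# Hand-written, partial Penn Treebank -> universal mapping, for illustration only.
# This is NOT NLTK's own table; it just covers a few common tags.
PTB_TO_UNIVERSAL = {
    "CC": "CONJ",
    "RB": "ADV",
    "IN": "ADP",
    "NN": "NOUN",
    "NNS": "NOUN",
    "JJ": "ADJ",
    "DT": "DET",
    "VB": "VERB",
    "VBZ": "VERB",
    "VBG": "VERB",
}


def simplify_tags(tagged_tokens, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Map each (word, fine_tag) pair to (word, coarse_tag).

    Tags missing from the mapping fall back to `unknown`.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]


# Output of nltk.pos_tag from the question, pasted in as a literal.
tagged = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"),
    ("something", "NN"), ("completely", "RB"), ("different", "JJ"),
]

simplified = simplify_tags(tagged)
print(simplified)
```

Because the tagger itself is untouched, this approach changes only the labels you see; the underlying predictions are still made with the full tagset.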

放肆 2024-11-10 22:51:05

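To simplify the tags from the default tagger in NLTK 2.x, you can use nltk.tag.simplify.simplify_wsj_tag, like so (the nltk.tag.simplify module was removed in NLTK 3, where the universal tagset replaces it):

>>> import nltk
>>> from nltk.tag.simplify import simplify_wsj_tag  # NLTK 2.x only
>>> tokens = nltk.word_tokenize("And now for something completely different")
>>> tagged_sent = nltk.pos_tag(tokens)
>>> simplified = [(word, simplify_wsj_tag(tag)) for word, tag in tagged_sent]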

白昼 2024-11-10 22:51:05

You can simply pass tagset='universal' as a keyword argument to the pos_tag function.

from nltk import word_tokenize, pos_tag

text = word_tokenize("Here is a simple way of doing this")
tags = pos_tag(text, tagset='universal')
print(tags)
# [('Here', 'ADV'), ('is', 'VERB'), ('a', 'DET'), ('simple', 'ADJ'), ('way', 'NOUN'), ('of', 'ADP'), ('doing', 'VERB'), ('this', 'DET')]
