Python NLTK: How to tag sentences with the simplified set of part-of-speech tags?
Chapter 5 of the Python NLTK book gives this example of tagging words in a sentence:
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
nltk.pos_tag calls the default tagger, which uses a full set of tags. Later in the chapter a simplified set of tags is introduced.
How can I tag sentences with this simplified set of part-of-speech tags?
Also, have I understood the tagger correctly? That is, can I change the tag set that the tagger uses, should I map the tags it returns onto the simplified set, or should I create a new tagger from a new, simply-tagged corpus?
Comments (3)
Updated, in case anyone runs across the same problem. NLTK has since upgraded to a "universal" tagset, source here. Once you've tagged your text, use map_tag to simplify the tags.
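A minimal sketch of that map_tag step, assuming NLTK 3 with the universal_tagset resource downloaded and Penn Treebank tags (en-ptb) as the source tagset. The helper name simplify_tags and its mapper parameter are made up for illustration; only nltk.tag.map_tag comes from NLTK:

```python
def simplify_tags(tagged_tokens, source="en-ptb", target="universal", mapper=None):
    """Return (word, tag) pairs with each tag mapped from source to target tagset.

    simplify_tags and its mapper parameter are hypothetical helpers for this sketch.
    """
    if mapper is None:
        # Imported lazily; map_tag needs nltk.download("universal_tagset") once.
        from nltk.tag import map_tag
        mapper = map_tag
    # Words are left untouched; only the tag of each pair is translated.
    return [(word, mapper(source, target, tag)) for word, tag in tagged_tokens]
```

Calling simplify_tags(nltk.pos_tag(text)) on the tagged sentence from the question should then give back the same words with coarse tags in place of the Penn Treebank ones.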
To simplify tags from the default tagger, you can use nltk.tag.simplify.simplify_wsj_tag on each tag that it returns.
You can simply pass tagset='universal' as an argument to the pos_tag function.
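For reference, a small sketch of that call, assuming NLTK 3 with the tokenizer and default tagger models already downloaded. The wrapper name tag_universal is hypothetical; word_tokenize and pos_tag with its tagset argument are the NLTK calls:

```python
def tag_universal(sentence):
    """Tokenize a sentence and tag it with universal part-of-speech tags."""
    # Imported inside the function so the NLTK dependency is explicit here.
    import nltk

    tokens = nltk.word_tokenize(sentence)
    # tagset="universal" asks pos_tag to map its default tags to the universal set.
    return nltk.pos_tag(tokens, tagset="universal")
```

tag_universal("And now for something completely different") returns one (word, tag) pair per token, with coarse tags such as NOUN or ADV rather than NN or RB.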