In NLTK pos_tag, why is "hello" classified as a noun?
I've tried:
import nltk

text = nltk.word_tokenize("hello, my name is John")
words = nltk.pos_tag(text)
for word, tag in words:
    print("%s = %s" % (word, tag))
And I got:
hello = NN
, = ,
my = PRP$
name = NN
is = VBZ
John = NNP
According to the Penn Treebank tagset, hello is definitely an interjection and is consistently tagged UH. The problem you're running into is that the taggers that NLTK ships with were most likely trained on the part of the Wall Street Journal section of the Penn Treebank that is available for free, which unfortunately for you contains zero occurrences of the word hello and only three words tagged UH (interjection). If you want to tag spoken text, you'll need to train your tagger on the whole Penn Treebank, which includes something like 3 million words of spoken English.
By the way, the NLTK taggers won't always call hello a noun -- try tagging "don't hello me!" or "he said hello".
NLTK uses its own tagger to tag parts of speech.
But the accuracy will vary from text to text, because the tagger was trained on a corpus provided by NLTK itself, and that corpus could be about anything.
If the corpus is not similar to your text, the tagger will tend to mis-tag it, because the context and style are very different.
You can train your own tagger if you have the time to do it.
Computers are not human; they just do what we tell them to do. So to get the best results, you have to teach them properly.
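To make the "train your own tagger" suggestion concrete, here is a minimal sketch. It assumes NLTK's UnigramTagger with a DefaultTagger backoff, which is a far simpler model than whatever pos_tag uses, and the three training sentences are invented for illustration rather than taken from any real corpus. It only shows the mechanism: a word the tagger has seen tagged UH gets UH, and a word it has never seen falls back to NN.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical hand-tagged sentences (Penn Treebank tags), made up for this
# example; a usable tagger needs a real corpus that resembles your text.
train_sents = [
    [("hello", "UH"), (",", ","), ("my", "PRP$"), ("name", "NN"),
     ("is", "VBZ"), ("John", "NNP")],
    [("oh", "UH"), (",", ","), ("hello", "UH"), ("there", "RB")],
    [("he", "PRP"), ("said", "VBD"), ("hello", "UH")],
]

# Words that never appear in the training data are tagged NN by the backoff.
backoff = DefaultTagger("NN")
tagger = UnigramTagger(train_sents, backoff=backoff)

tagged = tagger.tag(["hello", ",", "my", "name", "is", "Mary"])
for word, tag in tagged:
    print("%s = %s" % (word, tag))
```

With this toy data, "hello" comes out as UH because the training sentences tag it that way, while "Mary", which never occurs in them, is tagged NN by the backoff. That is the same kind of failure the question describes, in miniature.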
Look in any dictionary and you will find hello defined as a "noun" (e.g. Longman). It's often described as an "exclamation" or "interjection" but the tag "noun" is not incorrect.