Why does NLTK pos_tag classify "hello" as a noun?

Posted on 2024-11-06 00:51:04

I've tried

import nltk

text = nltk.word_tokenize("hello, my name is John")
words = nltk.pos_tag(text)

# Each item is a (word, tag) pair
for word, tag in words:
    print("%s = %s" % (word, tag))

And I got:

hello = NN
, = ,
my = PRP$
name = NN
is = VBZ
John = NNP

Comments (3)

一向肩并 2024-11-13 00:51:04

According to the Penn Treebank tagset, hello is definitely an interjection and is consistently tagged UH. The problem you're running into is that the taggers that NLTK ships with were most likely trained on the part of the Wall Street Journal section of the Penn Treebank that is available for free, which unfortunately for you contains zero occurrences of the word hello and only three words tagged UH (interjection). If you want to tag spoken text, you'll need to train your tagger on the whole Penn Treebank, which includes something like 3 million words of spoken English.

By the way, the NLTK taggers won't always call hello a noun -- try tagging "don't hello me!" or "he said hello".
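To illustrate why zero occurrences in the training data matter, here is a toy sketch in plain Python. It is not NLTK's actual tagger: it is a simple most-frequent-tag lookup with a default tag for unseen words, and the training sentences are made up for the example. A real tagger also looks at the surrounding context, which is why the same word can receive different tags in different sentences.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "newswire" training data as (word, tag) pairs.
# "hello" never appears, mirroring the situation described above.
TRAIN = [
    [("however", "RB"), (",", ","), ("the", "DT"), ("market", "NN"), ("rose", "VBD")],
    [("my", "PRP$"), ("name", "NN"), ("is", "VBZ"), ("John", "NNP")],
    [("the", "DT"), ("name", "NN"), ("is", "VBZ"), ("new", "JJ")],
]


def train(sentences):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default="NN"):
    """Tag known words from the model; unseen words get the default tag."""
    return [(tok, model.get(tok, default)) for tok in tokens]


model = train(TRAIN)
result = tag(["hello", ",", "my", "name", "is", "John"], model)
for word, t in result:
    print("%s = %s" % (word, t))
```

Because "hello" was never seen during training, the toy tagger has nothing to go on and falls back to its default tag, NN.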

叹倦 2024-11-13 00:51:04

NLTK uses its own tagger to tag parts of speech.

But its accuracy varies from text to text, because the tagger was trained on a corpus that ships with NLTK, and that corpus could be about anything.

If the corpus is not similar to your text, the tagger is likely to mis-tag it, because the context and style are very different.

If you have the time, you can train your own tagger.

Computers are not human; they just do what we tell them to do. So to get the best results, you have to teach them properly.
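As a rough sketch of what training your own tagger can look like, the example below uses NLTK's UnigramTagger with a DefaultTagger as backoff. The two hand-labeled sentences are hypothetical stand-ins for a real conversational corpus, and the tokens are split by hand so that no extra NLTK data has to be downloaded. A serious tagger would need far more training data and would normally use context as well.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical hand-labeled conversational sentences (Penn Treebank tags).
spoken_train = [
    [("hello", "UH"), (",", ","), ("my", "PRP$"), ("name", "NN"),
     ("is", "VBZ"), ("Mary", "NNP")],
    [("oh", "UH"), (",", ","), ("hello", "UH"), ("there", "RB")],
]

# Words the unigram model has never seen fall back to NN.
backoff = DefaultTagger("NN")
tagger = UnigramTagger(spoken_train, backoff=backoff)

# Pre-split tokens for the sentence from the question.
tokens = ["hello", ",", "my", "name", "is", "John"]
tagged = tagger.tag(tokens)
for word, tag in tagged:
    print("%s = %s" % (word, tag))
```

With "hello" present in the training data as UH, this tagger labels it as an interjection. "John" does not occur in the tiny training set, so it falls back to NN, which shows how much the result depends on what the tagger was trained on.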

空城之時有危險 2024-11-13 00:51:04

Look in any dictionary and you will find hello defined as a "noun" (e.g. Longman). It's often described as an "exclamation" or "interjection" but the tag "noun" is not incorrect.
