Why doesn't NLTK's WordNet lemmatizer lemmatize adverbs and adjectives?
As I understand it, we can do a better job of lemmatization if we identify the corresponding PoS tag for each token and then lemmatize with arguments covering not only verb and noun forms but also adjective and adverb forms.
So I have the lines of code below, which specify all four of the above types so that I can get back the root forms of 'absolutely' and 'lovely'. However, I still get the same words back for these.
Three questions here:
- Is there a way I can address this issue while still using the same library?
- Is there another library or function that can do a better job of lemmatization?
- Is this one of the limitations of NLTK's WordNet lemmatization, namely that it cannot perfectly lemmatize all types of words?
Appreciate it in advance.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

example = ['absolutely', 'lovely']
print(nltk.pos_tag(example))

def get_pos_tags(word):
    # Map the first letter of the Penn Treebank tag to a WordNet POS constant.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,   # adjective
                "N": wordnet.NOUN,  # noun
                "V": wordnet.VERB,  # verb
                "R": wordnet.ADV}   # adverb
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_text(text):
    return [WordNetLemmatizer().lemmatize(w, get_pos_tags(w)) for w in text]

final_output = lemmatize_text(example)
print(final_output)
For the words lovely and absolutely, the lemma is the same as the word itself; WordNet already treats them as base forms. Here are a few close words you can try in NLTK.
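A minimal sketch of what that looks like (the exact outputs depend on your installed WordNet data, so treat the commented results as expected rather than guaranteed):

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
# Words whose base form differs in WordNet do get reduced:
print(lemmatizer.lemmatize('better', wordnet.ADJ))      # -> 'good'
print(lemmatizer.lemmatize('running', wordnet.VERB))    # -> 'run'
print(lemmatizer.lemmatize('cars', wordnet.NOUN))       # -> 'car'
# 'lovely' and 'absolutely' are already lemmas in WordNet, so they come back unchanged:
print(lemmatizer.lemmatize('lovely', wordnet.ADJ))      # -> 'lovely'
print(lemmatizer.lemmatize('absolutely', wordnet.ADV))  # -> 'absolutely'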
Be aware that to get the correct lemma you need the correct part-of-speech (pos) tag, and to get the correct pos tag you need to parse a well-formed sentence containing the word, so the tagger has context. Without this, you will often get the wrong pos tag for the word.
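A quick way to see this (a sketch; the tagger is statistical, so the exact tags may vary on your setup):

import nltk

# A single word in isolation gives the tagger no context, so the tag is often wrong:
print(nltk.pos_tag(['lovely']))
# Inside a sentence there is context, and 'lovely' is far more likely to come out as JJ (adjective):
print(nltk.pos_tag('What a lovely day it is'.split()))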
In general, NLTK is fairly poor at pos tagging and at lemmatization. It's an old, rule-based library and it doesn't use more modern techniques. I would generally not recommend using NLTK.
Spacy is probably the most popular NLP system and it will do pos tagging and lemmatization (among other things) all in the same step. Unfortunately Spacy's lemmatizer uses the same basic design as NLTK and while its performance is better, it's still not the best.
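For example, something along these lines tags and lemmatizes in one pass (this assumes the small English pipeline en_core_web_sm has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')                  # small English pipeline, installed separately
doc = nlp('That was an absolutely lovely evening.')
for token in doc:
    # Each token carries its POS tag and lemma after a single call to nlp().
    print(token.text, token.pos_, token.lemma_)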
Lemminflect gives the best overall performance but it's only a lemma/inflection lookup. It doesn't include a pos tagger so you still need to get the tag from somewhere. Lemminflect also acts as a plug-in for spacy and using the two together will give you the best performance. Lemminflect's homepage shows how to do this along with some stats on performance compared to NLTK and Spacy.
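Roughly, based on Lemminflect's documented API, you can either look lemmas up directly (supplying the universal POS tag yourself) or use it as a spaCy extension, where importing lemminflect adds a ._.lemma() method to tokens; treat the commented output as illustrative:

import spacy
import lemminflect

# Direct lookup: the caller provides the POS tag.
print(lemminflect.getLemma('watches', upos='VERB'))   # expected: ('watch',)

# As a spaCy plug-in: importing lemminflect registers the ._.lemma() token extension.
nlp = spacy.load('en_core_web_sm')
doc = nlp('That was an absolutely lovely evening.')
print([token._.lemma() for token in doc])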
However, remember that you won't get the right lemmas without the right pos tag, and for Spacy, or any tagger, to get that right, the word needs to be in a full sentence.