Why is my word lemmatization not working properly?



Hi stackoverflow community!
Long-time reader but first-time poster. I'm currently trying my hand at NLP, and after reading a few forum posts touching on this topic I still can't seem to get the lemmatizer to work properly (function pasted below). Comparing my original text vs. the preprocessed text, all the cleaning steps work as expected except the lemmatization. I've even tried specifying the part of speech 'v' so the word isn't treated as a noun by default and the verb's base form is returned (ex: turned -> turn, are -> be, reading -> read)... however this doesn't seem to be working.

Appreciate another set of eyes and feedback - thanks!
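
For reference, the single-word behaviour I'm expecting from the lemmatizer looks like this (a minimal check, assuming the WordNet corpus has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# expected base forms when the part of speech is given as a verb ('v')
print(wnl.lemmatize('turned', 'v'))   # turn
print(wnl.lemmatize('are', 'v'))      # be
print(wnl.lemmatize('reading', 'v'))  # read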

# key imports

import pandas as pd
import numpy as np
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
import contractions


# cleaning functions

def to_lower(text):
    '''
    Convert text to lowercase
    '''
    return text.lower()

def remove_punct(text):
    '''
    Strip punctuation characters from text
    '''
    return ''.join(c for c in text if c not in punctuation)

def remove_stopwords(text):
    '''
    Removes stop words which don't have meaning (ex: is, the, a, etc.)
    '''
    additional_stopwords = ['app']

    stop_words = set(stopwords.words('english')) - set(['not','out','in']) 
    stop_words = stop_words.union(additional_stopwords)
    return ' '.join([w for w in nltk.word_tokenize(text) if w not in stop_words])

def fix_contractions(text):
    '''
    Expands contractions
    '''
    return contractions.fix(text)



# preprocessing pipeline

def preprocess(text):
    # convert to lower case
    lower_text = to_lower(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []      
            
    for each_sent in sentence_tokens:
        # fix contractions
        clean_text = fix_contractions(each_sent)
        # remove punctuation
        clean_text = remove_punct(clean_text)
        # filter out stop words
        clean_text = remove_stopwords(clean_text)
        # get base form of word
        wnl = WordNetLemmatizer()
        for part_of_speech in ['v']:
            lemmatized_word = wnl.lemmatize(clean_text, part_of_speech)
        # split the sentence into word tokens
        word_tokens = word_tokenize(lemmatized_word)
        for i in word_tokens:
            word_list.append(i)                     
    return word_list

# lemmatize not properly working to get base form of word
# ex: 'turned' still remains 'turned' without returning base form 'turn'
# ex: 'running' still remains 'running' without getting base form 'run'



sample_data = posts_with_text['post_text'].head(5)
print(sample_data)
sample_data.apply(preprocess)

Answer from 荭秂 (2025-02-02 09:18:15):


Lemmatization is very similar to stemming in that it reduces a set of inflected words down to a common word. The difference is that lemmatization reduces inflections down to their real root word, which is called a lemma. If we take the words 'amaze', 'amazed', 'amazing', the lemma of all of these is 'amaze', whereas a stemmer would usually return the truncated form 'amaz'. Generally, lemmatization is seen as more advanced than stemming.

words = ['amaze', 'amazed', 'amazing']
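
For comparison, here is the same list run through NLTK's PorterStemmer (a minimal sketch to illustrate the truncated 'amaz' form mentioned above; the choice of PorterStemmer is an assumption, any stemmer would do):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['amaze', 'amazed', 'amazing']  # same list as above
print([stemmer.stem(word) for word in words])
#outputs ['amaz', 'amaz', 'amaz']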

We will use NLTK again for our lemmatization. We also need to make sure the WordNet database is downloaded, since it acts as the lookup the lemmatizer uses to ensure it produces a real lemma.

import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print([lemmatizer.lemmatize(word) for word in words] )
#outputs ['amaze', 'amazed', 'amazing']

Clearly nothing has happened, and that is because lemmatization requires that we also provide the part-of-speech (POS) tag, which is a word's grammatical category, for example noun, adjective, or verb. In our case we can treat each word as a verb, which we can then implement like so:

from nltk.corpus import wordnet

print([lemmatizer.lemmatize(word, wordnet.VERB) for word in words])
#outputs ['amaze', 'amaze', 'amaze']
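
Fixing every word as a verb is fairly coarse. A possible extension (a sketch under the assumption that the NLTK tokenizer and POS tagger data are downloaded; resource names can differ between NLTK versions) is to derive each word's WordNet POS from pos_tag and fall back to noun when the tag doesn't map:

import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # resource name may vary by NLTK version

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # map Penn Treebank tags (returned by pos_tag) onto WordNet's POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # same default as lemmatize() without a POS argument

tokens = word_tokenize('she turned and was reading')
print([lemmatizer.lemmatize(word, to_wordnet_pos(tag)) for word, tag in pos_tag(tokens)])
#outputs something like ['she', 'turn', 'and', 'be', 'read'], depending on the tagger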