Neither the stemmer nor the lemmatizer seems to work properly; what should I do?

Asked 2025-01-29 09:10:52


I am new to text analysis and am trying to create a bag-of-words model (using sklearn's CountVectorizer method). I have a data frame with a column of text containing words like 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'.

I think that 'acid' and 'wood' should be the only words included in the final output; however, neither stemming nor lemmatizing seems to accomplish this.

Stemming produces 'acid', 'wood', 'woodi', 'woodsi', and lemmatizing produces the worse output 'acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'. I assume this is because the part of speech is not being specified accurately, although I am not sure where that specification should go. I have included it in the line X = vectorizer.fit_transform(df['text'], 'a') (I believe most of the words should be adjectives); however, it makes no difference in the output. (A sketch of passing the POS tag to lemmatize() directly follows the code below.)

What can I do to improve the output?

My full code is below:

!pip install nltk
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet corpus required by WordNetLemmatizer
nltk.download('omw-1.4')  # Open Multilingual Wordnet data

Data Frame:

df = pd.DataFrame()
df['text']=['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']

CountVectorizer with Stemmer:

analyzer = CountVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def stemmed_words(doc):
    # stem every token produced by CountVectorizer's default word analyzer
    return (stemmer.stem(w) for w in analyzer(doc))

# stop_words='english' is ignored (with a warning) once a callable
# analyzer is passed, so it is omitted here
vectorizer = CountVectorizer(analyzer=stemmed_words)
X = vectorizer.fit_transform(df['text'])
# get_feature_names() was renamed to get_feature_names_out() in scikit-learn 1.0
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

CountVectorizer with Lemmatizer:

# analyzer and lemmatizer reuse the objects defined in the stemmer block above
def lemed_words(doc):
    # without an explicit POS tag, lemmatize() treats every token as a noun
    return (lemmatizer.lemmatize(w) for w in analyzer(doc))

vectorizer = CountVectorizer(analyzer=lemed_words)
# the extra 'a' below is swallowed as fit_transform's ignored y argument,
# which is why it makes no difference to the output
X = vectorizer.fit_transform(df['text'], 'a')
df_bow_sklearn = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()
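
For reference, WordNetLemmatizer takes the part of speech as an argument to lemmatize() itself, not to fit_transform. Below is a minimal sketch of that variant, reusing the objects defined above; note that WordNet lemmatization only undoes inflection, so even with pos='a' the derivational forms 'acidic', 'acidity' and 'woodsy' may still come through unchanged:

def lemed_words_adj(doc):
    # pos='a' looks each token up in WordNet as an adjective
    return (lemmatizer.lemmatize(w, pos='a') for w in analyzer(doc))

vectorizer = CountVectorizer(analyzer=lemed_words_adj)
X = vectorizer.fit_transform(df['text'])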


Comments (1)

单身狗的梦 2025-02-05 09:10:52


This might simply be an under-performance issue with the WordNetLemmatizer and the stemmer.

Try different ones, for example these stemmers (a comparison sketch follows the list):

  • Porter (-> from nltk.stem import PorterStemmer)
  • Lancaster (-> from nltk.stem import LancasterStemmer)
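
A minimal sketch of dropping these into the same CountVectorizer pipeline and comparing the vocabularies they produce; which stemmer collapses 'woodsy'/'woody' down to 'wood' depends on its rule set, so printing the vocabulary is the quickest check:

import pandas as pd
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': ['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody']})
analyzer = CountVectorizer().build_analyzer()

for name, stemmer in [('porter', PorterStemmer()),
                      ('lancaster', LancasterStemmer()),
                      ('snowball', SnowballStemmer('english'))]:
    # bind the current stemmer through a default argument
    stem_tokens = lambda doc, s=stemmer: [s.stem(w) for w in analyzer(doc)]
    vectorizer = CountVectorizer(analyzer=stem_tokens)
    vectorizer.fit(df['text'])
    print(name, sorted(vectorizer.vocabulary_))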

Lemmatizers (a spaCy sketch follows the list):

  • spaCy (-> import spacy)
  • IWNLP (-> from spacy_iwnlp import spaCyIWNLP)
  • HanTa (-> from HanTa import HanoverTagger / note: more or less trained for German)
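
And a sketch of the spaCy route, assuming the small English model has been installed with python -m spacy download en_core_web_sm. spaCy tags the part of speech itself, so no manual pos= argument is needed; like WordNet, though, its lemmatizer handles inflection rather than derivation, so 'acidity' may still survive as-is:

import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')

def spacy_lemmas(doc):
    # let spaCy tokenize, tag, and lemmatize each document
    return [token.lemma_ for token in nlp(doc)]

vectorizer = CountVectorizer(analyzer=spacy_lemmas)
X = vectorizer.fit_transform(['acid', 'acidic', 'acidity', 'wood', 'woodsy', 'woody'])
print(sorted(vectorizer.vocabulary_))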

I had the same issue, and switching to a different stemmer and lemmatizer solved it. For closer instruction on how to properly implement these stemmers and lemmatizers, a quick web search turns up good examples for all of them.
