Adding words to the NLTK stopword list

Published 2024-10-29 04:38:54


I have some code that removes stop words from my data set. Since the stop list doesn't seem to remove a majority of the words I would like it to, I'm looking to add words to this stop list so that it will remove them for this case.
The code I'm using to remove stop words is:

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]

I'm unsure of the correct syntax for adding words and can't seem to find it anywhere. Any help is appreciated. Thanks.


祁梦 2024-11-05 04:38:54


You can simply use the append method to add words to it:

stopwords = nltk.corpus.stopwords.words('english')
stopwords.append('newWord')

or extend to append a list of words, as suggested by Charlie in the comments.

stopwords = nltk.corpus.stopwords.words('english')
newStopWords = ['stopWord1','stopWord2']
stopwords.extend(newStopWords)
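Putting this together with the question's filtering code, a minimal self-contained sketch (the short base list below is a hypothetical stand-in for the full NLTK English list, so the snippet runs without downloading corpus data):

```python
# Stand-in for nltk.corpus.stopwords.words('english') (hypothetical subset)
base_stopwords = ['a', 'an', 'the', 'is', 'of']

stopwords = list(base_stopwords)
stopwords.extend(['foo', 'bar'])  # custom additions

word_list = [' the ', 'quick', 'foo', 'analysis']
stop_set = set(stopwords)  # build the set once, not per word
word_list2 = [w.strip() for w in word_list if w.strip() not in stop_set]
print(word_list2)  # ['quick', 'analysis']
```

Converting the list to a set before filtering avoids rescanning the whole stopword list for every token.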
半步萧音过轻尘 2024-11-05 04:38:54
import nltk
stopwords = nltk.corpus.stopwords.words('english')
new_words = ('re', 'name', 'user', 'ct')
for i in new_words:
    stopwords.append(i)
print(stopwords)
梦情居士 2024-11-05 04:38:54


The way I did it on my Ubuntu machine: I searched for "stopwords" in root. It gave me a folder; I stepped inside it, and it contained different files. I opened "english", which had barely 128 words, added my words to it, saved, and was done.

一曲爱恨情仇 2024-11-05 04:38:54


The English stop words are in a file within nltk/corpus/stopwords/english.txt (I guess it would be there... I don't have NLTK on this machine; the best thing would be to search for 'english.txt' within the NLTK repo).

You can just add your new stop words to this file.

Also try looking at Bloom filters if your stop word list grows to a few hundred entries.
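For reference, the Bloom-filter idea mentioned above can be sketched in a few lines. This is an illustrative toy implementation (the class name, bit-array size, and hashing scheme are all assumptions), not an NLTK feature; a Bloom filter trades a small false-positive rate for compact, fast membership checks:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: compact probabilistic set membership."""

    def __init__(self, size=8192, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ['the', 'a', 'and', 'is']:
    bf.add(word)
print('the' in bf)  # True
```

For a stopword list of only a few hundred entries, a plain Python `set` is usually simpler and fast enough; a Bloom filter mainly pays off when memory is tight.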

诠释孤独 2024-11-05 04:38:54


I always do stopset = set(nltk.corpus.stopwords.words('english')) at the top of any module that needs it. Then it's easy to add more words to the set, and membership checks are faster.
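A minimal sketch of that pattern (the stopword subset and token list below are hypothetical stand-ins, so the snippet runs without the NLTK corpus download):

```python
# Stand-in for set(nltk.corpus.stopwords.words('english')) (hypothetical subset)
stopset = {'i', 'me', 'the', 'and'}
stopset.update(['re', 'user', 'ct'])  # add custom words to the set

tokens = ['the', 'user', 'logged', 'in']
filtered = [t for t in tokens if t not in stopset]
print(filtered)  # ['logged', 'in']
```

Set lookups are O(1) on average, versus O(n) for the list returned by `stopwords.words()`, which matters when filtering large corpora.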

无风消散 2024-11-05 04:38:54


I was also looking for a solution to this. After some trial and error I managed to add words to the stop list. Hope this helps.

from nltk.corpus import stopwords

def removeStopWords(text):
    # select English stopwords
    cachedStopWords = set(stopwords.words("english"))
    # add custom words
    cachedStopWords.update(('and', 'I', 'A', 'And', 'So', 'arnt', 'This', 'When', 'It',
                            'many', 'Many', 'so', 'cant', 'Yes', 'yes', 'No', 'no',
                            'These', 'these'))
    # remove stop words
    new_str = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return new_str
薄凉少年不暖心 2024-11-05 04:38:54
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# add new words to the list
new_stopwords = ["new", "custom", "words", "add", "to", "list"]
stopwrd = nltk.corpus.stopwords.words('english')
stopwrd.extend(new_stopwords)
调妓 2024-11-05 04:38:54


I use this code to add new stop words to the NLTK stop word list in Python:

from nltk.corpus import stopwords
#...#
stop_words = set(stopwords.words("english"))

#add words that aren't in the NLTK stopwords list
new_stopwords = ['apple','mango','banana']
new_stopwords_list = stop_words.union(new_stopwords)

print(new_stopwords_list)
从此见与不见 2024-11-05 04:38:54


I've found (Python 3.7, Jupyter notebook on Windows 10, corporate firewall) that creating a list and using the 'append' command results in the entire stopwords list being appended as a single element of the original list.

This makes 'stopwords' a list of lists.

Snijesh's answer works well, as does Jayantha's answer.
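The append-vs-extend pitfall described above can be demonstrated with plain lists (the names here are illustrative):

```python
base = ['a', 'the']
extra = ['foo', 'bar']

wrong = list(base)
wrong.append(extra)   # nests the whole list as a single element
print(wrong)          # ['a', 'the', ['foo', 'bar']]

right = list(base)
right.extend(extra)   # adds each word individually
print(right)          # ['a', 'the', 'foo', 'bar']
```

Because the nested list is not a string, tokens never match it, so the appended stopwords silently have no effect on filtering.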

梨涡少年 2024-11-05 04:38:54


STOP_WORDS.add("Lol")  # add a new stopword to the corpus as you wish
