How to remove stop words using nltk or python
I have a dataset from which I would like to remove stop words.
I used NLTK to get a list of stop words:
from nltk.corpus import stopwords
stopwords.words('english')
Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?
You could also do a set diff, for example:
To exclude all type of stop-words including nltk stop-words, you could do something like this:
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
There's a very simple light-weight python package stop-words just for this sake. First install the package using:
pip install stop-words
Then you can remove your words in one line using list comprehension:
This package is very light-weight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages.
Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):
Use the textcleaner library to remove stopwords from your data.
Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
Follow these steps to do so with this library.
After installing:
Use the above code to remove the stop-words.
Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.
In some cases, you don't want only to remove stop words. Rather, you may want to find the stop words in the text data and store them in a list, so that you can locate the noise in the data and make it more interactive.
The library is called 'textfeatures'. You can use it as follows. For example, suppose you have the following set of strings:
Now, call the stopwords() function and pass the parameters you want:
The result is going to be:
As you can see, the last column has the stop words included in that document (record).
You can use this function; note that you need to lowercase all the words first:
Using filter:
I will show you some examples.
First, I extract the text data from the data frame (twitter_df) for further processing as follows.
Then, to tokenize, I use the following method:
Then, to remove stop words:
I think this will help you.
Try this:
In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopwords list by default.