How to remove stop words with NLTK or Python

Posted on 2024-10-28 05:20:04

I have a dataset from which I would like to remove stop words.

I used NLTK to get a list of stop words:

from nltk.corpus import stopwords

stopwords.words('english')

Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?

Comments (14)

明明#如月 2024-11-04 05:20:04
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
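
Note that the comprehension above re-evaluates stopwords.words('english') for every word, and membership testing against a list is a linear scan. If performance matters, building a set once avoids both costs (a small sketch, assuming word_list is already tokenized):

from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))  # built once; set lookups are O(1)
filtered_words = [word for word in word_list if word not in stop_set]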
不知所踪 2024-11-04 05:20:04

You could also do a set diff, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
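
One caveat: the set difference discards the original token order and collapses duplicate tokens. If either matters, filtering the token list directly preserves both (a sketch, assuming sentence and pattern are defined as above):

import nltk

stop_set = set(nltk.corpus.stopwords.words('english'))
tokens = nltk.regexp_tokenize(sentence, pattern, gaps=True)
filtered = [t for t in tokens if t not in stop_set]  # keeps order and duplicates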
孤檠 2024-11-04 05:20:04

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
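
Since both packages return plain lists, merging them into a single set both deduplicates the combined entries and speeds up the membership test (a sketch of the same idea):

from stop_words import get_stop_words
from nltk.corpus import stopwords

combined = set(get_stop_words('en')) | set(stopwords.words('english'))
output = [w for w in word_list if w not in combined]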
空城缀染半城烟沙 2024-11-04 05:20:04

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
  if word in stopwords.words('english'): 
    filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword
我还不会笑 2024-11-04 05:20:04

There's a very simple, light-weight Python package stop-words just for this purpose.

First install the package using:
pip install stop-words

Then you can remove your words in one line using a list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very light-weight to download (unlike nltk), works with both Python 2 and Python 3, and it has stop words for many other languages like:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian
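
The performance note from the NLTK answers applies here too: the if clause of a comprehension runs once per word, so get_stop_words('english') is re-fetched on every iteration. Binding it beforehand, ideally as a set, avoids that (a sketch):

from stop_words import get_stop_words

stop_set = set(get_stop_words('english'))  # fetched once
filtered_words = [word for word in dataset if word not in stop_set]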
赴月观长安 2024-11-04 05:20:04

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # remove stopwords from text
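
Two caveats with plain str.split(): punctuation stays attached to words (so "words," will not match "words"), and the NLTK list is all lowercase, so capitalized stopwords like "The" slip through. A tokenizer plus a lowercased comparison handles both (a sketch):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english'))
tokens = word_tokenize(text)  # splits punctuation into separate tokens
text = ' '.join(w for w in tokens if w.lower() not in STOPWORDS)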
等你爱我 2024-11-04 05:20:04

Use the textcleaner library to remove stopwords from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>)
#you can also pass a list of sentences to the document class constructor
data.remove_stpwrds() #inplace is set to False by default

Use the code above to remove the stop words.

打小就很酷 2024-11-04 05:20:04
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# list-comprehension version
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# equivalent explicit-loop version (rebuilds the same list)
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
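
For reference, running this prints roughly the following. Note that "This" survives because the NLTK list only contains lowercase entries; comparing with w.lower() not in stop_words would drop it as well:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']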
烏雲後面有陽光 2024-11-04 05:20:04

Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.

In some cases, you don't just want to remove stop words. Rather, you may want to find the stop words in the text data and store them in a list, so that you can find the noise in the data and make it more interactive.

The library is called 'textfeatures'. You can use it as follows:

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words into a new "stopwords" column
df[["text","stopwords"]].head() # display the text and stopwords columns

The result is going to be:

    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column has the stop words included in that document (record).

汐鸠 2024-11-04 05:20:04

You can use this function; note that it lowercases all the words, since the NLTK stopword list is all lowercase:

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
倒带 2024-11-04 05:20:04

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
傲性难收 2024-11-04 05:20:04

I will show you an example. First, I extract the text data from the data frame (twitter_df) for further processing, as follows:

     import nltk
     tweetText = twitter_df['text']

Then, to tokenize, I use the following method:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove stop words:

     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x: [word for word in x if word not in stop_words])
     tweetText.head()

I think this will help you.
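
If you want the cleaned tweets back as whole strings rather than token lists (for example, to feed a vectorizer), one more apply rejoins them (a sketch):

     tweetText = tweetText.apply(lambda tokens: ' '.join(tokens))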

对风讲故事 2024-11-04 05:20:04

Try this:

sentence = " ".join([word for word in sentence.split() if word not in stopwords.words('english')])
套路撩心 2024-11-04 05:20:04

In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopword list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])