Why are these words considered stop words?
I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side can shed some light on this. I am playing around with the NLTK library, and I was specifically looking into the stopwords function provided by this package:
In [80]: nltk.corpus.stopwords.words('english')
Out[80]:
['i', 'me', 'my',
'myself', 'we', 'our', 'ours',
'ourselves', 'you', 'your',
'yours', 'yourself', 'yourselves',
'he', 'him', 'his', 'himself',
'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they',
'them', 'their', 'theirs',
'themselves', 'what', 'which',
'who', 'whom', 'this', 'that',
'these', 'those', 'am', 'is',
'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has',
'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the',
'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while',
'of', 'at', 'by', 'for', 'with',
'about', 'against', 'between',
'into', 'through', 'during',
'before', 'after', 'above',
'below', 'to', 'from', 'up',
'down', 'in', 'out', 'on', 'off',
'over', 'under', 'again',
'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why',
'how', 'all', 'any', 'both',
'each', 'few', 'more', 'most',
'other', 'some', 'such', 'no',
'nor', 'not', 'only', 'own',
'same', 'so', 'than', 'too',
'very', 's', 't', 'can', 'will',
'just', 'don', 'should', 'now']
What I don't understand is: why is the word "not" present? Isn't it necessary for determining the sentiment of a sentence? For instance, a sentence like this:
I am not sure what the problem is.
is totally different once the stopword not is removed, since the meaning of the sentence changes to its opposite (I am sure what the problem is).
). If that is the case, is there a set of rules that I am missing on when not to use these stopwords?
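To make the effect concrete, here is a minimal, self-contained sketch; the stop word set below is a small hand-copied subset of the NLTK output above, so the snippet runs without downloading the corpus:

```python
# Small subset of nltk.corpus.stopwords.words('english') from the output
# above -- enough to demonstrate the problem without the full corpus.
stop_words = {'i', 'am', 'not', 'what', 'the', 'is'}

sentence = "I am not sure what the problem is"
filtered = [w for w in sentence.lower().split() if w not in stop_words]
print(' '.join(filtered))  # the negation is gone: "sure problem"
```

The surviving words no longer carry the negation, which is exactly the meaning-flip described above.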
The concept of a stop word list does not have a universal meaning; it depends on what you want to do. If you have a task where you need to understand the polarity, sentiment, or a similar characteristic of a phrase, and your method depends on detecting negation (as in your example), then obviously you shouldn't remove "not" as a stop word (note that you may still want to remove other very common unrelated words, which would constitute your new stop word list).
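One way to build such a task-specific list, sketched below with illustrative word sets (the base list is a hand-copied subset of the NLTK output in the question, and the negation set is an assumption for the example), is to start from a standard stop word list and subtract the negation words your task needs:

```python
# Base list: a small subset of the NLTK English stop words shown in the
# question. 'negations' is an illustrative set of words to preserve for
# polarity detection.
base_stop_words = {'i', 'am', 'not', 'no', 'nor', 'what', 'the', 'is'}
negations = {'not', 'no', 'nor'}
custom_stop_words = base_stop_words - negations

sentence = "I am not sure what the problem is"
filtered = [w for w in sentence.lower().split() if w not in custom_stop_words]
print(' '.join(filtered))  # the negation survives: "not sure problem"
```

The filtered sentence now keeps "not", so a downstream polarity check can still detect the negation.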
However, to answer your question: most sentiment analysis methods are quite superficial. They look for emotion/sentiment-laden words and, most of the time, do not attempt a deep analysis of the sentence.
As another example where you would like to keep the stop words: if you are trying to classify documents according to their authors (authorship attribution) or carrying out stylometry, you should definitely keep these function words, as they characterize a big part of the style and the discourse.
However, for many other kinds of analyses (e.g. word space models, document similarity, search, etc.), removing very common function words makes sense both computationally (you process fewer words) and, in some cases, practically (you may even get better results with the stop words removed). If I'm trying to understand the contexts in which a specific word is typically used, I'd like to see the content words, not the function words.
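As a toy illustration of that last point (the tiny corpus, window size, and stop word subset are all made up for the example), filtering function words out of a target word's context window leaves only the informative content words:

```python
# Toy co-occurrence example: collect the words around each occurrence of
# a target word, dropping function words so only content words remain.
stop_words = {'the', 'a', 'is', 'of', 'in', 'and'}
corpus = "the car engine is loud and the engine needs oil".split()

target, window = 'engine', 2
context = []
for i, w in enumerate(corpus):
    if w == target:
        # neighbours within the window on both sides of the target
        neighbours = corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window]
        context.extend(n for n in neighbours if n not in stop_words)
print(context)  # content-word contexts of 'engine'
```

Without the stop word filter, the context would be dominated by "the", "is", and "and", which say nothing about how "engine" is used.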