Unsupervised sentiment analysis

Posted on 2024-09-27 20:59:32

I've been reading a lot of articles that explain the need for an initial set of texts that are classified as either 'positive' or 'negative' before a sentiment analysis system will really work.

My question is: Has anyone attempted just doing a rudimentary check of 'positive' adjectives vs 'negative' adjectives, taking into account any simple negators to avoid classing 'not happy' as positive? If so, are there any articles that discuss just why this strategy isn't realistic?
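A minimal sketch of the check described in the question, assuming tiny hand-picked word lists and a rule that flips polarity only when a negator directly precedes the word. The lexicons, the negator list, and the function names are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical toy lexicons, chosen only for illustration.
POSITIVE = {"happy", "good", "great", "excellent"}
NEGATIVE = {"sad", "bad", "poor", "terrible"}
NEGATORS = {"not", "no", "never"}


def lexicon_score(text):
    """Return positive hits minus negative hits, flipping negated hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            polarity = 1
        elif token in NEGATIVE:
            polarity = -1
        else:
            continue  # word carries no polarity in this lexicon
        # Flip the polarity if the immediately preceding token is a negator.
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


def classify(text):
    """Map the lexicon score to a coarse label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I am not happy with it"))
print(classify("A great phone"))
```

This only catches a negator directly before the polar word, so contractions such as "isn't" and longer-range negation ("not at all happy") slip through, and any word outside the lists is ignored.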

Comments (7)

醉城メ夜风 2024-10-04 20:59:32

A classic paper by Peter Turney (2002) explains a method to do unsupervised sentiment analysis (positive/negative classification) using only the words "excellent" and "poor" as a seed set. Turney uses the mutual information of other words with these two adjectives to achieve an accuracy of 74%.
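As I understand the paper, the score for a phrase is the difference between its pointwise mutual information (PMI) with each seed word. The notation below is a paraphrase written for this thread, with probabilities assumed to be estimated from co-occurrence counts in some corpus:

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}

\mathrm{SO}(\mathit{phrase}) =
  \mathrm{PMI}(\mathit{phrase}, \text{``excellent''})
  - \mathrm{PMI}(\mathit{phrase}, \text{``poor''})
```

A positive SO means the phrase co-occurs with "excellent" more than chance would predict, relative to "poor". A text is then labeled from the average SO of its phrases.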

若无相欠,怎会相见 2024-10-04 20:59:32

I haven't tried doing untrained sentiment analysis such as you are describing, but off the top of my head I'd say you're oversimplifying the problem. Simply analyzing adjectives is not enough to get a good grasp of the sentiment of a text; for example, consider the word 'stupid.' Alone, you would classify that as negative, but if a product review were to have '... [x] product makes their competitors look stupid for not thinking of this feature first...' then the sentiment in there would definitely be positive. The greater context in which words appear definitely matters in something like this. This is why an untrained bag-of-words approach alone (let alone an even more limited bag-of-adjectives) is not enough to tackle this problem adequately.

The pre-classified data ('training data') helps because the problem shifts from trying to determine from scratch whether a text has positive or negative sentiment, to trying to determine whether the text is more similar to known positive texts or known negative texts, and classifying it that way. The other big point is that textual analyses such as sentiment analysis are often greatly affected by how the characteristics of texts differ from domain to domain. This is why having a good set of data to train on (that is, accurate data from within the domain in which you are working, which is hopefully representative of the texts you are going to have to classify) is as important as building a good system to classify with.
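To make the "more similar to positive texts or negative texts" idea concrete, here is a minimal sketch of a Naive Bayes classifier written from scratch. The four training sentences are invented for illustration, and the technique is just one common choice of supervised classifier, not something prescribed above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_texts):
    """Count words per label from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training texts
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label in doc_counts:
        logp = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            logp += math.log(count / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical training examples, invented for illustration.
TRAINING = [
    ("makes the competitors look stupid", "positive"),
    ("great battery and a great screen", "positive"),
    ("stupid design and it broke in a week", "negative"),
    ("poor battery and a terrible screen", "negative"),
]
MODEL = train(TRAINING)

print(classify("great screen", MODEL))
print(classify("terrible design", MODEL))
```

With so little data the result depends entirely on which words happen to appear in which class, which is the domain-dependence point made above: the same classifier trained on texts from another domain would learn different associations.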

Not exactly an article, but hope that helps.

累赘 2024-10-04 20:59:32

The Turney (2002) paper mentioned by larsmans is a good basic one. In more recent research, Li and He [2009] introduce an approach that uses Latent Dirichlet Allocation (LDA) to train a model that can classify an article's overall sentiment and topic simultaneously in a totally unsupervised manner. The accuracy they achieve is 84.6%.

云柯 2024-10-04 20:59:32

I tried several sentiment analysis methods for opinion mining in reviews.
What worked best for me is the method described in Liu's book (http://www.cs.uic.edu/~liub/WebMiningBook.html). In this book, Liu and others compare many strategies and discuss different papers on sentiment analysis and opinion mining.

Although my main goal was to extract features from the opinions, I implemented a sentiment classifier to label these features as positive or negative.

I used NLTK for the pre-processing (word tokenization, POS tagging) and for creating the trigrams. I also used the Bayesian classifiers in this toolkit to compare against other strategies that Liu highlights.

One of the methods relies on tagging every trigram that expresses this information as pos/neg, and using some classifier on this data.
The other method I tried, which worked better (around 85% accuracy on my dataset), was to sum the PMI (pointwise mutual information) scores between every word in the sentence and the words excellent/poor, used as seeds for the pos/neg classes.
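A rough sketch of that second approach, under several assumptions of mine: co-occurrence is counted at sentence level, a small constant is added to joint counts to avoid log(0), and the four-sentence corpus is made up. It is not the exact code I ran, and a real corpus would need to be far larger for the PMI estimates to mean anything.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def build_counts(sentences):
    """Count, per word and per word pair, how many sentences contain them."""
    word_count = Counter()
    pair_count = Counter()
    for sentence in sentences:
        words = set(tokenize(sentence))
        word_count.update(words)
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))
    return word_count, pair_count, len(sentences)


def pmi(w1, w2, counts, smoothing=0.01):
    """Pointwise mutual information in bits; both words must occur in the corpus."""
    word_count, pair_count, n = counts
    joint = word_count[w1] if w1 == w2 else pair_count[frozenset((w1, w2))]
    # The smoothing constant keeps the log defined when the pair never co-occurs.
    return math.log2((joint + smoothing) * n / (word_count[w1] * word_count[w2]))


def orientation(word, counts, pos="excellent", neg="poor"):
    """PMI with the positive seed minus PMI with the negative seed."""
    return pmi(word, pos, counts) - pmi(word, neg, counts)


def sentence_score(sentence, counts):
    """Sum the orientation of every word that was seen in the corpus."""
    word_count = counts[0]
    return sum(orientation(w, counts) for w in tokenize(sentence) if word_count[w])


# Hypothetical toy corpus; both seed words must appear in it.
CORPUS = [
    "excellent screen and sharp display",
    "sharp display excellent value",
    "poor battery and slow charging",
    "slow charging poor support",
]
COUNTS = build_counts(CORPUS)

print(sentence_score("sharp display", COUNTS))
print(sentence_score("slow charging", COUNTS))
```

A sentence is labeled positive when its score is above zero and negative when below. Words that co-occur equally with both seeds, such as "and" here, contribute nothing.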

入怼 2024-10-04 20:59:32

I tried spotting keywords using a dictionary of affect to predict the sentiment label at sentence level. Given the generality of the vocabulary (not domain-dependent), the results were only about 61%. The paper is available on my homepage.

In a somewhat improved version, negation adverbs were considered. The whole system, named EmoLib, is available for demo:

http://dtminredis.housing.salle.url.edu:8080/EmoLib/

Regards,

蒲公英的约定 2024-10-04 20:59:32

David,

I'm not sure if this helps, but you may want to look into Jacob Perkins's blog post on using NLTK for sentiment analysis.

彻夜缠绵 2024-10-04 20:59:32

There are no magic "shortcuts" in sentiment analysis, as with any other sort of text analysis that seeks to discover the underlying "aboutness" of a chunk of text. Attempting to shortcut proven text analysis methods through simplistic "adjective" checking or similar approaches leads to ambiguity, incorrect classification, etc., which at the end of the day gives you a poor-accuracy read on sentiment. The more terse the source (e.g. Twitter), the more difficult the problem.
