A question about sentiment analysis
I have a question regarding sentiment analysis that I need help with.
Right now, I have a bunch of tweets I've gathered through the Twitter Search API. Because I chose the search terms, I know which subjects or entities (person names) I want to look at. I want to know how others feel about these people.
For starters, I downloaded a list of English words with known valence/sentiment scores and calculated the sentiment (+/-) based on which of these words appear in the tweet. The problem is that a sentiment calculated this way reflects the tone of the tweet rather than the opinion ABOUT the person.
For instance, I have this tweet:
"lol... Person A is a joke. lmao!"
The message is obviously positive in tone, but person A should get a negative score.
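To make the problem concrete, here is a minimal sketch of the word-list scoring described above, applied to that tweet. The lexicon entries and their values are made up for illustration; the actual downloaded word list and its scores are not shown in this post.

```python
import re

# Hypothetical valence scores; a real word list would supply these values.
LEXICON = {"lol": 2.0, "lmao": 3.0, "joke": 1.0, "awful": -3.0, "great": 3.0}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tweet_score(text):
    """Sum the valence of every lexicon word found anywhere in the tweet."""
    return sum(LEXICON.get(tok, 0.0) for tok in tokenize(text))

tweet = "lol... Person A is a joke. lmao!"
print(tweet_score(tweet))  # positive overall, although the opinion of Person A is negative
```

The score never looks at where the person is mentioned, so it can only measure the tone of the whole message.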
To improve my sentiment analysis, I can probably take negation and modifiers into account alongside my word list. But how exactly can I get my sentiment analysis to look at the subject of the message (and possibly sarcasm) instead?
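For reference, the negation and modifier handling mentioned above could look roughly like this. It is only a sketch: the word sets, the multipliers and the `reach` parameter are assumptions chosen for illustration, not part of any particular word list.

```python
import re

LEXICON = {"good": 2.0, "bad": -2.0, "funny": 2.0}            # hypothetical scores
NEGATORS = {"not", "never", "no"}                             # hypothetical set
INTENSIFIERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}  # hypothetical multipliers

def score_with_modifiers(text, reach=3):
    """Lexicon score where a negator flips, and an intensifier scales,
    the next sentiment word that occurs within `reach` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    sign, scale, age = 1.0, 1.0, 0
    for tok in tokens:
        if tok in NEGATORS:
            sign, age = -sign, 0
        elif tok in INTENSIFIERS:
            scale, age = scale * INTENSIFIERS[tok], 0
        else:
            age += 1
            if age > reach:
                # Pending modifiers are too far back to apply any more.
                sign, scale = 1.0, 1.0
            if tok in LEXICON:
                total += sign * scale * LEXICON[tok]
                sign, scale, age = 1.0, 1.0, 0
    return total

print(score_with_modifiers("not very good"))
```

This still scores the whole tweet, so it does nothing for the subject or sarcasm problem.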
It would be great if someone could direct me towards some resources.
Comments (3)
While you wait for answers from researchers in the AI field, I will give you some clues on what you can do quickly.
Even though this topic requires knowledge from natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of research going on in the field.
One possible approach to sentiment analysis is to treat it as a supervised learning problem, where you have a small training corpus that includes human-made annotations (more on that later) and a testing corpus on which you test how well your approach/system performs. For training you will need a classifier, such as an SVM, an HMM or something else, but keep it simple. I would start with binary classification: good, bad. You could do the same for a continuous spectrum of opinion, from positive to negative, that is, get a ranking, like Google, where the most valuable results come out on top.
For a start, check out the libsvm classifier; it is capable of both classification {good, bad} and regression (ranking).
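To show the train/test workflow without any dependencies, here is a rough sketch of binary {good, bad} classification. Note that it swaps in a small multinomial Naive Bayes classifier written in plain Python instead of libsvm, so it is a different algorithm from the SVM suggested above. The training sentences and labels are made up.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)  # class prior
            for tok in tokenize(text):
                if tok in self.vocab:  # words never seen in training are skipped
                    logp += math.log((counts[tok] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Made-up, hand-annotated training corpus.
train = [
    ("what a great leader", "good"),
    ("i love his speech", "good"),
    ("he is a joke", "bad"),
    ("terrible and awful decision", "bad"),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
print(model.predict("she is a joke"))
```

With a real annotated corpus you would hold out part of it as the testing corpus and measure accuracy there.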
The quality of the annotations will have a massive influence on the results you get, but where do you get them from?
I found one project about sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in the classification or regression.
The corpus consists of opinions from customers about restaurants they recently visited, with feedback about the food, service or atmosphere.
The connection between their opinions and the numerical world is expressed as the number of stars they gave the restaurant. You have natural language on one side and the restaurant's rating on the other.
Looking at this example, you can devise your own approach to the problem stated.
Take a look at nltk as well. With nltk you can do part-of-speech tagging and, with some luck, get names as well. Having done that, you can add a feature to your classifier that assigns a score to a name if there are words expressing opinions within n words of it (skip n-gram); look at the restaurant corpus for such words, or use the weights you already have. It's best, though, to rely on a classifier to learn the weights; that's its job.
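The "opinion words within n words of the name" feature could be sketched like this. It assumes the name is already known (as it is here, from the search terms), so it uses a plain regex tokenizer rather than nltk tagging. The opinion words and their weights are placeholders that a classifier would ideally learn.

```python
import re

# Hypothetical weights for opinion words when they are used about a person.
OPINION_WORDS = {"joke": -2.0, "great": 3.0, "awful": -3.0, "love": 3.0}

def target_score(text, target, window=3):
    """Sum the weights of opinion words within `window` tokens of each
    mention of `target`, ignoring the rest of the message."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target_tokens = target.lower().split()
    n = len(target_tokens)
    score = 0.0
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target_tokens:
            lo = max(0, i - window)
            hi = min(len(tokens), i + n + window)
            nearby = tokens[lo:i] + tokens[i + n:hi]
            score += sum(OPINION_WORDS.get(t, 0.0) for t in nearby)
    return score

tweet = "lol... Person A is a joke. lmao!"
print(target_score(tweet, "Person A"))
```

The returned value can be fed to the classifier as one feature per name, next to the usual bag-of-words features.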
In the current state of technology this is impossible.
English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is a special case of a joke, which is another exception in your program. Et cetera, et cetera.
A good example (posted by ScienceFriction somewhere here on SO):
If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)
I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making good progress towards a solution.
For example, part-of-speech tagging might help you figure out the subject, verb and object in a sentence. And n-grams might help you figure out the context in the Toyota vs. thriller example. Look at TagHelperTools. It is built on top of weka and provides part-of-speech and n-gram tagging.
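As a rough illustration of what n-gram features are, the sketch below extracts them in plain Python. It does not reproduce what TagHelperTools or weka actually output, and the sample tokens come from the tweet in the question.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["person", "a", "is", "a", "joke"]
for gram in ngrams(tokens, 2):
    print(gram)
```

A bigram such as ("a", "joke") keeps a little of the surrounding context that a single-word feature like "joke" loses.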
Still, it is difficult to get the results that the OP wants, but it won't take 40 years.