Help organizing my data to solve this machine learning problem
I want to categorize tweets within a given set of categories like {'sports', 'entertainment', 'love'}, etc...
My idea is to take the term frequencies of the most commonly used words to help me solve this problem. For example, the word 'love' shows up most frequently in the love category but it also shows up in sports and entertainment in the form of "I love this game" and "I love this movie".
To solve it, I envisioned a 3-axis graph where the x values are all the words used in my tweets, the y values are the categories, and the z values are the term frequencies (or some type of score) with respect to the word and the category. I would then map each word of the tweet onto the graph and add up the z values within each category. The category with the highest total z value is most likely the correct one. I know this is confusing, so let me give an example:
The word 'watch' shows up a lot in sports and entertainment ("I am watching the game" and "I am watching my favorite show"), so that narrows it down to those two categories at the least. But the word 'game' does not show up often in entertainment, and 'show' does not show up often in sports. The z value for 'watch' + 'game' will be highest for the sports category, and the z value for 'watch' + 'show' will be highest for entertainment.
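The scoring idea above can be sketched directly: build a per-category word-frequency table (the z values) from labelled tweets, then score a new tweet by summing each category's counts for its words. The training tweets below are made-up examples; in practice the table would come from real labelled data.

```python
from collections import Counter, defaultdict

# Hypothetical labelled tweets; the categories are the ones from the question.
training_data = [
    ("I am watching the game", "sports"),
    ("I love this game", "sports"),
    ("I am watching my favorite show", "entertainment"),
    ("I love this movie", "entertainment"),
    ("I love you so much", "love"),
]

# Build the "z values": term frequency of each word within each category.
freq = defaultdict(Counter)  # freq[category][word] -> count
for text, category in training_data:
    freq[category].update(text.lower().split())

def classify(tweet):
    """Score each category by summing per-word frequencies; pick the max."""
    words = tweet.lower().split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in freq.items()}
    return max(scores, key=scores.get)

print(classify("watching the game"))  # -> sports on this toy data
```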
Now that you understand how my idea works, I need help organizing this data so that a machine learning algorithm can predict categories when I give it a word or set of words. I've read a lot about SVMs and I think they're the way to go. I tried libsvm, but I can't seem to come up with a good input set. Also, libsvm does not support non-numeric values, which adds more complexity.
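On the libsvm input question: libsvm only accepts numeric features, so the usual workaround is a bag-of-words encoding, where each distinct word gets a numeric index and each tweet becomes a sparse vector of word counts. A minimal sketch (the example tweet and label are hypothetical):

```python
# Map each distinct word to a numeric feature index, shared across all tweets.
vocab = {}

def vectorize(text):
    """Turn a tweet into a sparse {feature_index: count} vector."""
    vec = {}
    for word in text.lower().split():
        idx = vocab.setdefault(word, len(vocab) + 1)  # libsvm indices start at 1
        vec[idx] = vec.get(idx, 0) + 1
    return vec

def to_libsvm_line(label, vec):
    """Format one example in libsvm's sparse format: '<label> <index>:<value> ...'."""
    feats = " ".join(f"{i}:{v}" for i, v in sorted(vec.items()))
    return f"{label} {feats}"

# Label 0 might stand for 'sports'; the mapping of labels to categories is up to you.
print(to_libsvm_line(0, vectorize("I am watching the game")))
```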
Any ideas? Do I even need a library, or should I just code up the decision-making myself?
Thanks all, I know this was long, sorry.
Well, you are trying to do text classification into a group of categories. Naive Bayes does this. In fact, it is a statistical analogue of your idea: it assumes that the frequencies of words in a text are independent indicators of a category, and it gives a probability for each category based on that assumption. It works well in practice; I believe Weka has an implementation.
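To make this concrete, here is a minimal multinomial Naive Bayes sketch in Python (rather than Weka) with Laplace smoothing, trained on a few made-up labelled tweets. It is a toy illustration of the idea, not a production classifier:

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled tweets using the question's categories.
docs = [
    ("I am watching the game", "sports"),
    ("I love this game so much", "sports"),
    ("I am watching my favorite show", "entertainment"),
    ("I love this movie", "entertainment"),
    ("I love you", "love"),
]

word_counts = defaultdict(Counter)  # word_counts[category][word] -> count
cat_counts = Counter()              # number of documents per category
vocab = set()
for text, cat in docs:
    words = text.lower().split()
    word_counts[cat].update(words)
    cat_counts[cat] += 1
    vocab.update(words)

def predict(text):
    """Multinomial Naive Bayes with Laplace smoothing, computed in log space."""
    words = text.lower().split()
    total_docs = sum(cat_counts.values())
    best_cat, best_lp = None, float("-inf")
    for cat in cat_counts:
        lp = math.log(cat_counts[cat] / total_docs)  # prior P(category)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[cat][w] + 1) / denom)  # P(word | category)
        if lp > best_lp:
            best_cat, best_lp = cat, lp
    return best_cat

print(predict("watching the game"))  # -> sports on this toy data
```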
You have to classify documents (here, tweets are your documents) based on their contents (word features) and put them into categories (sports, entertainment, love, etc.).
You can use a Naive Bayes classifier or a Fisher classifier (I prefer Fisher) to categorize your documents. You can find implementations of both in Python libraries.
Use stemming, lower-casing, stop-word (the, is, at, etc.) removal, and other pre-processing techniques to increase the efficiency.
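A rough sketch of that preprocessing step, assuming a crude suffix-stripper stands in for a real stemmer (a Porter stemmer, e.g. from NLTK, would do better) and a tiny hand-picked stop-word list:

```python
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "at", "a", "an", "i", "am", "my", "this"}

def crude_stem(word):
    """Very rough suffix stripping, just to illustrate stemming."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, drop stop-words, and stem what remains."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("I am watching the games"))  # -> ['watch', 'game']
```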
All you need is to go through Chapter 6 (Document Filtering) of the book Programming Collective Intelligence: Building Smart Web 2.0 Applications. It has a good explanation of both classifiers, plus examples and implementations in Python.