run it through their svm-scale utility, and then use their grid.py script to search for appropriate kernel parameters. The learning algorithm should be able to figure out the differing importance of the variables, though you may be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
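To make that concrete, here is a rough sketch of the pipeline (the event data, feature names, and file names below are all made up for illustration). libsvm's text format is one instance per line: the class label followed by index:value pairs.

    # Sketch: write events out in libsvm's sparse text format.
    events = [
        (1, {"price": 9.5, "age_days": 12.0}),    # (label, features) - made-up data
        (0, {"price": 3.0, "age_days": 400.0}),
    ]
    feature_index = {"price": 1, "age_days": 2}   # libsvm feature indices start at 1

    with open("events.train", "w") as f:
        for label, features in events:
            pairs = " ".join(f"{feature_index[k]}:{v}" for k, v in features.items())
            f.write(f"{label} {pairs}\n")

    # Then, with the tools shipped in the libsvm distribution:
    #   svm-scale -s scale.params events.train > events.scaled
    #   python grid.py events.scaled    # grid-searches C and gamma for the RBF kernel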
If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. It's only ever so slightly harder to deal with, and has a lot more options.
Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.
If you have some classified data (a bunch of sample problems paired with their correct answers), start by training some simple algorithms like K-Nearest-Neighbor and Perceptron and seeing if anything meaningful comes out of it. Don't bother trying to solve it optimally until you know whether you can solve it simply, or at all.
If you don't have any classified data, or not very much of it, start researching unsupervised learning algorithms.
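As a sketch of that first step, here is what a quick sanity-check baseline might look like; scikit-learn is my substitution (the answer doesn't name a library), and X and y below are random stand-ins for your data.

    # If k-NN or a perceptron can't beat chance here, fancier methods won't save you.
    import numpy as np
    from sklearn.linear_model import Perceptron
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 40))           # stand-in for your 30-50 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in for your labels

    for model in (KNeighborsClassifier(n_neighbors=5), Perceptron()):
        print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())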
It sounds like any kind of classifier should work for this problem: find the best class (your dependent variable) for an instance (your events). A simple starting point might be Naive Bayes classification.
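For what it's worth, a Naive Bayes baseline is only a few lines in, say, scikit-learn (my choice of library, with synthetic stand-in data):

    # Naive Bayes: fast to train and a reasonable first benchmark.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, n_features=40, random_state=0)
    print(cross_val_score(GaussianNB(), X, y, cv=5).mean())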
This is definitely a machine learning problem. Weka is an excellent choice if you know Java and want a nice GPL lib where all you have to do is select the classifier and write some glue. R is probably not going to cut it for that many instances (events, as you termed them) because it's pretty slow. Furthermore, in R you still need to find or write machine learning libs, though this should be easy given that it's a statistical language.
If you believe that your features (independent variables) are conditionally independent (meaning, independent given the dependent variable), naive Bayes is the perfect classifier, as it is fast, interpretable, accurate, and easy to implement. However, with 100,000 instances and only 30-50 features, you can likely implement a fairly complex classification scheme that captures a lot of the dependency structure in your data. Your best bet would probably be a support vector machine (SMO in Weka) or a random forest (yes, it's a silly name, but it helped the technique catch on). If you want an easily interpretable classifier, even at the expense of some accuracy, maybe a straight-up J48 decision tree would work. I'd recommend against neural nets, as they're really slow and don't usually work any better in practice than SVMs and random forests.
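If you'd rather not use Java, here is a rough scikit-learn rendering of those suggestions (my substitution, with synthetic stand-in data); SVC plays the role of Weka's SMO, and DecisionTreeClassifier stands in for the C4.5-style J48:

    # Compare the suggested classifiers under the same cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
    models = {
        "SVM (cf. SMO)": SVC(),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "decision tree (cf. J48)": DecisionTreeClassifier(random_state=0),
    }
    for name, model in models.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())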
The book Programming Collective Intelligence has a worked example, with source code, of a price predictor for laptops, which would probably be a good starting point for you.
SVMs are often the best classifier available, but it all depends on your problem and your data. For some problems other machine learning algorithms might be better; I have seen problems that neural networks (specifically recurrent neural networks) were better at solving. There is no right answer to this question since it is highly situation-dependent, but I agree with dsimcha and Jay that SVMs are the right place to start.
I believe your problem is a regression problem, not a classification problem. The main difference: in classification we are trying to learn the value of a discrete variable, while in regression we are trying to learn the value of a continuous one. The techniques involved may be similar, but the details are different. Linear regression is what most people try first, and there are lots of other regression techniques if linear regression doesn't do the trick.
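A minimal linear-regression sketch, assuming scikit-learn and synthetic stand-in data:

    # Ordinary least squares: the usual first attempt at a continuous target.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=1000, n_features=40, noise=10.0, random_state=0)
    print(cross_val_score(LinearRegression(), X, y, cv=5).mean())  # mean R^2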
You mentioned that you have 30-50 independent variables, and that some are more important than the rest. So, assuming that you have historical data (what we'd call a training set), you can use PCA (Principal Component Analysis) or other dimensionality reduction methods to reduce the number of independent variables. This step is of course optional. Depending on the situation, you may get better results by keeping every variable but adding a weight to each one based on how relevant it is. Here, PCA can help you compute how "relevant" each variable is.
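A minimal PCA sketch, assuming scikit-learn; the 95% variance threshold is an arbitrary knob, not something from the answer:

    # Keep only as many principal components as needed to explain 95% of the variance.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 40))          # stand-in for your 30-50 variables
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X.shape, "->", X_reduced.shape)
    print(pca.explained_variance_ratio_)     # how much each component matters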
You also mentioned that events that occurred more recently should be more important. If that's the case, you can weight recent events higher and older events lower. Note that the importance of an event doesn't have to grow linearly with time; it may make more sense for it to grow exponentially, so you can play with the numbers here. Or, if you are not lacking training data, perhaps you can consider dropping data that is too old.
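Exponential decay is easy to express as per-instance weights; the half-life below is a made-up tuning knob:

    # Recent events get weight near 1; events much older than the half-life fade out.
    import numpy as np

    age_days = np.array([1.0, 30.0, 365.0, 2000.0])  # stand-in event ages
    half_life = 180.0                                # assumption: about six months
    weights = 0.5 ** (age_days / half_life)
    print(weights)

    # Many learners accept these directly, e.g. model.fit(X, y, sample_weight=weights).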
Like Yuval F said, this does look more like a regression problem than a classification problem. Therefore, you can try SVR (Support Vector Regression), which is the regression version of the SVM (Support Vector Machine).
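An SVR sketch, again assuming scikit-learn and synthetic data; scaling the features first matters a lot for SVMs:

    # Support Vector Regression with an RBF kernel; C, epsilon, and gamma all need tuning.
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    X, y = make_regression(n_samples=1000, n_features=40, noise=10.0, random_state=0)
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    print(cross_val_score(model, X, y, cv=5).mean())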
Some other stuff you can try:
-Play around with how you scale the value range of your independent variables, usually [-1...1] or [0...1]. You can try other ranges to see if they help. Sometimes they do; most of the time they don't.
-If you suspect that there is a "hidden" feature vector with a lower dimension, say N << 30, and that it's non-linear in nature, you will need non-linear dimensionality reduction. You can read up on kernel PCA or, more recently, manifold sculpting. (A rough sketch of both of these items follows below.)
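Here is the promised sketch of those two items, assuming scikit-learn; the ranges, kernel, and target dimension are all things to experiment with:

    # Rescale the independent variables, then try a non-linear (kernel) PCA.
    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 40))

    X01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)    # [0...1]
    Xpm = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # [-1...1]

    # Kernel PCA for a suspected low-dimensional, non-linear structure (N << 30).
    X_low = KernelPCA(n_components=5, kernel="rbf").fit_transform(X01)
    print(X_low.shape)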
What you described is a classic classification problem. And in my opinion, why code fresh algorithms at all when you have a tool like Weka around? If I were you, I would run through a list of supervised learning algorithms (I don't completely understand why people are suggesting unsupervised learning first when this is so clearly a classification problem) using 10-fold (or k-fold) cross-validation, which is the default in Weka if I remember correctly, and see what results you get! I would try:
-Neural Nets
-SVMs
-Decision Trees (this one worked really well for me when I was doing a similar problem)
-Boosting with Decision trees/stumps
-Anything else!
Weka makes things so easy, and you really can get some useful information. I just took a machine learning class, and I did exactly what you're trying to do with the algorithms above, so I know where you're at. For me, boosting with decision stumps worked amazingly well. (BTW, boosting is actually a meta-algorithm and can be applied to most supervised learning algorithms, usually enhancing their results.)
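Since Weka is Java, here is what boosting with decision stumps under 10-fold cross-validation looks like in scikit-learn (my substitution, with synthetic stand-in data):

    # AdaBoost over depth-1 trees ("stumps"), evaluated with 10-fold CV.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=40, random_state=0)
    stumps = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # "base_estimator" in older versions
        n_estimators=100,
    )
    print(cross_val_score(stumps, X, y, cv=10).mean())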
A nice thing about using Decision Trees (if you use ID3 or a similar variety) is that they choose the attributes to split on in order of how well they differentiate the data - in other words, which attributes most quickly determine the classification. So you can check out the tree after running the algorithm and see which attribute of a comic book most strongly determines the price - it should be the root of the tree.
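You can see that directly by printing the top of a fitted tree; a sketch with scikit-learn's entropy criterion (ID3-flavored, though sklearn's trees are CART underneath) and synthetic data:

    # The first split printed is the root, i.e. the most discriminating attribute.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
    print(export_text(tree, max_depth=1))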
Edit: I think Yuval is right - I wasn't paying attention to the problem of discretizing your price value for the classification. I don't know whether regression is available in Weka, but you can still pretty easily apply classification techniques to this problem. You need to make classes of price values - as in, a number of price ranges for the comics - so that you can have a discrete number (like 1 through 10) that represents the price of the comic. Then you can easily run classification on it.
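A sketch of that discretization step (the prices are made up; quantile edges are one reasonable way to pick the ranges):

    # Bin continuous prices into ten discrete classes for a classifier.
    import numpy as np

    prices = np.array([1.25, 3.0, 12.5, 40.0, 95.0, 250.0])     # stand-in prices
    edges = np.quantile(prices, np.linspace(0, 1, 11)[1:-1])    # 9 interior bin edges
    classes = np.digitize(prices, edges) + 1                    # classes 1 through 10
    print(classes)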