State-of-the-art classification algorithms
We know there are something like a thousand classifiers. I was recently told that some people consider AdaBoost the "off-the-shelf" one.
- Are there better algorithms (with that voting idea)?
- What is the state of the art in classifiers? Do you have an example?
First, AdaBoost is a meta-algorithm that is used in conjunction with (on top of) your favorite classifier. Second, classifiers that work well in one problem domain often don't work well in another; see the Wikipedia page on the No Free Lunch theorem. So there is not going to be a single answer to your question. Still, it might be interesting to know what people are using in practice.
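To make the "meta-algorithm on top of a classifier" point concrete, here is a minimal sketch of AdaBoost in plain Python. It assumes one-level decision stumps as the base classifier and labels in {-1, +1}; the function names and the toy data are made up for illustration and do not come from any particular library.

```python
import math

def stump_predict(stump, row):
    """A stump predicts `polarity` when row[feature] <= threshold, else -polarity."""
    feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best_stump = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_err, best_stump

def adaboost(X, y, rounds=10):
    """Return a list of (vote weight, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, stump = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold can separate: + + + - - - + + +
X = [[x] for x in range(9)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1]

def accuracy(ensemble):
    return sum(predict(ensemble, row) == yi for row, yi in zip(X, y)) / len(y)

single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
print(accuracy(single), accuracy(boosted))
```

On this toy data a single stump gets 6 of the 9 points right, while the weighted vote of three stumps fits all nine. Any classifier that accepts per-example weights could take the place of `train_stump`, which is the sense in which AdaBoost sits on top of another learner.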
Weka and Mahout aren't algorithms; they're machine learning libraries that include implementations of a wide range of algorithms. So your best bet is to pick a library and try a few different algorithms to see which one works best for your particular problem (where "works best" is going to be a function of training cost, classification cost, and classification accuracy).
If it were me, I'd start with naive Bayes, k-nearest neighbors (k-NN), and support vector machines (SVMs). They represent well-established, well-understood methods with very different tradeoffs. Naive Bayes is cheap, but not especially accurate. k-NN is cheap during training but can be expensive during classification, and while it's usually very accurate, it can be susceptible to overfitting. SVMs are expensive to train and have lots of hyperparameters to tweak, but they are cheap to apply and generally at least as accurate as k-NN.
If you tell us more about the problem you're trying to solve, we may be able to give more focused advice. But if you're just looking for the One True Algorithm, there isn't one: the No Free Lunch theorem guarantees that.
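The "try a few and measure" advice can be sketched as a small harness that records training time, classification time, and accuracy for each candidate. This is only an illustration under stated assumptions: the two classifiers are tiny from-scratch versions of Gaussian naive Bayes and k-NN (an SVM is left out because a from-scratch one would be too long; in practice you would use your library's implementations), and the data is synthetic (two Gaussian blobs). All names here are hypothetical.

```python
import math
import random
import time
from collections import Counter

class GaussianNaiveBayes:
    def fit(self, X, y):
        # Per class: log prior, per-feature mean and variance.
        self.stats = {}
        for label in set(y):
            rows = [x for x, yi in zip(X, y) if yi == label]
            cols = list(zip(*rows))
            means = [sum(col) / len(rows) for col in cols]
            variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                         for col, m in zip(cols, means)]
            self.stats[label] = (math.log(len(rows) / len(X)), means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, means, variances = self.stats[label]
            return prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(x, means, variances))
        return max(self.stats, key=log_posterior)

class KNearestNeighbors:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; all the work happens at predict time.
        self.data = list(zip(X, y))
        return self

    def predict(self, x):
        nearest = sorted(self.data, key=lambda item: math.dist(item[0], x))[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

def evaluate(model, X_train, y_train, X_test, y_test):
    """Measure the three quantities from the answer above."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    predictions = [model.predict(x) for x in X_test]
    t2 = time.perf_counter()
    accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
    return {"train_s": t1 - t0, "predict_s": t2 - t1, "accuracy": accuracy}

def make_blobs(n_per_class, rng):
    """Synthetic 2-D data: class 0 around (0, 0), class 1 around (4, 4)."""
    X, y = [], []
    for label, center in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            X.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
            y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_blobs(200, rng)
X_test, y_test = make_blobs(100, rng)

results = {type(model).__name__: evaluate(model, X_train, y_train, X_test, y_test)
           for model in (GaussianNaiveBayes(), KNearestNeighbors(k=5))}
for name, metrics in results.items():
    print(name, metrics)
```

On well-separated blobs like these both models should score high, so the interesting columns are the timings: k-NN does almost nothing in `fit` and pays for it in `predict`. The numbers will differ on real data, which is the reason to run the comparison on your own problem.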
Apache Mahout (open source, Java) seems to be picking up a lot of steam.
Weka is a very popular and stable machine learning library. It has been around for quite a while and is written in Java.
Hastie et al. (2013, The Elements of Statistical Learning) conclude that the gradient boosting machine is the best "off-the-shelf" method, independent of the problem you have.
Definition (see page 352):
"An 'off-the-shelf' method is one that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing or careful tuning of the learning procedure."
And a somewhat older assessment:
"In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the 'best off-the-shelf classifier in the world' (see also Breiman (1998))."
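For readers who have not seen gradient boosting, the core loop is short: repeatedly fit a small regression model to the current residuals and add a shrunken copy of it to the ensemble. The sketch below is a simplified illustration, not the procedure from the book: it assumes squared-error loss on labels in {-1, +1} (real gradient boosting machines for classification typically use a log-loss objective and deeper trees), one-level regression stumps, and made-up function names.

```python
def stump_value(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value

def fit_regression_stump(X, residuals):
    """Pick the split that minimises squared error when each side predicts its mean."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right) if right else left_mean
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, (feature, threshold, left_mean, right_mean))
    return best[1]

def gradient_boost(X, y, rounds=20, learning_rate=0.5):
    base = sum(y) / len(y)                 # initial constant prediction
    F = [base] * len(X)                    # current ensemble output per example
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual y - F.
        residuals = [yi - fi for yi, fi in zip(y, F)]
        stump = fit_regression_stump(X, residuals)
        stumps.append(stump)
        F = [fi + learning_rate * stump_value(stump, row) for fi, row in zip(F, X)]
    return base, learning_rate, stumps

def predict_score(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, row) for s in stumps)

def predict_label(model, row):
    return 1 if predict_score(model, row) >= 0 else -1

# Toy 1-D data that no single threshold can separate: + + + - - - + + +
X = [[x] for x in range(9)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1]

model = gradient_boost(X, y)
print([predict_label(model, row) for row in X])
```

Compared with AdaBoost, which reweights examples, this variant fits each new stump to what the ensemble still gets wrong, and `learning_rate` controls how much of each correction is kept.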