In machine learning, what can be done to limit the number of training samples required?
In many applications, creating a large training dataset can be very costly, if not outright impossible. So what steps can one take to limit the dataset size needed for good accuracy?
Well, there is a branch of machine learning specifically dedicated to solving this problem (labeling datasets is costly): semi-supervised learning.
Honestly, in my experience the computation takes horrendously long and the results pale in comparison with fully labeled datasets... but training on a large unlabeled dataset is better than nothing!
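As a concrete illustration, here is a minimal semi-supervised sketch using scikit-learn's `LabelSpreading` (the library and dataset are my choice, not something the answer specifies): most labels are hidden, and the model propagates the few known labels to the unlabeled points.

```python
# Minimal semi-supervised sketch: unlabeled samples are marked -1,
# and LabelSpreading propagates the few known labels to them.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)

# Pretend labeling is expensive: keep only 50 labels, hide the rest.
rng = np.random.RandomState(0)
y_partial = np.full_like(y, -1)
labeled_idx = rng.choice(len(y), size=50, replace=False)
y_partial[labeled_idx] = y[labeled_idx]

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# Evaluate on the samples whose labels were hidden during training.
mask = y_partial == -1
print("accuracy on unlabeled portion:",
      (model.transduction_[mask] == y[mask]).mean())
```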
Edit: Well, I first understood the question as "Labeling a dataset is expensive" rather than "The size of the dataset will be small no matter what".
Well, among other things, I would:
Tune my parameters with leave-one-out cross-validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#Common_types_of_cross-validation). It is the most computationally expensive scheme, but the best one; see the first sketch after this list.
Choose algorithms that converge rather quickly. (You would need a comparison table, which I do not have at hand.)
Prefer algorithms with very good generalization properties. Linear combinations of weak classifiers do quite well in this case; kNN (k-nearest neighbours) does extremely badly. See the second sketch after this list.
Bias the "generalization" parameter. Most algorithm consist in a compromise between generalization (regularity) and quality (is the training set well classified by the classifier?). If your dataset is small, you should bias the algorithm toward generalization (after tuning the parameters with cross validation)