Pruning a decision tree
How do I prune a decision tree built with ID3 when there are too few examples in the training set?
I cannot divide the data into separate training, validation, and test sets, so that is out of the question.
Are there any statistical methods that might be used instead, or something like that?
1 Answer
Yes, when you have a small amount of data, cross-validation can be used to train and prune your tree. The idea is fairly simple. You divide your data into N sets and train your tree with N−1 of them, using the last set as your pruning test set. Then you pick another one of the N sets to leave out and do the same thing. Repeat this until every set has been left out once; that means you'll have built N trees. You use these N trees to calculate an optimal size for the tree, then train on the full set of data and prune that tree to the calculated size. It's more complex than I can effectively describe here, but here is an article about how to adapt cross-validation to ID3:
Decision Tree Cross Validation
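
To make that concrete, here is a minimal sketch in Python, assuming scikit-learn is available. Note the substitutions: it uses sklearn's CART-style `DecisionTreeClassifier` rather than a hand-rolled ID3 tree (with `criterion="entropy"` the splits use ID3's information-gain rule), and it controls "tree size" with the `max_leaf_nodes` cap rather than reduced-error pruning. The dataset and the candidate size range are illustrative placeholders; the cross-validation logic is the part that carries over.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for your small dataset; replace with your own X, y.
X, y = load_iris(return_X_y=True)

# Candidate tree sizes, expressed as a cap on the number of leaves.
sizes = list(range(2, 20))

# Score each candidate size with N-fold cross-validation (N=10 here;
# see the note on fold count below).
cv = KFold(n_splits=10, shuffle=True, random_state=0)
mean_scores = [
    cross_val_score(
        DecisionTreeClassifier(criterion="entropy",
                               max_leaf_nodes=size, random_state=0),
        X, y, cv=cv,
    ).mean()
    for size in sizes
]

# Keep the size that validated best, then train on the full dataset.
best_size = sizes[int(np.argmax(mean_scores))]
final_tree = DecisionTreeClassifier(
    criterion="entropy", max_leaf_nodes=best_size, random_state=0
).fit(X, y)
print(f"best size: {best_size} leaves, CV accuracy: {max(mean_scores):.3f}")
```

The key point is the two-stage structure: the folds are used only to estimate the best size, and the final tree is trained once on all of the data with that size enforced.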
A lot of research has been conducted on the proper number of folds for cross-validation, and N=10 has been found to give the best results for the extra processing time. Cross-validation increases your computation time considerably (by a factor of N), but when you have a small amount of data it helps overcome the small number of samples. And since you don't have a lot of data, using cross-validation isn't that bad computationally.