The effect of decision tree pruning
I want to know: suppose I build a decision tree A with ID3 from a training and validation set, but A is unpruned.
At the same time, I have another decision tree B, also built with ID3 from the same training and validation set, but B is pruned.
Now I test both A and B on a future unlabeled test set. Is it always the case that the pruned tree will perform better?
Any ideas are welcome, thanks.
4 Answers
I think we need to make the distinction clearer: a pruned tree always performs at least as well on the validation set, but not necessarily on the test set (and on the training set it performs equally well or worse). I am assuming that the pruning is done after the tree is built (i.e., post-pruning).
Remember that the whole reason for using a validation set is to avoid overfitting the training dataset, and the key point here is generalization: we want a model (a decision tree) that generalizes beyond the instances provided at "training time" to new, unseen examples.
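To make this concrete, here is a minimal sketch of the experiment the question describes. It uses scikit-learn, which implements CART with cost-complexity pruning rather than ID3 with reduced-error pruning, so it only stands in for the original setting; the dataset and split sizes are arbitrary illustrations. Tree B is the pruning level that maximizes validation accuracy, yet its test accuracy may or may not beat the unpruned tree A.

```python
# Sketch: compare an unpruned tree A with a tree B pruned to maximize
# validation accuracy, then check both on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tree A: grown until the leaves are pure (unpruned).
tree_a = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Tree B: among all cost-complexity pruning levels, keep the one that
# scores best on the validation set (a stand-in for reduced-error pruning).
alphas = tree_a.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
tree_b = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)

for name, t in [("A (unpruned)", tree_a), ("B (pruned)", tree_b)]:
    print(name,
          "validation=%.3f" % t.score(X_val, y_val),
          "test=%.3f" % t.score(X_test, y_test))
```

By construction, B's validation score is at least A's; the test scores are the interesting part, since nothing guarantees B wins there.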
Pruning is supposed to improve classification by preventing overfitting. Since pruning occurs only when it improves the classification rate on the validation set, a pruned tree will perform as well as or better than an unpruned tree during validation.
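This "prune only if validation accuracy does not drop" rule is exactly reduced-error pruning. Below is a minimal sketch of it; the dict-based tree representation and the helper names (predict, accuracy, reduced_error_prune) are hypothetical, not from any particular library. A node is either a leaf {"label": c} or an internal node with "feature", "threshold", "left", "right", and "majority" (the most common training label under it).

```python
def predict(node, x):
    """Route an example down the tree until a leaf is reached."""
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def accuracy(node, X, y):
    return sum(predict(node, x) == t for x, t in zip(X, y)) / len(y)

def reduced_error_prune(node, root, X_val, y_val):
    """Bottom-up: collapse a subtree into its majority-class leaf whenever
    doing so does not decrease accuracy on the validation set."""
    if "label" in node:
        return node
    node["left"] = reduced_error_prune(node["left"], root, X_val, y_val)
    node["right"] = reduced_error_prune(node["right"], root, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = dict(node)                  # remember the subtree
    node.clear()
    node["label"] = saved["majority"]   # tentatively collapse to a leaf
    if accuracy(root, X_val, y_val) < before:
        node.clear()
        node.update(saved)              # pruning hurt: restore the subtree
    return node
```

Because a collapse is kept only when validation accuracy stays equal or improves, the pruned tree can never do worse than the unpruned one on the validation set, which is the guarantee this answer describes.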
Bad pruning can lead to wrong results. Although a smaller decision tree is often desirable in itself, the usual aim of pruning is better predictive results, so how you prune is the crux of the matter.
I agree with the first answer by @AMRO.
Post-pruning is the most common approach to decision tree pruning, and it is done after the tree is built. But pre-pruning can also be done: the tree is pruned by halting its construction early, using a specified threshold value, for example by deciding not to split the subset of training tuples at a given node. That node then becomes a leaf, which may hold the most frequent class among the subset of tuples, or the class probabilities of those tuples.
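In practice, pre-pruning is usually expressed as early-stopping thresholds on tree growth. A minimal sketch, assuming scikit-learn (max_depth, min_samples_split, and min_impurity_decrease are real estimator parameters, but the threshold values here are purely illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: construction halts at a node as soon as a threshold is hit,
# and that node becomes a leaf holding the majority class (class
# probabilities are available via predict_proba).
pre_pruned = DecisionTreeClassifier(
    max_depth=5,                 # stop growing below depth 5
    min_samples_split=20,        # don't split a node with fewer than 20 tuples
    min_impurity_decrease=0.01,  # don't split unless impurity drops enough
)
# pre_pruned.fit(X_train, y_train)
# pre_pruned.predict_proba(X_test)   # leaves report class probabilities
```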