Stopping conditions when building a decision tree
I am writing my own code for a decision tree. I need to decide when to terminate the tree-building process. I thought of limiting the height of the tree, but this seems trivial. Could anyone give me a better idea of how to define the stopping condition?
There is little context in your question, but I assume you are constructing a tree from a large set of data? In that case, one solution is, in addition to a "LearnSet", to take a "StopSet" of examples and regularly verify your decision-making process on this StopSet. If quality decreases, this is an indication that you are overtraining on the LearnSet.
I deliberately say "StopSet" and not "TestSet", because afterwards you should apply your decision tree to the TestSet to assess the real quality.
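A minimal sketch of this idea in Python. The helpers `grow_one_level` and `accuracy` are hypothetical names of mine, not from the answer above:

```python
# Sketch of StopSet-based early stopping. grow_one_level (adds one more
# level of splits using the LearnSet) and accuracy (fraction of correct
# predictions on a data set) are hypothetical helpers, for illustration only.

def grow_with_stop_set(learn_set, stop_set, max_levels=50):
    """Grow the tree level by level; stop once StopSet quality decreases."""
    tree = None
    best_tree, best_score = None, float("-inf")
    for _ in range(max_levels):
        tree = grow_one_level(tree, learn_set)  # hypothetical: add one level of splits
        score = accuracy(tree, stop_set)        # quality on the held-out StopSet
        if score < best_score:
            return best_tree                    # quality dropped: overtraining on the LearnSet
        best_tree, best_score = tree, score
    return best_tree
```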
As a decision tree produces imbalanced splits, one part of the tree can be heavier than the other part. Hence it is not sensible to use the height of the tree, because that stops growth everywhere at the same level. Far better is to use the minimal number of observations required for a split search. This is elaborated in more detail at http://zyxo.wordpress.com/2011/07/04/how-to-use-the-settings-to-control-the-size-of-decision-trees/
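As a sketch, this check is a one-liner; the threshold value here is illustrative, not a recommendation:

```python
# Sketch: do not even search for a split when a node holds too few
# observations. The threshold of 20 is illustrative only.
MIN_OBS_FOR_SPLIT = 20

def worth_splitting(node_samples):
    """True if the node is large enough to justify a split search."""
    return len(node_samples) >= MIN_OBS_FOR_SPLIT
```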
This is a somewhat old question, but I think I can improve the answers. The theory is that you stop when a split is pure (i.e., impurity = 0), or when all members in the left or right node have the same output value. For example, if you are trying to predict heart attack or not, and on a given split one group is all heart attacks or all non-heart-attacks, then you can safely stop splitting that group, because everything is the same and you can safely predict the common value. This theory is supported by the pruning process, because you can build a very tall tree and, if a node doesn't contribute to accuracy, it gets pruned.
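The answer doesn't fix a particular impurity measure; Gini impurity is one common choice. A minimal sketch of the purity test, assuming a non-empty node:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels; 0.0 means the node is pure."""
    n = len(labels)  # assumes a non-empty node
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def is_pure(labels):
    """Stopping test: all members of the node share the same output value."""
    return gini_impurity(labels) == 0.0
```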
Now, it's rare that you get entirely pure splits. Often, in order to split the data into entirely pure sets, you'll split repeatedly, making smaller and smaller sets, until you reach a single observation in each node. Tall trees usually won't survive the pruning process, and you are more than likely overfitting the training data anyway. So it is common practice to save yourself the extra time in the pruning algorithm by simply limiting the number of observations you are willing to split on, and setting a minimum on the size of each resulting split. You aren't going to keep a split that results in 1 and 999 observations. That's a bad split; try again.
So you add config parameters, tweakable by the user, for the minimum number of observations in a node (i.e., after a split) and the minimum number of observations required to attempt a split (before you split).
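A sketch of these two tunables; the names mirror scikit-learn's `min_samples_split` and `min_samples_leaf` parameters, and the default values are illustrative only:

```python
# Sketch of the two user-tweakable thresholds described above.
# Default values are illustrative, not recommendations.

def split_allowed(n_parent, n_left, n_right,
                  min_samples_split=20, min_samples_leaf=5):
    """Reject splits on tiny nodes and splits that create tiny children
    (e.g. the 1-vs-999 split mentioned above)."""
    if n_parent < min_samples_split:  # too few observations to split at all
        return False
    return min(n_left, n_right) >= min_samples_leaf
```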
The final criterion is also whether your splits improve on the last measurement of purity: if a node can't be split so as to produce a purer set than it had before, you can stop, because you're going in the wrong direction. Essentially, if I(s) is the purity measurement of the node you are splitting, I(s[l]) is the purity of the left split set, I(s[r]) is the purity of the right split set, and p(s) is the portion of that set relative to the parent set, then:

Gain = I(s) - p(s[l]) * I(s[l]) - p(s[r]) * I(s[r])

And you stop if that Gain < 0, because you aren't getting any more purity by splitting.
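Expressed as code, reusing the `gini_impurity` helper from the sketch above (the `impurity_gain` name is mine, for illustration):

```python
def impurity_gain(parent_labels, left_labels, right_labels):
    """Gain = I(s) - p(s[l])*I(s[l]) - p(s[r])*I(s[r]).
    Stop splitting this node when the best achievable gain is <= 0."""
    n = len(parent_labels)
    p_l = len(left_labels) / n   # portion of the parent going left
    p_r = len(right_labels) / n  # portion of the parent going right
    return (gini_impurity(parent_labels)
            - p_l * gini_impurity(left_labels)
            - p_r * gini_impurity(right_labels))
```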
We have three general termination conditions:
1. All of the tuples in the partition belong to the same class.
2. There are no remaining attributes on which the tuples may be further partitioned.
3. There are no tuples for a given branch.
And of course you can add other conditions, such as a maximum depth; a short sketch covering all of these follows below.
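As a sketch, the three conditions (plus a max-depth extra) combine into a single test. The `should_terminate` name and the assumption that each tuple is an (attributes, class_label) pair are mine, for illustration only:

```python
# Sketch of the three termination conditions plus an optional max depth.
# Assumes each tuple is an (attributes, class_label) pair; this data layout
# is an assumption for illustration, not part of the answer above.

def should_terminate(tuples, remaining_attributes, depth, max_depth=None):
    if not tuples:                                    # 3. no tuples for this branch
        return True
    if len({label for _, label in tuples}) == 1:      # 1. all tuples in the same class
        return True
    if not remaining_attributes:                      # 2. no attributes left to partition on
        return True
    if max_depth is not None and depth >= max_depth:  # extra condition: max depth
        return True
    return False
```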