Decision Tree Learning and Impurity
There are three ways to measure impurity:
What are the differences and appropriate use cases for each method?
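For reference, here is a minimal sketch of the three measures as they are usually defined for decision trees (entropy, Gini index, and classification error - an assumption on my part, since the question's formulas are not reproduced here):

```python
import numpy as np

def entropy(p):
    """Shannon entropy: -sum(p_i * log2(p_i)); zero-probability classes contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini index: 1 - sum(p_i^2)."""
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    """Misclassification error: 1 - max(p_i)."""
    return 1.0 - np.max(p)

# Class proportions at a hypothetical node with a 60/30/10 class mix.
p = np.array([0.6, 0.3, 0.1])
print(entropy(p), gini(p), classification_error(p))  # ~1.295, 0.54, 0.4
```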
Comments (4)
If the p_i values are very small, then doing multiplication on very small numbers (Gini index) can lead to rounding error. Because of that, it is better to add the logs (entropy). Classification error, following your definition, provides only a gross estimate, since it uses the single largest p_i to compute its value.
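A quick numerical illustration of this point (a sketch only - the p_i used here is far smaller than any class frequency a real node would produce):

```python
import numpy as np

p = 1e-200         # an artificially tiny class probability
print(p ** 2)      # 0.0 -- squaring underflows past float64's smallest value (~1e-308)
print(np.log2(p))  # about -664.4 -- the logarithm stays comfortably representable
```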
The difference between entropy and other impurity measures, and in fact often the difference between information-theoretic approaches in machine learning and other approaches, is that entropy has been mathematically proven to capture the concept of 'information'. There are many characterization theorems (theorems that prove a particular function or mathematical object is the only one satisfying a set of criteria) for entropy measures, and these formalize philosophical arguments justifying their meaning as measures of 'information'.
Contrast this with other approaches (especially statistical methods) that are chosen not for their philosophical justification but primarily for their empirical justification - that is, they seem to perform well in experiments. The reason they perform well is that they contain additional assumptions that may happen to hold at the time of the experiment.
In practical terms this means entropy measures (A) can't over-fit when used properly, as they are free from any assumptions about the data, (B) are more likely to perform better than random because they generalize to any dataset, but (C) may not perform as well on a specific dataset as measures that adopt assumptions.
When deciding which measures to use in machine learning, it often comes down to long-term vs. short-term gains, and maintainability. Entropy measures often work long-term because of (A) and (B), and if something goes wrong it's easier to track down and explain why (e.g. a bug in obtaining the training data). Other approaches, by (C), might give short-term gains, but if they stop working it can be very hard to distinguish, say, a bug in infrastructure from a genuine change in the data where the assumptions no longer hold.
A classic example of models suddenly ceasing to work is the global financial crisis. Bankers were being given bonuses for short-term gains, so they wrote statistical models that would perform well short-term and largely ignored information-theoretic models.
I found this description of impurity measures quite useful. Unless you are implementing from scratch, most existing implementations use a single, predetermined impurity measure. Note also that the Gini index is not a direct measure of impurity, at least not in its original formulation, and that there are many more impurity measures than those you list above.
I'm not sure that I understand the concern about small numbers and the Gini impurity measure... I can't imagine how this would happen when splitting a node.
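To make that intuition concrete: at a node, each p_i is a ratio of integer sample counts, so values small enough to underflow would require an implausibly large node. A minimal sketch of scoring a candidate split with Gini impurity (hypothetical counts):

```python
import numpy as np

def gini(counts):
    """Gini impurity from per-class sample counts at a node."""
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical two-class node with 100 samples, split into two children.
left, right = np.array([40, 10]), np.array([5, 45])
n = left.sum() + right.sum()
weighted = (left.sum() / n) * gini(left) + (right.sum() / n) * gini(right)
print(weighted)  # 0.25 -- the p_i involved are ratios of modest integer counts
```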
I have seen various efforts at informal guidance on this, ranging from "if you use one of the usual metrics, there won't be much difference" to much more specific recommendations. In reality, the only way to know with certainty which measure works best is to try all of the candidates (a sketch of doing exactly that follows the reference below).
Anyway, here is some perspective from Salford Systems (the CART vendor):
Do Splitting Rules Really Matter?
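In that spirit, here is a minimal sketch of "try all of the candidates" using scikit-learn, which exposes both the Gini and entropy criteria on DecisionTreeClassifier. The dataset and parameters are placeholders, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; substitute your own X, y.
X, y = load_iris(return_X_y=True)

# Cross-validate one tree per impurity criterion and compare mean accuracy.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(criterion, scores.mean())
```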