Decision Tree Learning and Impurity
There are three ways to measure impurity:
What are the differences and appropriate use cases for each method?
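For reference, here is a minimal sketch of the three measures as they are usually defined for decision trees (entropy, Gini index, and classification error - an assumption on my part, since the question's formulas are not reproduced here):

```python
import numpy as np

def entropy(p):
    """Shannon entropy: -sum(p_i * log2(p_i)); zero-probability classes contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini index: 1 - sum(p_i^2)."""
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    """Misclassification error: 1 - max(p_i)."""
    return 1.0 - np.max(p)

# Class proportions at a hypothetical node with a 60/30/10 class mix.
p = np.array([0.6, 0.3, 0.1])
print(entropy(p), gini(p), classification_error(p))  # ~1.295, 0.54, 0.4
```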
Comments (4)
If the p_i values are very small, then doing multiplication on very small numbers (Gini index) can lead to rounding error. Because of that, it is better to add the logs (entropy). Classification error, following your definition, provides only a gross estimate, since it uses the single largest p_i to compute its value.
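A quick numerical illustration of this point (a sketch only - the p_i used here is far smaller than any class frequency a real node would produce):

```python
import numpy as np

p = 1e-200         # an artificially tiny class probability
print(p ** 2)      # 0.0 -- squaring underflows past float64's smallest value (~1e-308)
print(np.log2(p))  # about -664.4 -- the logarithm stays comfortably representable
```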
The difference between entropy and other impurity measures, and in fact often the difference between information-theoretic approaches in machine learning and other approaches, is that entropy has been mathematically proven to capture the concept of 'information'. There are many characterization theorems (theorems that prove a particular function or mathematical object is the only one satisfying a set of criteria) for entropy measures, and these formalize philosophical arguments justifying their meaning as measures of 'information'.
Contrast this with other approaches (especially statistical methods) that are chosen not for their philosophical justification but primarily for their empirical justification - that is, they seem to perform well in experiments. The reason they perform well is that they contain additional assumptions that may happen to hold at the time of the experiment.
In practical terms this means entropy measures (A) can't over-fit when used properly, as they are free from any assumptions about the data, (B) are more likely to perform better than random because they generalize to any dataset, but (C) may not perform as well on a specific dataset as measures that adopt assumptions.
When deciding which measures to use in machine learning, it often comes down to long-term vs. short-term gains, and maintainability. Entropy measures often work long-term because of (A) and (B), and if something goes wrong it's easier to track down and explain why (e.g. a bug in obtaining the training data). Other approaches, by (C), might give short-term gains, but if they stop working it can be very hard to distinguish, say, a bug in infrastructure from a genuine change in the data where the assumptions no longer hold.
A classic example of models suddenly ceasing to work is the global financial crisis. Bankers were being given bonuses for short-term gains, so they wrote statistical models that would perform well short-term and largely ignored information-theoretic models.
I found this description of impurity measures quite useful. Unless you are implementing from scratch, most existing implementations use a single, predetermined impurity measure. Note also that the Gini index is not a direct measure of impurity, at least not in its original formulation, and that there are many more impurity measures than those you list above.
I'm not sure that I understand the concern about small numbers and the Gini impurity measure... I can't imagine how this would happen when splitting a node.
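To make that intuition concrete: at a node, each p_i is a ratio of integer sample counts, so values small enough to underflow would require an implausibly large node. A minimal sketch of scoring a candidate split with Gini impurity (hypothetical counts):

```python
import numpy as np

def gini(counts):
    """Gini impurity from per-class sample counts at a node."""
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical two-class node with 100 samples, split into two children.
left, right = np.array([40, 10]), np.array([5, 45])
n = left.sum() + right.sum()
weighted = (left.sum() / n) * gini(left) + (right.sum() / n) * gini(right)
print(weighted)  # 0.25 -- the p_i involved are ratios of modest integer counts
```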
I have seen various efforts at informal guidance on this, ranging from "if you use one of the usual metrics, there won't be much difference" to much more specific recommendations. In reality, the only way to know with certainty which measure works best is to try all of the candidates (a sketch of doing exactly that follows the reference below).
Anyway, here is some perspective from Salford Systems (the CART vendor):
Do Splitting Rules Really Matter?
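In that spirit, here is a minimal sketch of "try all of the candidates" using scikit-learn, which exposes both the Gini and entropy criteria on DecisionTreeClassifier. The dataset and parameters are placeholders, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; substitute your own X, y.
X, y = load_iris(return_X_y=True)

# Cross-validate one tree per impurity criterion and compare mean accuracy.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(criterion, scores.mean())
```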