Scikit-learn: MAE criterion using the mean (instead of the median)
Scikit-learn's absolute_error criterion for decision trees and random forests (i.e., the MAE Criterion class: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/tree/_criterion.pyx) scales poorly compared to the default squared_error criterion.
See discussion here: https://github.com/scikit-learn/scikit-learn/issues/9626
I'm working with a dataset that is too large to reasonably use MAE; however, I'd like to do a little experimentation with MAE, or at least an approximation of it, if possible. Reading about how MAE works, I understand that it's based on using the median of individual leaves rather than the mean, which is what causes it to scale poorly compared to MSE.
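That mean-vs-median distinction is easy to verify directly; a small sketch with made-up leaf values:

```python
# Within a single leaf, the mean minimizes the sum of squared errors,
# while the median minimizes the sum of absolute errors.
import numpy as np

leaf = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # skewed leaf values

def sse(c):
    """Sum of squared errors around a candidate prediction c."""
    return float(np.sum((leaf - c) ** 2))

def sae(c):
    """Sum of absolute errors around a candidate prediction c."""
    return float(np.sum(np.abs(leaf - c)))

mean, median = float(leaf.mean()), float(np.median(leaf))

assert sse(mean) < sse(median)   # mean wins on squared error
assert sae(median) < sae(mean)   # median wins on absolute error
print(mean, median)  # 3.6 2.0
```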
Based on an incredibly shallow understanding of how the decision tree training process works, I would assume that I might be able to modify the MSE Criterion class to get an approximation to MAE. Specifically, if MSE uses squared error, I would think that somewhere in there, I could just apply square roots to existing calculations to get absolute error.
For example, something like the following in the MSE class (see the first link):
for k in range(self.n_outputs):
    impurity_left[0] -= (self.sum_left[k] / self.weighted_n_left) ** 2.0
    impurity_right[0] -= (self.sum_right[k] / self.weighted_n_right) ** 2.0
might become:
for k in range(self.n_outputs):
    impurity_left[0] -= ((self.sum_left[k] / self.weighted_n_left) ** 2.0) ** 0.5
    impurity_right[0] -= ((self.sum_right[k] / self.weighted_n_right) ** 2.0) ** 0.5
However, all of my experimentation is leading to individual tree estimators that don't fit beyond one leaf and therefore predict the same value for all samples.
I'm just wondering whether this approach actually makes sense and, if so, what I would need to modify to make it work.
1 Answer
I think there are a few bits of confusion here. Firstly, with regard to the scaling of MAE vs. MSE: MSE can be computed in $O(n)$, whereas, according to the GitHub issue you linked, MAE in scikit-learn scales as $O(n^2)$. However, MAE could be implemented to scale as $O(n \log n)$ (for details, see the discussion about the PR that failed the unit tests in the link you provided). So if you want an efficient MAE, you may be able to implement one yourself and get that faster runtime.
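To illustrate the $O(n \log n)$ idea (this is a sketch of the algorithmic trick, not scikit-learn's actual Criterion code): the quantity a MAE criterion needs at each candidate split is the sum of absolute deviations from the median, and that can be maintained incrementally with two heaps plus running sums, at $O(\log n)$ cost per added sample:

```python
# Maintain sum(|y_i - median|) over a growing set of values using a
# max-heap for the lower half and a min-heap for the upper half, plus
# running sums of each half. The median is read off the lower heap.
import heapq

class RunningSAD:
    """Track the sum of absolute deviations from the median of a stream."""
    def __init__(self):
        self.lo = []        # max-heap (values negated) holding the lower half
        self.hi = []        # min-heap holding the upper half
        self.lo_sum = 0.0
        self.hi_sum = 0.0

    def add(self, x):
        if self.lo and x > -self.lo[0]:
            heapq.heappush(self.hi, x); self.hi_sum += x
        else:
            heapq.heappush(self.lo, -x); self.lo_sum += x
        # Rebalance so len(lo) == len(hi) or len(lo) == len(hi) + 1.
        if len(self.lo) > len(self.hi) + 1:
            v = -heapq.heappop(self.lo); self.lo_sum -= v
            heapq.heappush(self.hi, v); self.hi_sum += v
        elif len(self.hi) > len(self.lo):
            v = heapq.heappop(self.hi); self.hi_sum -= v
            heapq.heappush(self.lo, -v); self.lo_sum += v

    def sad(self):
        med = -self.lo[0]   # (lower) median of everything seen so far
        below = med * len(self.lo) - self.lo_sum
        above = self.hi_sum - med * len(self.hi)
        return below + above

r = RunningSAD()
for x in [4.0, 1.0, 7.0, 2.0, 9.0]:
    r.add(x)
print(r.sad())  # 13.0 == |1-4| + |2-4| + |4-4| + |7-4| + |9-4|
```

Scanning all $n$ candidate split points this way costs $O(n \log n)$ in total, versus recomputing the median from scratch at every split point.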
With regard to your attempt at modifying mean squared error, I think you are a bit misguided. The error is squared to account for the fact that the calculated difference could be positive or negative: error should increase monotonically as you stray from 0, and deviations of $x$ and $-x$ should count equally. As a result, unfortunately, you cannot just "take the square root"; the mean of squared errors is not the square of the mean absolute error, so the square root gives you something RMSE-like, not MAE. Note also that in the snippet you modified, the term being square-rooted is the squared mean, so your change merely replaces $(\text{mean})^2$ with $|\text{mean}|$ rather than producing anything like an absolute deviation. It will not work and can produce unintuitive results.
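A quick numeric check makes the point: the square root of the mean of squared errors is RMSE, which differs from the mean absolute error whenever the errors are not all equal in magnitude:

```python
# RMSE vs. MAE on the same error vector: taking a square root after
# squaring does not recover the absolute error.
import numpy as np

errors = np.array([1.0, -1.0, 4.0, -2.0])

mae = np.mean(np.abs(errors))          # (1 + 1 + 4 + 2) / 4 = 2.0
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt(22 / 4) ~= 2.345

print(mae, rmse)
assert rmse > mae   # RMSE >= MAE in general, with equality only
                    # when all |errors| are identical
```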
As a side note, hopefully you will find this valuable (from the linked issue in your question):