Scikit-learn: MAE criterion using the mean (instead of the median)
Scikit-learn's absolute_error criterion for decision trees and random forests (i.e., the MAE Criterion class: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/tree/_criterion.pyx) scales poorly compared to the default squared_error criterion.
See discussion here: https://github.com/scikit-learn/scikit-learn/issues/9626
I'm working with a dataset that is too large to reasonably use MAE; however, I'd like to do a little experimentation with MAE, or at least an approximation of it, if possible. Reading about how MAE works, I understand that it's based on using the median of individual leaves rather than the mean, which is what causes it to scale poorly compared to MSE.
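That mean-vs-median distinction is easy to verify directly; a small sketch with made-up leaf values:

```python
# Within a single leaf, the mean minimizes the sum of squared errors,
# while the median minimizes the sum of absolute errors.
import numpy as np

leaf = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # skewed leaf values

def sse(c):
    """Sum of squared errors around a candidate prediction c."""
    return float(np.sum((leaf - c) ** 2))

def sae(c):
    """Sum of absolute errors around a candidate prediction c."""
    return float(np.sum(np.abs(leaf - c)))

mean, median = float(leaf.mean()), float(np.median(leaf))

assert sse(mean) < sse(median)   # mean wins on squared error
assert sae(median) < sae(mean)   # median wins on absolute error
print(mean, median)  # 3.6 2.0
```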
Based on an incredibly shallow understanding of how the decision tree training process works, I would assume that I might be able to modify the MSE Criterion class to get an approximation to MAE. Specifically, if MSE uses squared error, I would think that somewhere in there, I could just apply square roots to existing calculations to get absolute error.
For example, something like the following in the MSE class (see the first link):
for k in range(self.n_outputs):
    impurity_left[0] -= (self.sum_left[k] / self.weighted_n_left) ** 2.0
    impurity_right[0] -= (self.sum_right[k] / self.weighted_n_right) ** 2.0
might become:
for k in range(self.n_outputs):
    impurity_left[0] -= ((self.sum_left[k] / self.weighted_n_left) ** 2.0) ** 0.5
    impurity_right[0] -= ((self.sum_right[k] / self.weighted_n_right) ** 2.0) ** 0.5
However, all of my experimentation is leading to individual tree estimators that don't fit beyond one leaf and therefore predict the same value for all samples.
I'm just wondering whether this approach actually makes sense and, if so, what I would need to modify to make it work.
1 Answer
I think there are a few bits of confusion here. Firstly, with regard to the scaling of MAE vs. MSE: MSE can be computed in $O(n)$, whereas, according to the GitHub issue you linked, MAE in scikit-learn scales as $O(n^2)$. However, MAE could be implemented to scale as $O(n \log n)$ (for details, see the discussion about the PR that failed the unit tests in the link you provided). So if you want an efficient MAE, you may be able to implement one yourself and get that faster runtime.
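To illustrate the $O(n \log n)$ idea (this is a sketch of the algorithmic trick, not scikit-learn's actual Criterion code): the quantity a MAE criterion needs at each candidate split is the sum of absolute deviations from the median, and that can be maintained incrementally with two heaps plus running sums, at $O(\log n)$ cost per added sample:

```python
# Maintain sum(|y_i - median|) over a growing set of values using a
# max-heap for the lower half and a min-heap for the upper half, plus
# running sums of each half. The median is read off the lower heap.
import heapq

class RunningSAD:
    """Track the sum of absolute deviations from the median of a stream."""
    def __init__(self):
        self.lo = []        # max-heap (values negated) holding the lower half
        self.hi = []        # min-heap holding the upper half
        self.lo_sum = 0.0
        self.hi_sum = 0.0

    def add(self, x):
        if self.lo and x > -self.lo[0]:
            heapq.heappush(self.hi, x); self.hi_sum += x
        else:
            heapq.heappush(self.lo, -x); self.lo_sum += x
        # Rebalance so len(lo) == len(hi) or len(lo) == len(hi) + 1.
        if len(self.lo) > len(self.hi) + 1:
            v = -heapq.heappop(self.lo); self.lo_sum -= v
            heapq.heappush(self.hi, v); self.hi_sum += v
        elif len(self.hi) > len(self.lo):
            v = heapq.heappop(self.hi); self.hi_sum -= v
            heapq.heappush(self.lo, -v); self.lo_sum += v

    def sad(self):
        med = -self.lo[0]   # (lower) median of everything seen so far
        below = med * len(self.lo) - self.lo_sum
        above = self.hi_sum - med * len(self.hi)
        return below + above

r = RunningSAD()
for x in [4.0, 1.0, 7.0, 2.0, 9.0]:
    r.add(x)
print(r.sad())  # 13.0 == |1-4| + |2-4| + |4-4| + |7-4| + |9-4|
```

Scanning all $n$ candidate split points this way costs $O(n \log n)$ in total, versus recomputing the median from scratch at every split point.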
With regard to your attempt at modifying mean squared error, I think you are a bit misguided. The error is squared to account for the fact that the calculated difference could be positive or negative: error should increase monotonically as you stray from 0, and deviations of $x$ and $-x$ should count equally. As a result, unfortunately, you cannot just "take the square root"; the mean of squared errors is not the square of the mean absolute error, so the square root gives you something RMSE-like, not MAE. Note also that in the snippet you modified, the term being square-rooted is the squared mean, so your change merely replaces $(\text{mean})^2$ with $|\text{mean}|$ rather than producing anything like an absolute deviation. It will not work and can produce unintuitive results.
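A quick numeric check makes the point: the square root of the mean of squared errors is RMSE, which differs from the mean absolute error whenever the errors are not all equal in magnitude:

```python
# RMSE vs. MAE on the same error vector: taking a square root after
# squaring does not recover the absolute error.
import numpy as np

errors = np.array([1.0, -1.0, 4.0, -2.0])

mae = np.mean(np.abs(errors))          # (1 + 1 + 4 + 2) / 4 = 2.0
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt(22 / 4) ~= 2.345

print(mae, rmse)
assert rmse > mae   # RMSE >= MAE in general, with equality only
                    # when all |errors| are identical
```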
As a side note, hopefully you will find this valuable (from the linked issue in your question):