Determining whether the difference between two error values is significant

Posted 2024-08-19 21:28:59

I'm evaluating a number of different algorithms whose job is to predict the probability of an event occurring.

I am testing the algorithms on large-ish datasets. I measure their effectiveness using "Root Mean Squared Error", which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating point value between 0 and 1) and the actual outcome (either 0.0 or 1.0).
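For concreteness, here is a minimal sketch of RMSE in its usual sense (square root of the mean of the squared errors). The function name `rmse` and the argument names are illustrative, not from the original post.

```python
import math

def rmse(preds, outcomes):
    """Root mean squared error between predicted probabilities and 0/1 outcomes."""
    # Per-event squared error: (predicted probability - actual outcome)^2
    squared_errors = [(p - y) ** 2 for p, y in zip(preds, outcomes)]
    # Mean of the squared errors, then the square root
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `rmse([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])` averages the squared errors 0.01, 0.01, 0.04 and 0.04 and takes the square root of that mean.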

So I know the RMSE, and also the number of samples that the algorithm was tested on.

The problem is that sometimes the RMSE values are quite close to each other, and I need a way to determine whether the difference between them is just chance, or if it represents an actual difference in performance.

Ideally, for a given pair of RMSE values, I'd like to know what the probability is that one is really better than the other, so that I can use this probability as a threshold of significance.

Comments (3)

国产ˉ祖宗 2024-08-26 21:28:59

The MSE is an average and hence the central limit theorem applies. So testing whether two MSEs are the same is the same as testing whether two means are equal. A difficulty compared to a standard test comparing two means is that your samples are correlated -- both come from the same events. But a difference in MSE is the same as a mean of differenced squared errors (means are linear). This suggests calculating a one-sample t-test as follows:

  1. For each x compute the error e for procedures 1 and 2.
  2. Compute differences of squared errors (e2^2-e1^2).
  3. Compute the mean of the differences.
  4. Compute the standard deviation of the differences.
  5. Compute a t-statistic as mean/(sd/sqrt(n)).
  6. Compare your t-statistic to a critical value or compute a p-value. For instance, reject equality at the 5% significance level if |t|>1.96 (the large-sample critical value).
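The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function and argument names are made up, both procedures are scored on the same events in the same order, and the p-value uses the normal approximation (consistent with the 1.96 cutoff), which is reasonable only for large n.

```python
import math
import statistics

def paired_squared_error_test(preds1, preds2, outcomes):
    """One-sample t-test on per-event differences of squared errors."""
    # Steps 1-2: squared error of each procedure per event, then e2^2 - e1^2
    diffs = [
        (p2 - y) ** 2 - (p1 - y) ** 2
        for p1, p2, y in zip(preds1, preds2, outcomes)
    ]
    n = len(diffs)
    # Step 3: mean of the differences (equals MSE2 - MSE1)
    mean_diff = statistics.fmean(diffs)
    # Step 4: sample standard deviation of the differences
    sd_diff = statistics.stdev(diffs)
    # Step 5: t-statistic
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    # Step 6: two-sided p-value, normal approximation (assumes large n)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return mean_diff, t_stat, p_value
```

A positive mean difference means procedure 2 has the larger MSE. Note that the sketch divides by zero if every difference is identical, so a real implementation would want to guard that case.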

The RMSE is a monotonic transformation of MSE so this test shouldn't give substantively different results. But be careful not to assume that MRSE is RMSE.

A bigger concern should be overfitting. Make sure to compute all your MSE statistics using data that you did not use to estimate your model.

遇见了你 2024-08-26 21:28:59

You are entering into a vast and contentious area of not only computation but philosophy. Significance tests and model selection are subjects of intense disagreement between the Bayesians and the Frequentists. Triston's comment about splitting the data-set into training and verification sets would not please a Bayesian.

May I suggest that RMSE is not an appropriate score for probabilities. If the samples are independent, the proper score is the sum of the logarithms of the probabilities assigned to the actual outcomes. (If they are not independent, you have a mess on your hands.) What I am describing is scoring a "plug-in" model. Proper Bayesian modeling requires integrating over the model parameters, which is computationally extremely difficult. A Bayesian way to regulate a plug-in model is to add a penalty to the score for unlikely (large) model parameters. That's been called "weight decay."
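As a rough sketch of the score described above (the sum of the logarithms of the probabilities assigned to the actual outcomes), assuming independent samples and 0/1 outcomes. The name `log_score` is hypothetical, and returning negative infinity when an observed outcome was given probability zero is my own choice for the illustration.

```python
import math

def log_score(preds, outcomes):
    """Sum of log-probabilities assigned to the outcomes that actually occurred."""
    total = 0.0
    for p, y in zip(preds, outcomes):
        # Probability the model assigned to what actually happened
        p_actual = p if y == 1 else 1.0 - p
        if p_actual <= 0.0:
            # The model ruled out something that happened: worst possible score
            return -math.inf
        total += math.log(p_actual)
    return total
```

Scores are never positive; a model that assigns probability 1 to every observed outcome scores exactly 0.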

I got started on my path of discovery reading Neural Networks for Pattern Recognition by Christopher Bishop. I used it and Practical Optimization by Gill, et al. to write software that has worked very well for me.

一抹淡然 2024-08-26 21:28:59

I am responding here to questions in the comments. The subject is far too big to handle in comments.

Cliff Notes version.

The types of scores we are talking about measure probabilities. (Whether that is appropriate for what you are doing is another question.) If you assume that the samples are independent, you get the "total" probability by simply multiplying all the probabilities together. But that usually results in absurdly small numbers, so equivalently, you add the logarithms of the probabilities. Bigger is better. Zero is perfect.
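A small illustration of why the logarithms are added rather than the probabilities multiplied. The data here is made up: 2000 events, each observed outcome having been assigned probability 0.5.

```python
import math

# Hypothetical: probability the model assigned to each of 2000 observed outcomes
probs = [0.5] * 2000

# Multiplying directly underflows double precision and collapses to 0.0
product = math.prod(probs)

# Adding the logarithms keeps a usable, finite score
log_sum = math.fsum(math.log(p) for p in probs)
```

Here `product` is exactly 0.0 while `log_sum` is a finite negative number, so two models can still be compared on the log scale.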

The ubiquitous negative squared error, -x^2, where x is the model's error, comes from the (frequently unjustified) assumption that the training data comprise observations (measurements) corrupted with "Gaussian noise." If you look in Wikipedia or something at the definition of a Gaussian (aka normal) distribution, you'll find that it contains the term e^(-x^2). Take the natural logarithm of that, and voila, -x^2. But your models do not produce most-likely "pre-noise" values for measurements. They produce probabilities directly. So the thing to do is simply to add the logarithms of the probabilities assigned to the observed events. Those observations are assumed to be noise-free. If the training data says it happened, it happened.
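Spelled out, the Gaussian step looks like this (a sketch; the noise standard deviation \sigma is a symbol I am introducing, and the e^(-x^2) above is the same term with the scale constants dropped):

```latex
p(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-x^2 / (2\sigma^2)}
\quad\Longrightarrow\quad
\ln p(x) = -\frac{x^2}{2\sigma^2} - \ln\!\left(\sigma\sqrt{2\pi}\right)
```

With \sigma fixed, the second term is a constant, so maximizing the summed log-probability under Gaussian noise is the same as minimizing the sum of squared errors.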

Your original question remains unanswered. How to tell if two models differ "significantly"? That is a vague and difficult question. It is the subject of much debate and even emotion and rancor. It's also not really the question you want answered. What you want to know is which model gives you the best expected profit, all things considered, including how much each software package costs, etc.

I'll have to break this off soon. This is not the place for a course on modeling and probability, and I am not really qualified as the professor.
