Correcting a known bias in collected data

Published 2024-07-17


Ok, so here is a problem analogous to my problem (I'll elaborate on the real problem below, but I think this analogy will be easier to understand).

I have a strange two-sided coin that only comes up heads (randomly) 1 in every 1,001 tosses (the remainder being tails). In other words, for every 1,000 tails I see, there will be 1 heads.

I have a peculiar disease where I only notice 1 in every 1,000 tails I see, but I notice every heads, so to me the rate of heads among the tosses I notice appears to be 0.5. Of course, I'm aware of this disease and its effect, so I can compensate for it.

Someone now gives me a new coin, and the rate of heads among the tosses I notice is now 0.6. Given that my disease hasn't changed (I still only notice 1 in every 1,000 tails), how do I calculate the actual ratio of heads to tails that this new coin produces?
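To make the analogy concrete, the apparent heads rate can be computed directly from the true heads ratio and the tail-noticing probability (the function name and parameters below are illustrative, not from the original post):

```python
def observed_rate(true_heads_ratio, tail_notice_prob=1 / 1000):
    """Apparent fraction of heads among the tosses actually noticed.

    Every heads is noticed; each tails is noticed only with probability
    tail_notice_prob, so the noticed tails shrink by that factor.
    """
    heads = true_heads_ratio
    noticed_tails = (1 - true_heads_ratio) * tail_notice_prob
    return heads / (heads + noticed_tails)

# The original coin: 1 heads per 1,000 tails appears as a rate of about 0.5.
print(observed_rate(1 / 1001))
```

This reproduces the setup above: a true heads ratio of 1/1001 looks like a 50/50 coin through the "disease".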


Ok, so what is the real problem? Well, I have a bunch of data consisting of inputs, and outputs which are 1s and 0s. I want to teach a supervised machine learning algorithm to predict the expected output (a float between 0 and 1) given an input. The problem is that the 1s are very rare, and this screws up the internal math because it becomes very susceptible to rounding errors - even with high-precision floating point math.

So, I normalize the data by randomly omitting most of the 0 training samples so that there appears to be a roughly equal ratio of 1s and 0s. Of course, this means that the machine learning algorithm's output is no longer predicting a probability, i.e. instead of predicting 0.001 as it should, it now predicts 0.5.
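The omission step itself is simple. A minimal sketch of this downsampling, with hypothetical sample/label names (not from the original post):

```python
import random

def downsample_negatives(samples, keep_prob=1 / 1000, seed=42):
    """Keep every positive (label 1) sample, and each negative (label 0)
    sample only with probability keep_prob.

    `samples` is a list of (input, label) pairs; the RNG is seeded so the
    resulting training set is reproducible.
    """
    rng = random.Random(seed)
    return [(x, y) for (x, y) in samples if y == 1 or rng.random() < keep_prob]
```

With positives occurring roughly once per 1,000 negatives, `keep_prob = 1/1000` leaves approximately equal counts of each class, which is what makes the model's output drift to 0.5 as described above.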

I need a way to convert the output of the machine learning algorithm back to a probability within the original training set.

Author's Note (2015-10-07): I later discovered that this technique is commonly known as "downsampling".


Comments (1)

神爱温柔 2024-07-24 00:07:38


You are calculating the following

calculatedRatio = heads / (heads + tails / 1000)

and you need

realRatio = heads / (heads + tails)

Solving both equations for tails yields the following equations.

tails = 1000 / calculatedRatio - 1000
tails = 1 / realRatio - 1

Combining both yields the following.

1000 / calculatedRatio - 1000 = 1 / realRatio - 1

And finally solving for realRatio.

realRatio = 1 / (1000 / calculatedRatio - 999)

Seems to be correct. calculatedRatio 0.5 yields realRatio 1/1001, 0.6 yields 3 / 2003.
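In code, the final formula, generalized to any downsampling factor (the function name is mine, not from the answer):

```python
def real_ratio(calculated_ratio, factor=1000):
    """Invert the bias from keeping only 1-in-`factor` tails/negatives.

    `calculated_ratio` is the heads rate observed on the downsampled data;
    the return value is the heads rate in the original, full data.
    """
    return 1.0 / (factor / calculated_ratio - (factor - 1))

print(real_ratio(0.5))  # 1/1001, approximately 0.000999
print(real_ratio(0.6))  # 3/2003, approximately 0.001498
```

This matches the worked checks in the answer: 0.5 maps back to 1/1001 and 0.6 to 3/2003.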
