Correcting a known bias in collected data
Ok, so here is a problem analogous to my problem (I'll elaborate on the real problem below, but I think this analogy will be easier to understand).
I have a strange two-sided coin that only comes up heads (randomly) 1 in every 1,001 tosses (the remainder being tails). In other words, for every 1,000 tails I see, there will be 1 heads.
I have a peculiar disease where I only notice 1 in every 1,000 tails I see, but I notice every heads, so among the tosses I actually notice, heads appear at a rate of 0.5. Of course, I'm aware of this disease and its effect, so I can compensate for it.
Someone now gives me a new coin, and among the tosses I notice, heads now appear at a rate of 0.6. Given that my disease hasn't changed (I still only notice 1 in every 1,000 tails), how do I calculate the actual ratio of heads to tails that this new coin produces?
Ok, so what is the real problem? Well, I have a bunch of data consisting of inputs and outputs, where the outputs are 1s and 0s. I want to teach a supervised machine learning algorithm to predict the expected output (a float between 0 and 1) given an input. The problem is that the 1s are very rare, and this screws up the internal math because it becomes very susceptible to rounding errors - even with high-precision floating point math.
So, I normalize the data by randomly omitting most of the 0 training samples so that there appears to be a roughly equal ratio of 1s and 0s. Of course, this means that the machine learning algorithm's output no longer predicts a probability, i.e. instead of predicting 0.001 as it should, it now predicts 0.5.
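As a sketch, that downsampling step might look like the following in Python (the function name, the synthetic data, and the keep rate of 1/1000 are illustrative assumptions, not from the original code):

```python
import random

def downsample_negatives(samples, keep_rate):
    """Keep every positive (output == 1) sample; keep each negative
    (output == 0) sample only with probability keep_rate."""
    return [(x, y) for (x, y) in samples
            if y == 1 or random.random() < keep_rate]

# Synthetic data: 1 positive per 1,000 negatives, as in the coin analogy.
random.seed(0)
raw = [(i, 1 if random.random() < 1 / 1001 else 0)
       for i in range(1_000_000)]

# Keeping negatives at a rate of 1/1000 roughly balances the classes.
balanced = downsample_negatives(raw, keep_rate=1 / 1000)
```

Every positive survives, while the negatives are thinned by a factor of about 1,000, so the trained model sees a roughly 50/50 mix.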
I need a way to convert the output of the machine learning algorithm back to a probability within the original training set.
Author's Note (2015-10-07): I later discovered that this technique is commonly known as "downsampling".
You are calculating the following

    calculatedRatio = heads / (heads + tails/1000)

and you need

    realRatio = heads / (heads + tails)

Solving both equations for tails yields the following equations.

    tails = 1000 * heads * (1 - calculatedRatio) / calculatedRatio
    tails = heads * (1 - realRatio) / realRatio

Combining both yields the following.

    1000 * (1 - calculatedRatio) / calculatedRatio = (1 - realRatio) / realRatio

And finally solving for realRatio:

    realRatio = calculatedRatio / (1000 - 999 * calculatedRatio)

Seems to be correct: calculatedRatio 0.5 yields realRatio 1/1001, and 0.6 yields 3/2003.
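The correction can be sketched in Python, with the downsampling factor of 1,000 generalized to a parameter (the function name and that generalization are mine, not from the answer):

```python
from fractions import Fraction

def real_ratio(calculated_ratio, downsample_factor=1000):
    """Convert the ratio observed after downsampling the tails/0s
    by `downsample_factor` back to the true ratio of heads/1s.

    Derived from:
        calculatedRatio = heads / (heads + tails/factor)
        realRatio       = heads / (heads + tails)
    which combine to realRatio = c / (f - (f - 1) * c).
    """
    c = Fraction(calculated_ratio).limit_denominator()
    f = downsample_factor
    return c / (f - (f - 1) * c)

print(real_ratio(0.5))  # 1/1001
print(real_ratio(0.6))  # 3/2003
```

Using `Fraction` keeps the result exact, which is convenient for checking it against the worked values above; with plain floats the formula is the same.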