Remove outliers from a correlation coefficient calculation

Suppose we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

cor(x, y)

How can I automatically consider only a subset of x and y (say, 90%) in the calculation, so as to maximize the correlation coefficient?
If you really want to do this (remove the largest (absolute) residuals), then we can employ the linear model to estimate the least squares solution and associated residuals and then select the middle n% of the data. Here is an example:
Firstly, generate some dummy data:
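The code blocks from the original answer were not preserved in this copy. A minimal sketch of dummy data with a linear relationship (the seed and coefficients here are assumptions, not the original author's values):

```r
set.seed(42)                 # assumed seed; any will do
x <- rnorm(100)              # predictor
y <- 2 * x + rnorm(100)      # response linearly related to x, plus noise
```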
Next, we fit the linear model and extract the residuals:
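A sketch of this step, using the same assumed dummy data as above:

```r
set.seed(42)                 # assumed dummy data, as above
x <- rnorm(100)
y <- 2 * x + rnorm(100)

mod <- lm(y ~ x)             # least-squares fit of y on x
resids <- resid(mod)         # one residual per observation
```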
The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the upper and lower 0.05 quantiles. Select those observations with residuals in the middle 90% of the data:
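A sketch of the selection, continuing from the assumed dummy data:

```r
set.seed(42)                 # assumed dummy data, as above
x <- rnorm(100)
y <- 2 * x + rnorm(100)
mod <- lm(y ~ x)
resids <- resid(mod)

q <- quantile(resids, probs = c(0.05, 0.95))   # lower and upper 5% cut-offs
keep <- resids >= q[1] & resids <= q[2]        # TRUE for the middle 90%
```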
We can then visualise this, with the red points being those we will retain:

<img src="https://i.sstatic.net/gaOp1.png" alt="Plot of the dummy data showing the selected points with the smallest residuals">
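A sketch of how such a plot could be produced (drawn to a temporary PDF so it runs non-interactively; the data are the assumed dummy data from above):

```r
set.seed(42)                 # assumed dummy data, as above
x <- rnorm(100)
y <- 2 * x + rnorm(100)
mod <- lm(y ~ x)
q <- quantile(resid(mod), probs = c(0.05, 0.95))
keep <- resid(mod) >= q[1] & resid(mod) <= q[2]

f <- tempfile(fileext = ".pdf")
pdf(f)                                           # off-screen device, runs headless
plot(x, y)                                       # all points
points(x[keep], y[keep], col = "red", pch = 19)  # the middle 90% we retain
abline(mod)                                      # the least-squares line
dev.off()
```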
The correlations for the full data and the selected subset are:
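A sketch of the comparison, again on the assumed dummy data:

```r
set.seed(42)                 # assumed dummy data, as above
x <- rnorm(100)
y <- 2 * x + rnorm(100)
mod <- lm(y ~ x)
q <- quantile(resid(mod), probs = c(0.05, 0.95))
keep <- resid(mod) >= q[1] & resid(mod) <= q[2]

cor(x, y)                    # full data
cor(x[keep], y[keep])        # middle 90% by residual; typically a bit higher
```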
Be aware that here we might be throwing out perfectly good data, because we just choose the 5% with the largest positive residuals and the 5% with the largest negative ones. An alternative is to select the 90% with the smallest absolute residuals:
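A sketch of the alternative selection, on the same assumed dummy data:

```r
set.seed(42)                 # assumed dummy data, as above
x <- rnorm(100)
y <- 2 * x + rnorm(100)
mod <- lm(y ~ x)
resids <- resid(mod)

# rank observations by |residual| and keep the smallest 90%
keep_abs <- rank(abs(resids)) <= 0.9 * length(resids)
cor(x[keep_abs], y[keep_abs])
```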
With this slightly different subset, the correlation is slightly lower:
Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of the strength of the outliers, and discard only those values above a certain threshold of Cook's distance. Wikipedia has info on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod; you can then compute the threshold(s) suggested on Wikipedia and remove only those observations that exceed the threshold. For these data:
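A sketch of the Cook's distance check, on the assumed dummy data:

```r
set.seed(42)                 # assumed dummy data, as above
x <- rnorm(100)
y <- 2 * x + rnorm(100)
mod <- lm(y ~ x)

cd <- cooks.distance(mod)    # one Cook's distance per observation
# One classic rule of thumb flags D_i > 1; another common rule flags D_i > 4/n
flagged <- cd > 1
sum(flagged)                 # none flagged for clean data like this
```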
none of the Cook's distances exceed the proposed thresholds (not surprising given the way I generated the data.)
Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy, and a bit like data dredging, to me.
Using method = "spearman" in cor will be robust to contamination and is easy to implement, since it only involves replacing cor(x, y) with cor(x, y, method = "spearman").

Repeating Prasad's analysis but using Spearman correlations instead, we find that the Spearman correlation is indeed robust to the contamination here, recovering the underlying zero correlation:
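A sketch of the comparison, using Prasad's contamination scheme (uncorrelated normal data plus an outlier at (500, 500); the seed and sample size are assumptions):

```r
set.seed(1)
x <- c(rnorm(100), 500)          # uncorrelated data contaminated by one outlier
y <- c(rnorm(100), 500)

cor(x, y)                        # Pearson: driven close to 1 by the outlier
cor(x, y, method = "spearman")   # Spearman: stays near the true zero correlation
```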
This may have been already obvious to the OP, but just to make sure... You have to be careful, because trying to maximize correlation may actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would first remove outliers, then calculate a correlation. More generally, we want to calculate a correlation that is robust to outliers (and there are many such methods in R).
Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:

Now let's add an outlier point (500, 500):

Now the correlation of any subset that includes the outlier point will be close to 100%, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular:
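A sketch of this illustration (the seed and sample size are assumptions):

```r
set.seed(123)
x <- rnorm(1000)
y <- rnorm(1000)
cor_before <- cor(x, y)          # essentially zero: the vectors are uncorrelated

x <- c(x, 500)                   # append the outlier point (500, 500)
y <- c(y, 500)
cor_after <- cor(x, y)           # now close to 1, driven entirely by one point
```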
If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package: you can play around with the parameters of covRob to decide how to trim the data.

UPDATE: There is also rlm (robust linear regression) in the MASS package.
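covRob lives in the robust package, which may not be installed everywhere; as a sketch of the same idea using tools that ship with R, MASS's rlm with an MM-estimator resists the leverage point that pulls ordinary least squares off course (the contamination scheme below is an assumption, following the answer's example):

```r
library(MASS)                        # ships with R; provides rlm()

set.seed(123)
x <- c(rnorm(100), 500)              # uncorrelated data plus the (500, 500) outlier
y <- c(rnorm(100), 500)

fit_ols <- lm(y ~ x)                 # OLS slope is pulled toward 1 by the outlier
fit_mm  <- rlm(y ~ x, method = "MM") # high-breakdown fit recovers the ~0 slope
coef(fit_ols)[2]
coef(fit_mm)[2]
```

Note that rlm's default M-estimation is not resistant to high-leverage points like this one; method = "MM" uses a high-breakdown initial estimate, which is why it is chosen here.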
Here's another possibility with the outliers captured. Using a similar scheme as Prasad:
In the other answers, 500 was stuck on the end of x and y as an outlier. That may or may not cause a memory problem on your machine, so I dropped it down to 4 to avoid that.
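The code for this answer did not survive in this copy. One way to capture the outlier under that scheme is a Mahalanobis-distance rule; this is an assumption for illustration, not necessarily the original author's method:

```r
set.seed(123)
x1 <- c(rnorm(100), 4)       # outlier at (4, 4) instead of (500, 500)
y1 <- c(rnorm(100), 4)
xy1 <- cbind(x1, y1)

# flag points far from the bulk of the data in the joint distribution
d2  <- mahalanobis(xy1, colMeans(xy1), cov(xy1))
out <- d2 > qchisq(0.99, df = 2)

cor(x1[!out], y1[!out])      # correlation with the flagged points removed
```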
Here are the images from the x1, y1, xy1 data:
You might try bootstrapping your data to find the highest correlation coefficient, e.g.:
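A sketch of the bootstrap in base R (the dummy x and y here are assumptions; the question's own vectors would go in their place):

```r
set.seed(1)
x <- rnorm(100)              # assumed dummy data
y <- 2 * x + rnorm(100)
n <- length(x)

# draw 1000 bootstrap resamples and record the correlation of each
boot.cor <- replicate(1000, {
  i <- sample(n, replace = TRUE)
  cor(x[i], y[i])
})
max(boot.cor)                # the largest correlation over the resamples
```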
And after that, run max(boot.cor). Do not be disappointed if all the correlation coefficients turn out to be the same :)