Removing outliers from a correlation coefficient calculation

Posted 2024-10-11 13:42:11

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

cor(x, y)

How can I automatically consider only a subset of x and y in the calculation (say 90%) as to maximize the correlation coefficient?

Comments (5)

木緿 2024-10-18 13:42:11

If you really want to do this (remove the largest (absolute) residuals), then we can employ the linear model to estimate the least squares solution and associated residuals and then select the middle n% of the data. Here is an example:

Firstly, generate some dummy data:

require(MASS) ## for mvrnorm()
set.seed(1)
dat <- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,1,0.8), ncol = 2))
dat <- data.frame(dat)
names(dat) <- c("X","Y")
plot(dat)

Next, we fit the linear model and extract the residuals:

res <- resid(mod <- lm(Y ~ X, data = dat))

The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the upper and lower 0.05 quantiles:

res.qt <- quantile(res, probs = c(0.05,0.95))

Select those observations with residuals in the middle 90% of the data:

want <- which(res >= res.qt[1] & res <= res.qt[2])

We can then visualise this, with the red points being those we will retain:

plot(dat, type = "n")
points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

[Figure: https://i.sstatic.net/gaOp1.png — the dummy data, with the selected smallest-residual points shown in red and the fitted line in blue]

The correlations for the full data and the selected subset are:

> cor(dat)
          X         Y
X 1.0000000 0.8935235
Y 0.8935235 1.0000000
> cor(dat[want,])
          X         Y
X 1.0000000 0.9272109
Y 0.9272109 1.0000000
> cor(dat[-want,])
         X        Y
X 1.000000 0.739972
Y 0.739972 1.000000

Be aware that here we might be throwing out perfectly good data, because we simply discard the 5% of points with the largest positive residuals and the 5% with the largest negative ones. An alternative is to select the 90% with the smallest absolute residuals:

ares <- abs(res)
absres.qt <- quantile(ares, prob = c(.9))
abswant <- which(ares <= absres.qt)
## plot - virtually the same, but not quite
plot(dat, type = "n")
points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

With this slightly different subset, the correlation is slightly lower:

> cor(dat[abswant,])
          X         Y
X 1.0000000 0.9272032
Y 0.9272032 1.0000000
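
If you find yourself doing this repeatedly, the absolute-residual version is easy to wrap up in a small helper. This is just a sketch of the steps above; the function name cor_trim and its default are my own:

cor_trim <- function(x, y, keep = 0.9) {
    ## keep the `keep` fraction of points with the smallest absolute
    ## residuals about the least-squares line, then correlate them
    res  <- resid(lm(y ~ x))
    want <- abs(res) <= quantile(abs(res), probs = keep)
    cor(x[want], y[want])
}

cor_trim(dat$X, dat$Y)  ## should reproduce the ~0.927 above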

Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of the strength of the outliers, and discard only those observations whose Cook's distance exceeds a certain threshold. Wikipedia has info on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod:

> head(cooks.distance(mod))
           1            2            3            4            5            6 
7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03

If you compute the threshold(s) suggested on Wikipedia, you can then remove only those observations that exceed them. For these data:

> any(cooks.distance(mod) > 1)
[1] FALSE
> any(cooks.distance(mod) > (4 * nrow(dat)))
[1] FALSE

none of the Cook's distances exceed the proposed thresholds (not surprising given the way I generated the data). Note, though, that the second comparison as written tests against 4 * n = 4000 and so is trivially FALSE; the rule of thumb usually quoted is 4/n, i.e. 4 / nrow(dat), a far stricter cut-off.
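
As a sketch of applying the 4/n rule of thumb instead, assuming the dat and mod objects from above (on real data this can easily flag a few percent of the points):

cd   <- cooks.distance(mod)
keep <- cd < 4 / nrow(dat)  ## the commonly quoted 4/n cut-off
cor(dat[keep, ])            ## correlation over the retained points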

Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy, and a bit like data dredging to me.

浅浅淡淡 2024-10-18 13:42:11

Using method = "spearman" in cor gives a correlation that is robust to contamination, and it is easy to implement since it only involves replacing cor(x, y) with cor(x, y, method = "spearman").

Repeating Prasad's analysis but using the Spearman correlation instead, we find that it is indeed robust to the contamination here, recovering the underlying zero correlation:

set.seed(1)

# x and y are uncorrelated
x <- rnorm(1000)
y <- rnorm(1000)
cor(x,y)
## [1] 0.006401211

# add contamination -- now cor says they are highly correlated
x <- c(x, 500)
y <- c(y, 500)
cor(x, y)
## [1] 0.995741

# but with method = "spearman" the contamination no longer dominates & they are shown to be uncorrelated
cor(x, y, method = "spearman")
## [1] -0.007270813

雪若未夕 2024-10-18 13:42:11

This may already have been obvious to the OP, but just to make sure... You have to be careful, because trying to maximize the correlation may actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would first remove the outliers, then calculate the correlation. More generally, we want to calculate a correlation that is robust to outliers (and there are many such methods in R).

Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:

set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
> cor(x,y)
[1] 0.006401211

Now let's add an outlier point (500,500):

x <- c(x, 500)
y <- c(y, 500)

Now the correlation of any subset that includes the outlier point will be close to 100%, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,

> cor(x,y)
[1] 0.995741
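
As a quick check, dropping just that single appended point (the last element of each vector here) recovers the original near-zero value:

> cor(x[-length(x)], y[-length(y)])
[1] 0.006401211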

If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:

require(robust)
> covRob(cbind(x,y), corr = TRUE)
Call:
covRob(data = cbind(x, y), corr = TRUE)

Robust Estimate of Correlation: 
            x           y
x  1.00000000 -0.02594260
y -0.02594260  1.00000000

You can play around with parameters of covRob to decide how to trim the data.
UPDATE: there is also rlm() (robust linear regression) in the MASS package.
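
A minimal sketch of that route, using the contaminated x and y from above: method = "MM" requests a high-breakdown fit that resists the leverage point, and the fitted object's w component holds the final robustness weights (zero for points the fit rejects). The w > 0 cut-off below is my own choice:

require(MASS)
rmod <- rlm(y ~ x, method = "MM")  ## high-breakdown robust regression
w <- rmod$w                        ## robustness weights from the IWLS fit
cor(x[w > 0], y[w > 0])            ## correlate only the points the fit trusts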

凉墨 2024-10-18 13:42:11

Here's another possibility, with the outliers captured. Using a scheme similar to Prasad's:

library(mvoutlier)    
set.seed(1)    
x <- rnorm(1000)    
y <- rnorm(1000)    
xy <- cbind(x, y)    
outliers <- aq.plot(xy, alpha=0.975) # the documentation/default says alpha=0.025; I think the function wants 0.975
cor.plot(x, y)    
color.plot(xy)   
dd.plot(xy)   
uni.plot(xy)    

In the other answers, 500 was stuck on the end of x and y as an outlier. That may or may not cause a memory problem with your machine, so I dropped it down to 4 to avoid that.

x1 <- c(x, 4)     
y1 <- c(y, 4)    
xy1 <- cbind(x1, y1)    
outliers1 <- aq.plot(xy1, alpha=0.975) # the documentation/default says alpha=0.025; I think the function wants 0.975
cor.plot(x1, y1)    
color.plot(xy1)    
dd.plot(xy1)    
uni.plot(xy1)    

Here are the images from the x1, y1, xy1 data:

[Figures: outlier diagnostic plots produced by the calls above for the x1, y1 and xy1 data]
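
The value aq.plot returns can then do the trimming. Assuming its outliers component is a logical vector marking the flagged rows (as the mvoutlier documentation describes), a sketch:

clean <- !outliers1$outliers  ## rows aq.plot did not flag
cor(x1[clean], y1[clean])     ## correlation without the flagged points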

浅暮の光 2024-10-18 13:42:11

You might try bootstrapping your data to find the highest correlation coefficient, e.g.:

x <- cars$dist
y <- cars$speed
percent <- 0.9         # given in the question above
n <- 1000              # number of resampling
boot.cor <- replicate(n, {tmp <- sample(length(x), round(length(x) * percent)); cor(x[tmp], y[tmp])})

Then take max(boot.cor). Do not be disappointed if the correlation coefficients all turn out much the same :)
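
To see how much the subsampled coefficients actually vary, and how extreme the best one is, a short follow-up in base R:

summary(boot.cor)  ## spread of the resampled correlations
hist(boot.cor)     ## their distribution
max(boot.cor)      ## the most favourable 90% subsample found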
