Finding Cook's distance on predicted values from an LM

Posted on 2025-01-12 21:31:46

Problem

I would like to use Cook's distance to identify outliers in my predicted data.

Background

I know it is easy to find the outliers in the original data used to build a linear model using cooks.distance() (illustrated in Example 1 below).

More Explanation of Problem

When I fit new data with that model (using predict()), I can't see how to get Cook's distance for the new points, since cooks.distance() only operates on a model object. I understand that it is calculated by a leave-one-out method that iteratively rebuilds the model, so perhaps it doesn't make sense to calculate it on fitted values, but I was hoping that I'm missing something simple about how one might approach this.
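To make the leave-one-out idea concrete, here is a minimal sketch (not part of the original reprex) that compares three ways of getting Cook's distance for the training rows: cooks.distance() itself, the closed form based on residuals and leverage, and an explicit refit without each row. The variable names d_closed and d_loo are made up for illustration.

```r
# Fit the same model as in the examples: mpg on disp, first 16 rows of mtcars
a <- mtcars[1:16, ]
m <- lm(mpg ~ disp, data = a)

p  <- length(coef(m))      # number of coefficients (2 here)
s2 <- summary(m)$sigma^2   # residual variance estimate
h  <- hatvalues(m)         # leverage of each training row
e  <- residuals(m)         # ordinary residuals

# Closed form: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
d_closed <- (e^2 / (p * s2)) * h / (1 - h)^2

# Leave-one-out definition: refit without row i, then compare the
# fitted values of both models on all training rows
d_loo <- sapply(seq_len(nrow(a)), function(i) {
  m_i <- lm(mpg ~ disp, data = a[-i, ])
  sum((fitted(m) - predict(m_i, newdata = a))^2) / (p * s2)
})

# Both comparisons should agree up to numerical tolerance
all.equal(unname(d_closed), unname(cooks.distance(m)))
all.equal(d_loo, unname(cooks.distance(m)))
```

The closed form suggests that no iterative refitting is needed for the training rows: each row's residual and leverage are enough. Both quantities depend on the row having been part of the fit, which is why the function takes a model object rather than new data.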

Desired Output

In Example 2 below I show the predicted values, where I'd like to highlight outliers by their Cook's D. Since I don't know how to do that, I just used their residuals to illustrate something close to my desired output.

Example 1

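# subset data
a <- mtcars[1:16, ]
# build model on the first half of the rows
m <- lm(mpg ~ disp, data = a)
# Cook's distance per training row (named cd to avoid masking base::c)
cd <- cooks.distance(m)
# rescale to [0, 1] so the values map onto palette indices 1..101
cd01 <- (cd - min(cd)) / (max(cd) - min(cd))
# visualize outliers with Cook's D
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
     plot(mpg ~ disp,
          col = pal[1 + round(100 * cd01)],
          pch = 19,
          main = "Color by Cook's D")); abline(m)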

Example 2

# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, newdata = mtcars)
b$resid <- b$mpg - b$pred_mpg
# rescale residuals to [0, 1] so the values map onto palette indices 1..101
b$resid01 <- (b$resid - min(b$resid)) / (max(b$resid) - min(b$resid))

# visualize outliers in full data by residuals
with(b,
     plot(mpg ~ disp,
          pch = 19,
          col = pal[1 + round(100 * resid01)],
          main = "Color by Residual")); abline(m)

Created on 2022-03-10 by the reprex package (v2.0.1)
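
One way to read the desired output, offered only as a sketch under stated assumptions, is to ask what Cook's distance each new row would have if it were added to the training data on its own. This requires the observed response (mpg) for the new rows, which is available here because they come from mtcars. The helper cooks_if_added() is hypothetical and not part of any package.

```r
a   <- mtcars[1:16, ]    # rows used to fit the model
new <- mtcars[17:32, ]   # rows not used to fit the model

# Hypothetical helper: Cook's D of one new row if it were appended
# to the training data and the model were refit
cooks_if_added <- function(new_row, train, formula) {
  aug <- rbind(train, new_row)
  fit <- lm(formula, data = aug)
  unname(cooks.distance(fit)[nrow(aug)])  # distance of the appended row
}

d_new <- sapply(seq_len(nrow(new)), function(i) {
  cooks_if_added(new[i, , drop = FALSE], a, mpg ~ disp)
})
names(d_new) <- rownames(new)

# New rows with the largest influence under this interpretation
head(sort(d_new, decreasing = TRUE), 3)
```

Each value is computed against a slightly different model (the training rows plus that one new row), so the result measures how much a single new point would move the fit, not how unusual its prediction is under the original model m. Whether that matches the intended notion of an outlier is a judgment call.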
