求 LM 预测值上的库克距离
问题
我想使用库克距离来识别预测数据中的异常值。
背景
我知道使用 cooks.distance()
在用于构建线性模型的原始数据中很容易找到异常值(如下面的示例 1 所示)。
问题的更多解释
当我将新数据与该模型相匹配(使用 predict()
)时,我看不到如何获取新点上的 Cook 距离,因为 cooks.distance()
仅对 model
对象进行操作。我知道它是通过留一法迭代重建模型来计算的,所以也许根据拟合值计算它没有意义,但我希望我错过了一些关于如何处理这个问题的简单信息。
期望的输出
在下面的示例 2 中,我显示了预测值,我想通过 Cook's D 突出显示异常值,但由于我不知道该怎么做,所以我只是使用它们的残差来说明接近我想要的输出。
示例 1
# subset data
a <- mtcars[1:16,]
# build model on one half
m <- lm(mpg ~ disp, a)
# find outliers
c <- cooks.distance(m)
# visualize outliers with cook's d
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
plot(mpg ~ disp,
col = pal[1 + round(100 * scale(c, min(c), max(c)))],
pch = 19,
main = "Color by Cook's D")); abline(m)
示例 2
# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, mtcars)
b$resid <- b$mpg - b$pred_mpg
# visualize outliers in full data by residuals
with(b,
plot(mpg ~ disp,
pch = 19,
col = pal[1 + round(100 * scale(resid, min(resid), max(resid)))],
main = "Color by Residual")); abline(m)
由 reprex 包于 2022 年 3 月 10 日创建 (v2.0.1)
Problem
I would like to use Cook's distance to identify outliers in my predicted data.
Background
I know it is easy to find the outliers in the original data used to build a linear model using cooks.distance()
(illustrated in Example 1 below).
More Explanation of Problem
When I fit new data with that model (using predict()
), I can't see how to get the Cook's distance on the new points since cooks.distance()
only operates on a model
object. I understand that it is calculated by a leave-one-out method iteratively rebuilding the model so perhaps it doesn't make sense to calculate it on fitted values but I was hoping that I'm missing something simple about how one might approach this.
Desired Output
In Example 2 below I show the predicted values where I'd like to highlight outliers in by their Cook's D, but since I didn't know how to do it I just used their residual to illustrate something close to my desired output.
Example 1
# subset data
a <- mtcars[1:16,]
# build model on one half
m <- lm(mpg ~ disp, a)
# find outliers
c <- cooks.distance(m)
# visualize outliers with cook's d
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
plot(mpg ~ disp,
col = pal[1 + round(100 * scale(c, min(c), max(c)))],
pch = 19,
main = "Color by Cook's D")); abline(m)
Example 2
# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, mtcars)
b$resid <- b$mpg - b$pred_mpg
# visualize outliers in full data by residuals
with(b,
plot(mpg ~ disp,
pch = 19,
col = pal[1 + round(100 * scale(resid, min(resid), max(resid)))],
main = "Color by Residual")); abline(m)
Created on 2022-03-10 by the reprex package (v2.0.1)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论