Calculating a correlation coefficient with missing values

Posted 2024-12-19 02:09:23

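I am looking for a way to compute some form of correlation coefficient in R (or in any common statistics package) whose value is influenced by missing values. I am not sure whether this is possible and am looking for a method. I do not want to impute data; instead, I want the correlation to be reduced, in some systematic fashion, according to the number of incomplete cases. The data are series of time points generated by different individuals, and the correlation coefficient is used to compute reliability. In many cases, one individual's data will include several more time points than the other's...

Again, I am not sure whether there is a standard procedure for handling this situation.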


情仇皆在手 2024-12-26 02:09:23

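One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship, that supports the assumption that the missing values carry no information. If that is your case, you do not need to impute anything and can simply perform the computation without the missing values. glm in R can be used for the logistic regression.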
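The missingness check described above could be sketched as follows. This is a minimal illustration on simulated data, not the answerer's own code: the variables x, y, and miss are hypothetical, and the values of y are removed completely at random purely for the sake of the example.

```r
# Hypothetical data: `x` is fully observed, `y` has missing values.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
# Assumption of this sketch: values of y go missing completely at random.
y[sample(n, 40)] <- NA

# Indicator: 1 if y is missing, 0 otherwise.
miss <- as.integer(is.na(y))

# Logistic regression of the missingness indicator on x.
fit <- glm(miss ~ x, family = binomial)
print(coef(summary(fit)))  # slope for x with its standard error and p-value

# If missingness looks unrelated to x, use the complete pairs only.
r <- cor(x, y, use = "complete.obs")
print(r)
```

A slope for x that is indistinguishable from zero is consistent with uninformative missingness, but it cannot rule out missingness that depends on the unobserved values of y themselves.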

On a different note, see the use="pairwise.complete.obs" argument of cor, which may or may not apply to you.

EDIT: I have revised this answer after re-reading the question.


不回头走下去 2024-12-26 02:09:23

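My feeling is that when one of the two time series shows NA for a data pair, that pair cannot be used for calculating the correlation, because there is no information at that point. Without information about that point, there is no way of knowing how it would have influenced the correlation. Specifying that an NA reduces the correlation seems tricky: had an observation been present at that point, it could just as easily have improved the correlation.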

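The default behavior of cor in R is to return NA for the correlation if an NA is present. This behavior can be changed with the use argument. See the documentation of that function for more details.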
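A short sketch of how the use argument changes the result; the vectors x and y are made-up values for illustration only.

```r
# Hypothetical vectors; y has one missing value.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, NA, 6, 5)

# Default (use = "everything"): any NA makes the result NA.
r_default <- cor(x, y)

# Drop incomplete pairs before computing the correlation.
r_complete <- cor(x, y, use = "complete.obs")
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# The same thing done by hand, keeping only pairs observed in both vectors.
ok <- !is.na(x) & !is.na(y)
r_manual <- cor(x[ok], y[ok])

print(c(default = r_default, complete = r_complete,
        pairwise = r_pairwise, manual = r_manual))
```

For two vectors, "complete.obs" and "pairwise.complete.obs" give the same value; they differ only when cor is applied to a matrix or data frame with more than two columns.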


笑红尘 2024-12-26 02:09:23

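As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders (raters), we may want coder B to say "NA" if and only if coder A says "NA" as well, and we also want their non-NA values to correlate.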

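Under these assumptions, a simple way to penalize non-matching missing values is to compute the correlation for complete cases and multiply it by the proportion of observations whose NA status matches. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB))). A simple illustration follows.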

myfun = function(x1, x2, idx_rm) {
  temp = x2
  # remove 'idx_rm' points from x2
  temp[idx_rm] = NA

  # calculate correlations
  r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
  r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
  penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
                       (!is.na(temp) & is.na(x1)))
  r_pen = round(r_NA * penalty, 2)

  # plot
  plot(x1, temp, main = paste('r_full =', r_full, 
                              '; r_NA =', r_NA,
                              '; r_pen =', r_pen),
       xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
  points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)

  regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
  regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
  abline(regr_full[1], regr_full[2])
  abline(regr_NA[1], regr_NA[2], lty = 2)
}

Run a simple simulation to illustrate the possible effects of missing values and penalization:

set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artificially inflate the correlation,
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))

# A case when NA's DEflate the correlation, 
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))

# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)

# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))
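For use outside of plotting, the same penalty could be wrapped in a small helper that only returns numbers. This is a sketch under the assumptions stated in this answer; penalized_cor and the coderA/coderB values are hypothetical names and data introduced for illustration.

```r
# Hypothetical helper: complete-case correlation scaled by the share of
# positions whose NA status matches between the two vectors.
penalized_cor <- function(a, b) {
  stopifnot(length(a) == length(b))
  # TRUE where exactly one of the two values is missing.
  mismatch <- xor(is.na(a), is.na(b))
  penalty <- 1 - mean(mismatch)
  r <- cor(a, b, use = "pairwise.complete.obs")
  c(r = r, penalty = penalty, r_pen = r * penalty)
}

# Made-up ratings: positions 2 and 7 are missing for only one coder,
# position 3 is missing for both (which counts as a match).
coderA <- c(1, 2, NA, 4, 5, 6, NA, 8)
coderB <- c(1.5, NA, NA, 3.5, 5.5, 6.5, 7, 9)

res <- penalized_cor(coderA, coderB)
print(res)
```

Here two of the eight positions disagree in NA status, so the penalty is 0.75. Note that multiplying by the penalty shrinks the correlation toward zero, so a negative correlation becomes less negative rather than smaller.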

