Calculating a correlation coefficient that accounts for missing values
I'm looking to calculate some form of correlation coefficient in R (or any common stats package actually) in which the value of the correlation is influenced by missing values. I am not sure if this is possible and am looking for a method. I do not want to impute data, but actually want the correlation to be reduced based on the number of incomplete cases included in some systematic fashion. The data are a series of time points generated by different individuals and the correlation coefficient is being used to compute reliability. In many cases, one individual's data will include several more time points than the other individual...
Again, not sure if there is any standard procedure for dealing with such a situation.
One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship then that provides support for assuming that the missing values won't provide any information. If that is your case then you won't have to impute anything and can just perform your computation without the missing values. `glm` in R can be used for logistic regression.

Also, on a different note, see the `use="pairwise.complete.obs"` argument to `cor`, which may or may not apply to you.

EDIT: I have revised this answer based on rereading the question.
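A minimal sketch of the missingness check suggested above, on simulated data. The variable names (`time_point`, `score`, `is_missing`) and the choice of time as the predictor are assumptions made for illustration, not part of the original answer.

```r
set.seed(1)

# Simulated series: 100 time points, 20 of the scores set to NA at random
n <- 100
time_point <- seq_len(n)
score <- rnorm(n)
score[sample(n, 20)] <- NA

# Indicator: 1 if the score is missing, 0 otherwise
is_missing <- as.integer(is.na(score))

# Logistic regression of missingness on a candidate predictor (here, time)
fit <- glm(is_missing ~ time_point, family = binomial)

# Coefficient table: estimate, standard error, z value, p value
summary(fit)$coefficients
```

A slope that is not distinguishable from zero is consistent with missingness being unrelated to the predictor, though it does not prove it. Other candidate predictors can be added to the formula in the same way.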
My feeling is that when one of the two time series shows NA at a given time point, that data pair cannot be used for calculating a correlation, as there is no information at that point. Without that information, there is no way to know how the pair would have influenced the correlation. Specifying that an NA reduces the correlation seems tricky: had an observation been present at that point, it could just as easily have improved the correlation.

The default behavior of `cor` in R is to return NA for the correlation if any NA is present. This behavior can be changed using the `use` argument. See the documentation of `cor` for more details.
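A small illustration of that default and of the `use` argument, on made-up vectors `x` and `y` (hypothetical data, chosen so that the complete pairs are exactly linear):

```r
x <- c(1, 2, 3, 4, 5, NA)
y <- c(2, 4, NA, 8, 10, 12)

# Default (use = "everything"): any NA in the inputs gives NA
r_default <- cor(x, y)

# Only the rows where both x and y are observed enter the calculation
r_complete <- cor(x, y, use = "complete.obs")

# For two plain vectors, pairwise deletion keeps the same rows
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

print(c(default = r_default, complete = r_complete, pairwise = r_pairwise))
```

With two vectors, `"complete.obs"` and `"pairwise.complete.obs"` give the same result. They differ when `cor` is given a matrix or data frame, where pairwise deletion is applied separately to each pair of columns.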
As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.
Under these assumptions, a simple way to penalize non-matching missing values is to compute the correlation for complete cases and multiply it by the proportion of observations that are matched in terms of their NA status. The penalty term can then be defined as `1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB)))`. A simple illustration follows.

Run a simple simulation to illustrate the possible effects of missing values and penalization:
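The following is a minimal sketch of such a simulation, written under the assumptions stated above rather than taken from the original answer. The helper name `penalized_cor`, the sample size, and the missingness pattern are all hypothetical. `xor(is.na(a), is.na(b))` is used as a shorter equivalent of the mismatch expression in the penalty term.

```r
# Hypothetical helper: complete-case correlation scaled by the share of
# observations whose NA status agrees between the two coders
penalized_cor <- function(a, b) {
  mismatch <- xor(is.na(a), is.na(b))   # TRUE where exactly one value is NA
  penalty <- 1 - mean(mismatch)         # proportion with matching NA status
  r <- cor(a, b, use = "complete.obs")  # correlation on complete pairs only
  c(raw = r, penalty = penalty, penalized = r * penalty)
}

set.seed(42)
n <- 200
coderA <- rnorm(n)
coderB <- coderA + rnorm(n, sd = 0.5)   # coder B agrees with A, plus noise

# 20 observations missing for both coders, 30 more missing only for coder B
shared <- 1:20
only_b <- 21:50
coderA[shared] <- NA
coderB[c(shared, only_b)] <- NA

res <- penalized_cor(coderA, coderB)
print(res)
```

Here 30 of the 200 observations have mismatched NA status, so the penalty term is 0.85 and the penalized value is the complete-case correlation scaled by that factor. Observations missing for both coders count as matches and are not penalized.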