How to handle multiple kinds of missingness in R?
Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness: it lets you assign a generic missing value (.) to missing data, but it also allows more specific kinds of missingness (.a, .b, .c, ..., .z). All the commands that look at missingness report on every missing entry, however it was specified, but you can still sort out the various kinds of missingness later on. This is particularly helpful when you believe that a refusal to respond has different implications for the imputation strategy than a question that was not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you are looking for, and it is not implemented in R. I don't know of a package that implements it, but it's not too difficult to code it yourself.
A workable way is to add a data frame containing the codes to the attributes. To avoid doubling the whole data frame, and to save space, I'd store the indices in that data frame instead of reconstructing a complete data frame.
For example:
This allows you to do:
The function can also be adjusted to add an extra attribute that gives you the label for the different values, see also this question. You could back-transform with:
This lets you change only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
In short: using attributes and indices might be a nice way of doing it.
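The code that went with this answer is not shown here, so what follows is only a rough sketch of the attribute-and-index idea, not the answerer's own functions. The names mark_missing, rows_with_code and restore_codes, the attribute name "na.codes", and the sample data are all hypothetical, and the sketch assumes every column is numeric.

```r
# Hypothetical helper: replace coded missing values with NA and keep a
# compact index (row, column, original code) in an attribute.
mark_missing <- function(df, codes) {
  m <- as.matrix(df)                       # assumes all columns are numeric
  hit <- matrix(m %in% codes, nrow = nrow(m))
  pos <- which(hit, arr.ind = TRUE)        # one row per coded cell
  index <- data.frame(row  = unname(pos[, "row"]),
                      col  = unname(pos[, "col"]),
                      code = m[pos])
  for (j in unique(index$col)) {
    df[index$row[index$col == j], j] <- NA
  }
  attr(df, "na.codes") <- index
  df
}

# Hypothetical helper: which rows of a column carried a given code?
rows_with_code <- function(df, code, column) {
  index <- attr(df, "na.codes")
  j <- match(column, names(df))
  index$row[index$col == j & index$code == code]
}

# Hypothetical helper: put the original codes back (all of them, or a subset).
restore_codes <- function(df, codes = NULL) {
  index <- attr(df, "na.codes")
  if (!is.null(codes)) index <- index[index$code %in% codes, ]
  for (k in seq_len(nrow(index))) {
    df[index$row[k], index$col[k]] <- index$code[k]
  }
  df
}

# Made-up survey data using the codebook from the question.
survey <- data.frame(age = c(34, -7, 51, -1), income = c(-5, 42, -9, 60))
marked <- mark_missing(survey, codes = c(-1, -5, -7, -9))

colMeans(marked, na.rm = TRUE)             # standard NA handling still works
rows_with_code(marked, code = -7, column = "age")
restore_codes(marked, codes = -7)          # only the refusals come back
```

One caveat with this design: custom attributes may not survive subsetting or merging, so the index would probably need to be rebuilt, or the object given its own class, if the data frame is reshaped afterwards.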
The most obvious way seems to be to use two vectors:
A data vector, in which every missing value is represented by NA. For example, c(2, 50, NA, NA)
A factor vector indicating the type of each entry. For example, factor(c(1, 1, -1, -7)), where the level 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
Update following questions from @gsk3:
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
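As a small illustration of the two-vector idea, here is a sketch that derives both vectors from one coded column. The variable names and the level labels are assumptions taken from the codebook in the question, not part of the answer itself.

```r
# Raw survey column, using the codebook from the question.
raw <- c(2, 50, -1, -7)
missing_codes <- c(-1, -5, -7, -9)
is_missing <- raw %in% missing_codes

# Vector 1: the data, with every kind of missingness collapsed to NA.
values <- raw
values[is_missing] <- NA

# Vector 2: a factor recording why each entry is (or is not) missing;
# 1 stands for "answered", as in the example above.
status <- factor(ifelse(is_missing, raw, 1),
                 levels = c(1, missing_codes),
                 labels = c("answered", "not asked", "do not know",
                            "refused", "module not asked"))

mean(values, na.rm = TRUE)   # the usual na.rm machinery still applies
table(status)                # breakdown by kind of missingness
which(status == "refused")   # positions that could be treated differently
```

The two vectors have to be kept in step by hand, for instance as two columns of the same data frame, which is the price of this approach.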
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to the several types of NA (putting the NAs in the margin) and do some diagnostics "manually". You should bear in mind that there are three types of missingness: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
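A minimal sketch of working with the coded values directly, as suggested above, might look like this. The data are made up, and -99 as the "global missing" code is purely an assumption for illustration.

```r
# Made-up responses using the codebook from the question; no NA anywhere.
responses <- c(12, -5, 40, -7, -1, 8)
missing_codes <- c(-1, -5, -7, -9)
is_coded_missing <- responses %in% missing_codes

# Analyse only the real answers by filtering explicitly.
mean(responses[!is_coded_missing])

# Inspect the kinds of missingness as ordinary data.
table(responses[is_coded_missing])

# Roll every specific code up to one global missing code (hypothetical: -99).
rolled_up <- replace(responses, is_coded_missing, -99)
```

The drawback is that every analysis has to remember the filter; forgetting it lets the negative codes flow into means and models as if they were real answers.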
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
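The example that followed "e.g." is not shown here. One way the attribute idea could look, with "na.type" as a hypothetical attribute name and made-up data, is:

```r
# Data vector with plain NA, so analyses stay clean.
x <- c(2, 50, NA, NA)

# Documentation only: the original code behind each NA (NA where answered).
attr(x, "na.type") <- c(NA, NA, -1, -7)

mean(x, na.rm = TRUE)          # the attribute does not interfere
attr(x, "na.type")[is.na(x)]   # look up why each value is missing
```

Note that subsetting a plain vector drops custom attributes, so this suits documentation better than analysis, which matches how it is described above.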
I'd like to add to the "statistical background component" here. "Statistical Analysis with Missing Data" is a very good read on this.