How to handle multiple kinds of missingness in R?

Posted on 2024-10-23 00:17:04

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

0-99 Data

-1 Question not asked

-5 Do not know

-7 Refused to respond

-9 Module not asked

Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.

I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.

心的位置 2024-10-30 00:17:04

I know what you are looking for, and it is not implemented in R. I don't know of a package that implements it, but it is not too difficult to code yourself.

A workable way is to attach a data frame containing the codes as an attribute. To avoid duplicating the whole data frame and to save space, I'd store only the indices of the coded cells in that attribute instead of reconstructing a complete data frame.

For example:

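NACode <- function(x,code){
    # Replace every coded value with NA, column by column (returns a matrix)
    Df <- sapply(x,function(i){
        i[i %in% code] <- NA
        i
    })

    # Linear (column-major) positions of the NAs in that matrix
    id <- which(is.na(Df))
    # Subtract 1 before the modulo so the last row maps to nrow(x), not 0
    rowid <- (id - 1L) %% nrow(x) + 1L
    colid <- (id - 1L) %/% nrow(x) + 1
    NAdf <- data.frame(
        id,rowid,colid,
        value = as.matrix(x)[id]
    )
    Df <- as.data.frame(Df)
    attr(Df,"NAcode") <- NAdf
    Df
}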

This allows the following:

> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA NA NA 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

The function can also be adjusted to add an extra attribute that gives the label for each code; see also this question. You can back-transform with:

ChangeNAToCode <- function(x,code){
    # Lookup table stored by NACode()
    NAval <- attr(x,"NAcode")
    # Restore only the entries whose original code was requested
    for(i in which(NAval$value %in% code)){
        x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]
    }
    x
}

> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA -2 -3 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

This lets you restore only the codes you want, if that is ever necessary. The function can be adapted to restore all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.

In short: using attributes and indices might be a nice way of doing it.
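One possible shape for the extraction function mentioned above is sketched here. The helper name rows_with_code() is hypothetical and not part of the answer; it only assumes the "NAcode" attribute layout that NACode() produces (columns id, rowid, colid, value). The attribute is built by hand so the block runs on its own.

```r
# Hypothetical helper (not from the original answer): return the rows of x
# in which at least one cell originally held one of the requested codes.
rows_with_code <- function(x, code) {
  na_info <- attr(x, "NAcode")
  hits <- na_info[na_info$value %in% code, , drop = FALSE]
  x[sort(unique(hits$rowid)), , drop = FALSE]
}

# Hand-built example mirroring the output of NACode() shown above.
df <- data.frame(A = 1:10, B = c(1:5, NA, NA, NA, 9, 10))
attr(df, "NAcode") <- data.frame(
  id = 16:18, rowid = 6:8, colid = c(2, 2, 2), value = c(-1, -2, -3)
)

# Rows where B was "Not Answered" (-2) or "Don't know" (-3)
refused <- rows_with_code(df, c(-2, -3))
print(refused)
```

The lookup is done on the stored attribute before any subsetting, so it does not rely on the attribute surviving later operations on the data frame.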

怀里藏娇 2024-10-30 00:17:04

The most obvious way seems to be to use two vectors:

  • Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
  • Vector 2: a factor indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.

Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
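A minimal sketch of this two-vector layout, using the codes from the question's codebook. The variable names (raw, value, status, survey) and the example data are made up for illustration.

```r
# Raw survey codes as they might arrive, following the codebook in the question.
raw <- c(2, 50, -1, -7, 13, -5)
missing_codes <- c(-1, -5, -7, -9)

# Vector 1: the data, with every coded missing value collapsed to NA.
value <- ifelse(raw %in% missing_codes, NA, raw)

# Vector 2: a factor recording why each entry is (or is not) missing.
status <- factor(
  ifelse(raw %in% missing_codes, raw, 1),
  levels = c(1, -1, -5, -7, -9),
  labels = c("answered", "not asked", "do not know", "refused",
             "module not asked")
)

survey <- data.frame(value, status)

mean(survey$value, na.rm = TRUE)      # standard NA handling still works
table(survey$status)                  # breakdown by kind of missingness
subset(survey, status != "refused")   # drop only the refusals
```

Keeping both vectors in one data frame means a single subset() call filters the data and its missingness type together.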

Update following questions from @gsk3:

  1. "Data storage will dramatically increase." The data storage will double. However, if doubling the size causes real problems, it may be worth thinking about other strategies.
  2. "Programs don't automatically deal with it." That's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies that you will have to do something bespoke. If you just want to analyse data where the NAs are "Question not asked", then use a data frame subset.
  3. "Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable." I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
  4. "There's no standard implementation, so my solution might differ from someone else's." True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.

I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.

几味少女 2024-10-30 00:17:04

This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to several types of NA (put NAs into the margin), and do some diagnostics "manually". You should bear in mind that there are three missingness mechanisms:

  • MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
  • MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
  • MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) depends on the unobserved values themselves and does not simplify to either form above.
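To make the three mechanisms concrete, here is a small simulation sketch. The variables x and y, the 30% missingness rate and the logistic link are arbitrary choices for illustration, not something from this answer.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)        # always observed
y <- x + rnorm(n)    # the variable that will have missing values

# TRUE marks an entry of y that goes missing under each mechanism.
mcar <- runif(n) < 0.3         # MCAR: ignores both x and y
mar  <- runif(n) < plogis(x)   # MAR: depends only on the observed x
mnar <- runif(n) < plogis(y)   # MNAR: depends on the value of y itself

# Mean of y among the entries that remain observed under each mechanism.
c(full = mean(y),
  mcar = mean(y[!mcar]),
  mar  = mean(y[!mar]),
  mnar = mean(y[!mnar]))
```

Under MNAR, large values of y are the ones most likely to go missing, so the observed mean is pulled downward and x alone cannot fully correct for it.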

IMHO this question is more suitable for CrossValidated.

But here's a link from SO that you may find useful:

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

旧人哭 2024-10-30 00:17:04

You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what is going into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.

-Ralph Winters
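A short sketch of what rolling up could look like, assuming the codes from the question's codebook. The vector names and example data are illustrative only.

```r
answers <- c(12, -1, 45, -7, -5, 30)
missing_codes <- c(-1, -5, -7, -9)

# Roll every coded missing value up to a single NA only at analysis time.
rolled_up <- replace(answers, answers %in% missing_codes, NA)
mean(rolled_up, na.rm = TRUE)   # 29

# The original codes stay available for more specific questions.
sum(answers == -7)              # number of refusals: 1
```

Because answers is never overwritten, each analysis can decide which codes count as missing.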

短叹 2024-10-30 00:17:04

I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.

> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1

That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.

Allan.
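The same idea can be extended from a single value to a whole vector by storing one code per element. This is only a sketch, with the caveat that base R drops custom attributes when a vector is subset. The attribute name na.type follows the example above; the data are made up.

```r
x <- c(3, NA, 8, NA)
# One entry per element; NA in the attribute means the data value is present.
attr(x, "na.type") <- c(NA, -1, NA, -7)

mean(x, na.rm = TRUE)              # 5.5; the analysis ignores the attribute
which(attr(x, "na.type") == -7)    # 4; position of the refusal

# Caveat: subsetting an atomic vector drops custom attributes.
attributes(x[1:2])                 # NULL
```

This keeps the documentation with the data while it is left intact, but any subsetting step has to carry the attribute along by hand.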

旧人九事 2024-10-30 00:17:04

I'd like to add to the "statistical background component" here. Statistical Analysis with Missing Data is a very good read on this.
