Handling missing/incomplete data in R: is there a function to mask but not remove NAs?

Posted 2024-08-28 12:58:01

As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well, for instance:

Many R functions have an na.rm flag that, when set to TRUE, removes the NAs before the computation:

> v <- mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm = TRUE)
> v   # mean of the non-NA values 5, 6, 12, 87, 9, 43, 67
[1] 32.71429

But if you want to deal with NAs before the function call, you need to do something like this:

to remove each NA from a vector:

vx = vx[!is.na(vx)]

to replace each NA in a vector with 0:

vx = ifelse(is.na(vx), 0, vx)

to remove every row that contains an NA from a data frame:

dfx = dfx[complete.cases(dfx),]

All of these functions permanently remove 'NA' or rows with an 'NA' in them.

Sometimes this isn't quite what you want, though. Making an NA-excised copy of the data frame might be necessary for the next step in the workflow, but in subsequent steps you often want those rows back (e.g., to calculate a column-wise statistic for a column that lost rows to a prior call to complete.cases() even though that column itself contains no NA values).
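As a rough sketch of that situation in base R: one option is to leave the data frame untouched and keep the result of complete.cases() as a logical mask, applying it only in the calls that need it. The data frame and column names below are made up for illustration.

```r
# Hypothetical data: column b has NAs, column a has none
dfx <- data.frame(a = c(1, 2, 3, 4), b = c(10, NA, 30, NA))

# Logical mask of complete rows; dfx itself is not modified
ok <- complete.cases(dfx)

# A step that needs complete rows applies the mask on the fly
cor_ab <- cor(dfx$a[ok], dfx$b[ok])

# A later step still sees every value of column a
mean_a_all <- mean(dfx$a)

# For comparison: the mean of a restricted to complete rows
mean_a_masked <- mean(dfx$a[ok])
```

Because the mask lives in a separate vector, nothing has to be restored afterwards; the cost is that every call has to apply it explicitly.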

To be as clear as possible about what I'm looking for: Python/NumPy has a class, the masked array, with a mask method, which lets you conceal (but not remove) NAs during a function call. Is there an analogous function in R?

Comments (3)

我一向站在原地 2024-09-04 12:58:01

Exactly what to do with missing data -- which may be flagged as NA if we know it is missing -- may well differ from domain to domain.

To take an example related to time series, where you may want to skip, fill, or interpolate (in one of several ways): the (very useful and popular) zoo package alone has all of these functions related to NA handling:

zoo::na.approx  zoo::na.locf    
zoo::na.spline  zoo::na.trim    

These let you approximate (using different algorithms), carry values forward or backward, use spline interpolation, or trim.
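A small sketch of those four functions on a plain numeric vector, assuming the zoo package is installed; the input values are invented for illustration.

```r
library(zoo)  # assumes zoo is installed from CRAN

x <- c(1, NA, 3, 4, NA, NA, 7)

filled_linear  <- na.approx(x)                 # linear interpolation
filled_forward <- na.locf(x)                   # last observation carried forward
filled_back    <- na.locf(x, fromLast = TRUE)  # next observation carried backward
filled_spline  <- na.spline(x)                 # cubic spline interpolation

# na.trim() drops NAs at the ends only and keeps interior ones
trimmed <- na.trim(c(NA, 1, NA, 3, NA))
```

Each call returns a new object, so the original x keeps its NAs and the choice of filling rule can differ from one step to the next.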

Another example would be the numerous missing-data imputation packages on CRAN, which often provide domain-specific solutions. [ So if you call R a DSL, what is this? "Sub-domain specific solutions for domain specific languages" or SDSSFDSL? Quite a mouthful :) ]

But for your specific question: no, I am not aware of a bit-level flag in base R that allows you to mark observations as 'to be excluded'. I presume most R users would resort to functions like na.omit() et al or use the na.rm=TRUE option you mentioned.

情绪失控 2024-09-04 12:58:01

It's good practice to look at the data and from that infer the type of missingness: is it MCAR (missing completely at random), MAR (missing at random) or MNAR (missing not at random)? Based on these three types, you can study the underlying structure of the missing values and conclude whether imputation is applicable at all (you're lucky if it's not MNAR, because in that case the missing values are considered non-ignorable and are related to some unknown underlying influence, factor, process, variable... whatever).
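A first look at the data can be as simple as the following base R sketch. It only counts where the NAs are and which combinations occur; it does not by itself decide between MCAR, MAR and MNAR. The data frame is hypothetical.

```r
# Hypothetical data frame with NAs in two of three columns
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(50, 62, NA, 80, NA),
  group  = c("a", "b", "a", "b", "a")
)

# Number of missing values per column
na_per_column <- colSums(is.na(df))

# Share of rows with at least one NA
share_incomplete <- mean(!complete.cases(df))

# Missingness patterns: each row becomes a string of 0/1 flags (1 = missing),
# and table() counts how often each pattern occurs
pattern <- apply(is.na(df), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- table(pattern)
```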

Chapter 3 of "Interactive and Dynamic Graphics for Data Analysis: With R and GGobi" by Di Cook and Deborah Swayne is a great reference on this topic.

You'll see the norm package in action in that chapter, and the Hmisc package also has data imputation routines. See also Amelia, cat (for imputing categorical missing values), mi, mitools, VIM, and vmv (for missing-data visualisation).

Honestly, I still don't quite understand whether your question is about statistics or about R's missing-data imputation capabilities. I reckon I've provided good references on the second. As for the first: you can replace your NAs with a measure of central tendency (mean, median, or similar), which reduces the variability; or with a random value "pulled out" of the observed (recorded) cases; or you can fit a regression with the variable that contains NAs as the criterion and other variables as predictors, then assign the predicted values to the NAs... it's an elegant way to deal with NAs, but quite often it will not go easy on your CPU (I have a 1.1GHz Celeron, so I have to be gentle).
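The three options can be sketched in base R roughly as follows. This is only an illustration on made-up data, not a recommendation of any one method, and the regression variant here assigns the model's predictions for the missing rows.

```r
# Hypothetical data: y has two missing values, x is fully observed
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(2.1, NA, 6.2, 7.9, NA, 12.1))
miss <- is.na(d$y)

# 1. Central tendency: replace NAs with the mean of the observed values
y_mean <- ifelse(miss, mean(d$y, na.rm = TRUE), d$y)

# 2. Random draw from the observed values
set.seed(1)  # only to make the sketch reproducible
y_draw <- d$y
y_draw[miss] <- sample(d$y[!miss], sum(miss), replace = TRUE)

# 3. Regression: lm() fits on the complete rows (default na.action),
#    then predict() fills in the rows where y is missing
fit <- lm(y ~ x, data = d)
y_reg <- d$y
y_reg[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
```

All three results are stored in new vectors, so d$y keeps its NAs and the imputed versions can be compared side by side.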

This is an optimization problem... there's no definite answer; you have to decide which method to stick with, and why. But it's always good practice to look at the data! =)
Be sure to check Cook & Swayne - it's an excellent, skilfully written guide. "Linear Models with R" by Faraway also contains a chapter about missing values.

So there.

Good luck! =)

橘香 2024-09-04 12:58:01

The function na.exclude() sounds like what you want, although it's only an option for some (important) functions.

In the context of fitting and working with models, R has a family of generic functions for dealing with NAs: na.fail(), na.pass(), na.omit(), and na.exclude(). These are, in turn, accepted as the na.action argument of some of R's key modeling functions, such as lm(), glm(), and nls(), as well as functions in the MASS, rpart, and survival packages.

All four generic functions basically act as filters. na.fail() will only pass the data through if there are no NAs; otherwise it fails. na.pass() passes all cases through. na.omit() and na.exclude() both leave out cases with NAs and pass the other cases through. But na.exclude() tags its result differently (the "na.action" attribute gets class "exclude" rather than "omit"), which tells functions processing the resulting object to take the NAs into account. You can see this attribute with attributes(na.exclude(some_data_frame)). Here's a demonstration of how na.exclude() alters the behavior of predict() in the context of a linear model.

fakedata <- data.frame(x = c(1, 2, 3, 4), y = c(0, 10, NA, 40))

## We can tell the modeling function how to handle the NAs
r_omitted <- lm(x~y, na.action="na.omit", data=fakedata) 
r_excluded <- lm(x~y, na.action="na.exclude", data=fakedata)

predict(r_omitted)
#        1        2        4 
# 1.115385 1.846154 4.038462 
predict(r_excluded)
#        1        2        3        4 
# 1.115385 1.846154       NA 4.038462 
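The same padding applies to residuals(), and the attribute mentioned above can be inspected directly. A short sketch reusing the fakedata example:

```r
fakedata <- data.frame(x = c(1, 2, 3, 4), y = c(0, 10, NA, 40))

# Both filters record the dropped rows in an "na.action" attribute;
# only the class of that attribute differs
class_omit    <- class(attr(na.omit(fakedata), "na.action"))     # "omit"
class_exclude <- class(attr(na.exclude(fakedata), "na.action"))  # "exclude"

r_omitted  <- lm(x ~ y, data = fakedata, na.action = na.omit)
r_excluded <- lm(x ~ y, data = fakedata, na.action = na.exclude)

# Under na.exclude the residuals are padded with NA, so they line up
# with the rows of the original data frame
n_res_omitted  <- length(residuals(r_omitted))   # 3
n_res_excluded <- length(residuals(r_excluded))  # 4
```

The coefficients of the two fits are identical; only the bookkeeping of the dropped row differs.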

Your default na.action, by the way, is determined by options("na.action") and starts out as na.omit, but you can change it.
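For instance, the session-wide default could be switched and later restored like this (a sketch; the data frame is the same toy example as above):

```r
# Inspect the current session-wide default
current <- getOption("na.action")

# Switch to na.exclude, keeping the old setting so it can be restored
old <- options(na.action = "na.exclude")

fit <- lm(x ~ y, data = data.frame(x = c(1, 2, 3, 4), y = c(0, 10, NA, 40)))
n_pred <- length(predict(fit))  # 4: the row with the NA comes back as NA

# Restore the previous default
options(old)
```

Changing the option affects every modeling call in the session that does not set na.action itself, so passing na.action per call is the more local alternative.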
