"Persuading" the tabulate function to count NAs in a data frame [R]
I'd like to ask another question. It is basically about data frames, NAs and the tabulate function in [R].
I have this data frame, which I already used in a previous question of mine. It is intentionally simple; my real 'df' data frame is much bigger, and I don't want to annoy anyone with a huge data set. So, my data:
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,4)
df <-data.frame(id,a,b,c,d,e)
df
I have managed to calculate the distributions of the numbers occurring in columns 'a' to 'e', grouped by the id numbers in column 'id'. It works fine, check it ->
matrix(matrix(unlist(lapply(df[,(-(1))],
function(x) tapply(x,df$id,tabulate,
nbins=nlevels(factor(df[,2])))) [[1]])),
ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,3])))) [[2]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4])))) [[3]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,5])))) [[4]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,6])))) [[5]])),ncol=4,nrow=3,byrow=TRUE)
Now my problem is: what if my data frame contains NA values here and there, and I want the built-in tabulate function to count these NAs as well, i.e. to report how many NAs occur in each group?
Here's my modified data frame with the NAs:
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,NA,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,NA,1,4)
df <-data.frame(id,a,b,c,d,e)
df
At first I tried something like this:
unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,2],exclude=NULL)))) [[1]])
You see, the only thing I did was to add exclude=NULL to the factor() call.
At least my code now recognizes that column 'a' has 4 different levels (1,2,3,NA) and not only three (1,2,3). Check it here:
nlevels(factor(df[,2], exclude=NULL))
But you see in the result that it still does not count the NAs. It says
3 0 6 0 4 3 3 0 4 1 5 0
Instead of the correct:
3 0 6 1 4 3 3 0 4 1 5 0
Or in case of:
unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4],exclude=NULL)))) [[3]])
It says
2 4 4 0 2 3 4 0 1 5 4 0
Instead of the correct
2 4 4 0 2 3 4 1 1 5 4 0
etc.
Does anyone have an idea how to "persuade" the tabulate function to count NAs? Is it possible at all?
Thanks very much and have a pleasant weekend,
Laszlo
Comments (2)
You can simplify your repeated calls to a single call that builds a list of tables, tabs, with one table per column. This gives almost the same result as your repeated matrix calls, e.g. for your first (non-NA) one.
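One way such a simplified call could look is sketched below. This is an assumption, not necessarily the exact code the answer used: it applies table() to every non-id column of the question's first (non-NA) data frame and cross-classifies the values by id. The name tabs follows the answer's wording; id is written with rep(), which yields the same values as the literal vector in the question.

```r
# Original (non-NA) data from the question
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,4)
df <- data.frame(id, a, b, c, d, e)

# One two-way table per column: rows are ids, columns are observed values
tabs <- lapply(df[, -1], function(x, id) table(id, x), id = df$id)

tabs$a  # counts of 1, 2, 3 in column 'a' for each id
tabs$e  # has a fourth column because 'e' also contains the value 4
```

Unlike the matrix() calls, each element of tabs keeps its row and column labels, so the ids and values are visible in the printed result.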
So can we now modify this to deal with NA? Yes, using the useNA argument of the table() function, here with useNA = "ifany" on your df with NA.
Because we ask for NA in the table only if an NA exists, not all the tables in tabs have the same number of columns. If that is important, we can change useNA = "ifany" to useNA = "always", and all the resulting tables will have the same number of columns; however, it adds another id row.
One final addition gets what we want: we use addNA() to add an NA level to each id's set of numbers, even if there are no NAs recorded.
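The three variants described above can be sketched as follows, using the question's data frame with NAs. The exact calls are an assumption built from the functions the answer names (table(), useNA, addNA()); the list names tabs_ifany and tabs_always are hypothetical and only serve to keep the variants apart.

```r
# Data frame with NAs from the question
id <- rep(1:3, each = 10)
a <- c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,NA,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,NA,1,4)
df <- data.frame(id, a, b, c, d, e)

# Variant 1: an NA column appears only in tables whose column contains NAs
tabs_ifany <- lapply(df[, -1],
                     function(x, id) table(id, x, useNA = "ifany"),
                     id = df$id)

# Variant 2: every table gets an NA column, but also an extra NA row for id
tabs_always <- lapply(df[, -1],
                      function(x, id) table(id, x, useNA = "always"),
                      id = df$id)

# Variant 3: addNA() adds the NA level to the values only, not to id
tabs <- lapply(df[, -1],
               function(x, id) table(id, addNA(x)),
               id = df$id)

tabs$a  # the row for id 1 reads 3 0 6 1, the result the question asks for
```

The third variant works because table() keeps an explicit NA level of a factor argument by default, so only the value dimension gains an NA column while id keeps its three rows.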
Can't you just use is.na? If you want to count up the number of entries that are NA or non-zero, you could use sum(is.na(my.var) | my.var > 0).
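Applied to the question's grouped counts, the is.na idea might look like the sketch below. It assumes column 'a' of the data frame with NAs; the helper count_with_na is hypothetical and simply appends the per-group NA count to the output of tabulate(), which ignores NAs on its own.

```r
# Column 'a' (with one NA) and the ids from the question
id <- rep(1:3, each = 10)
a <- c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)

# Hypothetical helper: counts of 1..nbins followed by the number of NAs
count_with_na <- function(x, nbins) {
  c(tabulate(x, nbins = nbins), sum(is.na(x)))
}

# One count vector per id, bound into a matrix with one row per id
res <- tapply(a, id, count_with_na, nbins = 3)
m <- do.call(rbind, res)
m  # first row: 3 0 6 1
```

This keeps tabulate() in place, so it stays close to the code in the question while still reporting the NAs in a final column.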