"Persuading" the tabulate function to count NAs in a data frame [R]

Posted on 2024-10-22 23:35:02

I'd like to ask another question. It is about data frames, NAs, and the tabulate function in R.

I have this data frame, which I already used in a previous question. It is intentionally simple; my real 'df' data frame is much bigger, and I don't want to burden anyone with a huge dataset. So, my data:

id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,4)
df <-data.frame(id,a,b,c,d,e)
df

I have managed to calculate the distributions of the numbers occurring in columns 'a' to 'e', grouped by the id numbers in column 'id'. It works fine, check it:

matrix(matrix(unlist(lapply(df[,(-(1))], 
       function(x) tapply(x,df$id,tabulate,
                          nbins=nlevels(factor(df[,2])))) [[1]])), 
              ncol=3,nrow=3,byrow=TRUE)

matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,3])))) [[2]])),ncol=3,nrow=3,byrow=TRUE)

matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4])))) [[3]])),ncol=3,nrow=3,byrow=TRUE)

matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,5])))) [[4]])),ncol=3,nrow=3,byrow=TRUE)

matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,6])))) [[5]])),ncol=4,nrow=3,byrow=TRUE)

Now my problem is: what if my data frame contains NA values here and there, and I want the built-in tabulate function to count these NAs as well, i.e. to report how many NAs occur in each group?

Here’s my modified data frame with the NAs:

id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,NA,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,NA,1,4)
df <-data.frame(id,a,b,c,d,e)
df

At first I tried something like this:

unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,2],exclude=NULL)))) [[1]])

The only change is that I added exclude=NULL to the factor() call.

At least this recognizes that column a has 4 distinct levels (1, 2, 3, NA) rather than only three (1, 2, 3). Check it here:

nlevels(factor(df[,2], exclude=NULL))

But the result shows that the NAs are still not counted. It gives

3  0  6  0  4  3  3  0  4  1  5  0

Instead of the correct:

3  0  6  1  4  3  3  0  4  1  5  0

Or in the case of:

unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4],exclude=NULL)))) [[3]])

It says

2  4  4  0  2  3  4  0  1  5  4  0

Instead of the correct

2  4  4  0  2  3  4  1  1  5  4  0

etc.
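
The behaviour can be reproduced on column a alone, without tapply. As far as I can tell from ?tabulate, NAs and values outside 1..nbins are silently ignored, which would explain the empty fourth bin. A minimal sketch (the names n_levels, counts and n_missing are only for illustration):

```r
# Column a from the modified data frame (one NA in the first position)
a <- c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)

# factor() with exclude = NULL treats NA as a level, so there are 4 levels
n_levels <- nlevels(factor(a, exclude = NULL))

# tabulate() gets 4 bins, but the NA is dropped, so bin 4 stays at 0
counts <- tabulate(a, nbins = n_levels)

# The single NA that never reached any bin
n_missing <- sum(is.na(a))

print(n_levels)
print(counts)
print(n_missing)
```

So enlarging nbins only adds an empty bin; it does not tell tabulate where to put the NAs.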

Does anyone have an idea how to "persuade" tabulate to count NAs? Is it possible at all?

Thanks very much and have a pleasant weekend,

Laszlo

禾厶谷欠 2024-10-29 23:35:02

You can simplify your repeated calls to:

tabs <-lapply(df[,2:6], function(x, id){ t(table(x, id)) }, df$id)

which gives almost the same as your repeated matrix calls, e.g. for your first (non-NA) one:

> tabs[[1]]
   x
id  1 2 3
  1 3 0 7
  2 4 3 3
  3 4 1 5

So can we now modify this to deal with NA? Yes, using the useNA argument of the table() function. Using your df with NA, we have:

tabs <-lapply(df[,2:6], 
              function(x, id){ t(table(x, id, useNA = "ifany")) }, df$id)

> tabs[[1]]
   x
id  1 2 3 <NA>
  1 3 0 6    1
  2 4 3 3    0
  3 4 1 5    0

Because we ask for NA in the table only if an NA exists, not every table in tabs has an <NA> column. If that is important, we can change useNA = "ifany" to useNA = "always", and every result table will then have an <NA> column. However, this also adds an <NA> row for id:
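
For completeness, that variant could be written as below. This is only a sketch: it rebuilds the df with NAs from the question (using rep() for id as a shorthand) so that it runs on its own.

```r
# Data frame with NAs, as in the question
id <- rep(1:3, each = 10)
a <- c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,NA,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,NA,1,4)
df <- data.frame(id, a, b, c, d, e)

# useNA = "always" adds an NA category to BOTH x and id,
# whether or not any NA is actually present
tabs <- lapply(df[, 2:6],
               function(x, id) t(table(x, id, useNA = "always")),
               df$id)
```
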

> tabs[[1]]
      x
id     1 2 3 <NA>
  1    3 0 6    1
  2    4 3 3    0
  3    4 1 5    0
  <NA> 0 0 0    0

One final addition gets what we want: we use addNA() to add an NA level to each column's values, even if that column contains no NAs, while id is left without an NA category:

tabs <-lapply(df[,2:6], 
              function(x, id){ t(table(addNA(x), id, useNA = "ifany")) }, df$id)

Which gives:

> tabs
$a

id  1 2 3 <NA>
  1 3 0 6    1
  2 4 3 3    0
  3 4 1 5    0

$b

id  1 2 3 <NA>
  1 8 1 1    0
  2 6 3 1    0
  3 2 4 4    0

$c

id  1 2 3 <NA>
  1 2 4 4    0
  2 2 3 4    1
  3 1 5 4    0

$d

id  1 2 3 <NA>
  1 2 3 5    0
  2 2 6 2    0
  3 5 3 2    0

$e

id  1 2 3 4 <NA>
  1 4 3 3 0    0
  2 4 2 4 0    0
  3 1 3 4 1    1
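
If you would rather stay with tabulate() and get the flat vector from the question, one possible workaround is to turn NA into a real factor level first, so that it has an integer code of its own. The sketch below does this for column a only; the helper count_with_na and its levels argument are hypothetical names introduced here, and the set of valid values (1 to 3) is assumed to be known in advance.

```r
id <- rep(1:3, each = 10)
a <- c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)

# Hypothetical helper: count the given levels plus NA as a final extra bin
count_with_na <- function(x, levels) {
  # addNA() appends NA as the last level, so NA entries get the highest code
  f <- addNA(factor(x, levels = levels), ifany = FALSE)
  # tabulate() on a factor counts its integer codes, including the NA level
  tabulate(f, nbins = nlevels(f))
}

# One count vector per id, then flattened as in the question
res <- tapply(a, id, count_with_na, levels = 1:3)
flat <- unlist(res, use.names = FALSE)

# Same numbers as a matrix: one row per id, last column holds the NA counts
m <- matrix(flat, nrow = 3, byrow = TRUE,
            dimnames = list(id = c("1", "2", "3"),
                            value = c("1", "2", "3", "<NA>")))
print(flat)
print(m)
```

Fixing the levels up front keeps every group at the same number of bins, which the matrix() call relies on; for column e the levels would have to be 1:4.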
