How to replace NA with mean by group / subset?

Posted on 2025-01-06 14:50:45, 2039 characters, 1 view, 0 comments

I have a dataframe with the lengths and widths of various arthropods from the guts of salamanders. Because some guts had thousands of certain prey items, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey. I want to keep the dataframe and just add imputed columns (length2, width2). The main reason is that each row also has columns with data on the date and location the salamander was collected. I could fill in the NA with a random selection of the measured individuals but for the sake of argument let's assume I just want to replace each NA with the mean.

For example imagine I have a dataframe that looks something like:

id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA

In reality I have more columns, about 25 different taxa, and ~30,000 prey items in total. It seems like the plyr package might be ideal for this, but I just can't figure out how to do it. I'm not very R or programming savvy, but I'm trying to learn.

Not that I know what I'm doing but I'll try to create a small dataset to play with if it helps.

exampleDF <- data.frame(id = seq(1:100), taxa = c(rep("collembola", 50), rep("mite", 25), 
rep("ant", 25)), length = c(rnorm(40, 1, 0.5), rep("NA", 10), rnorm(20, 0.8, 0.1), rep("NA", 
5), rnorm(20, 2.5, 0.5), rep("NA", 5)), width = c(rnorm(40, 0.5, 0.25), rep("NA", 10), 
rnorm(20, 0.3, 0.01), rep("NA", 5), rnorm(20, 1, 0.1), rep("NA", 5)))

Here are a few things I've tried (that haven't worked):

# mean imputation to recode NA in length and width with means
# (could do random imputation but unnecessary here)
mean.imp <- function(x) { 
  missing <- is.na(x) 
  n.missing <-sum(missing) 
  x.obs <-a[!missing] 
  imputed <- x 
  imputed[missing] <- mean(x.obs) 
  return (imputed) 
  } 

mean.imp(exampleDF[exampleDF$taxa == "collembola", "length"])

n.taxa <- length(unique(exampleDF$taxa))
for(i in 1:n.taxa) {
  mean.imp(exampleDF[exampleDF$taxa == unique(exampleDF$taxa[i]), "length"])
} # no way to get back into dataframe in proper places, try plyr?

Another attempt:

imp.mean <- function(x) {
  a <- mean(x, na.rm = TRUE)
  return (ifelse (is.na(x) == TRUE , a, x)) 
 } # tried but not sure how to use this in ddply

Diet2 <- ddply(exampleDF, .(taxa), transform, length2 = function(x) {
  a <- mean(exampleDF$length, na.rm = TRUE)
  return (ifelse (is.na(exampleDF$length) == TRUE , a, exampleDF$length)) 
  })

Any suggestions?

眼眸印温柔 2025-01-13 14:50:45

This isn't my own technique; I saw it on the boards a while back:

dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header=TRUE)


library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
     width = impute.mean(width))

dat2[order(dat2$id), ] #plyr orders by group so we have to reorder

Edit: a non-plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) {
    for (j in which(is.na(dat[, i]))) {
        dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i],  na.rm = TRUE)
    }
}

Edit, many moons later: here are data.table and dplyr approaches:

data.table

library(data.table)
setDT(dat)

dat[, length := impute.mean(length), by = taxa][,
    width := impute.mean(width), by = taxa]

dplyr

library(dplyr)

dat %>%
    group_by(taxa) %>%
    mutate(
        length = impute.mean(length),
        width = impute.mean(width)  
    )
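
The question asked to keep the measured columns and add imputed ones (length2, width2) rather than overwrite them. A minimal base R sketch of that, assuming the same impute.mean helper and the six-row dat from this answer (both re-created here so the block runs on its own). ave() applies the helper within each taxa group and returns the values in the original row order, so no reordering step is needed:

```r
dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header = TRUE)

# Same helper as in the answer: replace NAs with the mean of the observed values.
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# New columns hold the imputed values; length and width keep their NAs.
dat$length2 <- ave(dat$length, dat$taxa, FUN = impute.mean)
dat$width2  <- ave(dat$width,  dat$taxa, FUN = impute.mean)

dat
```

Because the originals are left untouched, is.na(dat$length) can still be used afterwards to tell measured rows from imputed ones.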

玻璃人 2025-01-13 14:50:45

Several other options:

1) With data.table's new nafill() function

library(data.table)
setDT(dat)

cols <- c("length", "width")

dat[, (cols) := lapply(.SD, function(x) nafill(x, type = "const", fill = mean(x, na.rm = TRUE)))
    , by = taxa
    , .SDcols = cols][]

2) With zoo's na.aggregate() function

library(zoo)
library(data.table)
setDT(dat)

cols <- c("length", "width")

dat[, (cols) := lapply(.SD, na.aggregate)
    , by = taxa
    , .SDcols = cols][]

The default function used by na.aggregate is mean; to use another function, specify it with the FUN parameter (for example, FUN = median). See also the help file via ?na.aggregate.

Of course you can also use this in the tidyverse:

library(dplyr)
library(zoo)

dat %>% 
  group_by(taxa) %>% 
  mutate_at(cols, na.aggregate)
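
For readers without zoo installed, the effect of swapping the summary function can be sketched in base R. Note that impute_by() below is a hypothetical helper written for this illustration (it is not part of zoo, data.table, or dplyr), and the toy vectors are made up; its fun argument plays the role that FUN plays in na.aggregate:

```r
# Hypothetical helper: fill NAs in x with a per-group summary of the observed values.
impute_by <- function(x, group, fun = mean) {
  ave(x, group, FUN = function(v) replace(v, is.na(v), fun(v[!is.na(v)])))
}

# Toy data (assumed for illustration): group "a" is skewed by the value 9.
len  <- c(1, 2, 9, NA, 4, 6, NA)
taxa <- c("a", "a", "a", "a", "b", "b", "b")

by_mean   <- impute_by(len, taxa)                # NAs become the group mean
by_median <- impute_by(len, taxa, fun = median)  # NAs become the group median

rbind(by_mean, by_median)
```

In the skewed group the two choices differ (4 versus 2), which is the kind of case where the median may be the more sensible fill value.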

独自←快乐 2025-01-13 14:50:45

Before answering, I want to say that I am a beginner in R, so please let me know if you think my answer is wrong.

Code:

DF[is.na(DF$length), "length"] <- mean(na.omit(DF$length))

and apply the same for width.

DF stands for the name of the data frame.

Thanks,
Parthi
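
One thing to be aware of with the one-liner in this answer: it fills every NA with the overall column mean, whereas the question asks for the mean of each taxon. A small base R sketch of the difference, using the length values from the question's six-row example (the object names overall and by_group are just for illustration):

```r
DF <- data.frame(
  taxa   = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA)
)

# Overall mean: every NA gets the same value, regardless of taxa.
overall <- DF$length
overall[is.na(overall)] <- mean(DF$length, na.rm = TRUE)

# Group-wise mean: each NA gets the mean of its own taxa.
by_group <- ave(DF$length, DF$taxa,
                FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

data.frame(DF, overall, by_group)
```

Here both missing rows receive 1.4 under the overall mean, while the group-wise version gives 1.8 for the collembola row and 1.0 for the mite row.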

陌伤浅笑 2025-01-13 14:50:45

`R-base`

Another base R approach, relying on vapply() + ave().

Class Coercion

> vapply(X = exampleDF, FUN = class, FUN.VALUE = "integer")
         id        taxa      length       width 
  "integer" "character" "character" "character"

As the columns on which mean imputation should be performed are of class "character", we coerce them to numeric beforehand:

exampleDF[, c("length", "width")] <- 
  lapply(exampleDF[, c("length", "width")], as.numeric)
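
Why the coercion matters can be seen on a tiny vector. The sketch below uses a made-up three-element vector, not the OP's data: because the example data was built with rep("NA", n), the missing entries are the string "NA" rather than a real missing value, so is.na() does not detect them until the column has been converted:

```r
x <- c("1.5", "NA", "0.8")   # "NA" is an ordinary string here, as produced by rep("NA", n)

is.na(x)                     # FALSE FALSE FALSE: the string "NA" is not a missing value

num <- as.numeric(x)         # coercion turns the string "NA" into a real NA
is.na(num)                   # FALSE TRUE FALSE

mean(num, na.rm = TRUE)      # the mean of the observed values can now be computed
```

Building the example with NA (no quotes) instead of "NA" would avoid the character columns in the first place.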

Approach

# exampleDF[, c("length", "width")] <- 
vapply(X = exampleDF[, c("length", "width")], 
       FUN = \(x) {
         ave(x = x, 
             exampleDF[, "taxa"], # grouping 
             FUN = \(y) {
               y[is.na(y)] <- mean(y, na.rm = TRUE) 
               y 
               }
             )
       },
       FUN.VALUE = numeric(length = nrow(exampleDF))
       )

OP's data example

exampleDF <- 
  data.frame(id = seq(1:100), 
             taxa = c(rep("collembola", 50), rep("mite", 25), rep("ant", 25)), 
             length = c(rnorm(40, 1, 0.5), rep("NA", 10), 
                        rnorm(20, 0.8, 0.1), rep("NA", 5), 
                        rnorm(20, 2.5, 0.5), rep("NA", 5)), 
             width = c(rnorm(40, 0.5, 0.25), rep("NA", 10), 
                       rnorm(20, 0.3, 0.01), rep("NA", 5), 
                       rnorm(20, 1, 0.1), rep("NA", 5)))

Wrapped in a small function:

# In contrast to aggregate() and ave(), 
# this wrapper does not use non-standard evaluation 
impute = \(x, by, data) {
  if (!is.data.frame(data) || !all(vapply(data[x], is.numeric, logical(1))))
    stop("`data` must be a data frame and all columns in `x` must be numeric")
  data[x] =
    vapply(data[x], \(i) {
      ave(x = i, data[by], # grouping 
          FUN = \(y) { y[is.na(y)] = mean(y, na.rm = TRUE); y })},
      FUN.VALUE = numeric(length = nrow(data))
    )
  return(data)
}

and applied to the data from @TylerRinker's answer:

> impute(x = c("length", "width"), by = "taxa", data = dat)
   id       taxa length width
1 101 collembola    2.1  0.90
2 102       mite    0.9  0.70
3 103       mite    1.1  0.80
4 104 collembola    1.8  0.70
5 105 collembola    1.5  0.50
6 106       mite    1.0  0.75

Data from @TylerRinker's answer:

dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header = TRUE)
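
One edge case worth keeping in mind with mean-based helpers like these: if a group has no measured values at all, mean(x, na.rm = TRUE) returns NaN, so that group's NAs would be replaced by NaN. The sketch below shows one possible way to handle it, leaving such groups as NA; impute_mean_safe() is a hypothetical name and the toy vectors are made up for illustration:

```r
# Hypothetical variant: only impute when the group has at least one observed value.
impute_mean_safe <- function(x) {
  if (all(is.na(x))) return(x)   # nothing observed: leave the group as NA
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

# Toy data (assumed): the "ant" group has no measurements at all.
len  <- c(2.1, NA, 1.5, NA, NA)
taxa <- c("collembola", "collembola", "collembola", "ant", "ant")

all_missing_mean <- mean(c(NA_real_, NA_real_), na.rm = TRUE)  # NaN
res <- ave(len, taxa, FUN = impute_mean_safe)

res
```

Whether leaving NA, falling back to an overall mean, or dropping the group is appropriate depends on the analysis; this only shows where the decision would go.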

猫性小仙女 2025-01-13 14:50:45

Expanding on @Tyler Rinker's solution, suppose features holds the names of the columns to impute; in this case, features <- c('length', 'width'). Using data.table and the impute.mean helper from that answer, the solution becomes:

library(data.table)
setDT(dat)

dat[, (features) := lapply(.SD, impute.mean), by = taxa, .SDcols = features]

盗琴音 2025-01-13 14:50:45

I ran into a similar problem, and here is a very simple way to fill in the group-wise mean for your columns with mutate().

library(dplyr)  # group_by(), mutate(), and %>% come from dplyr, not tidyr

dataset <- dataset %>%
  group_by(taxa) %>%
  mutate(length1 = ifelse(is.na(length), mean(length, na.rm = TRUE), length))

View(dataset)

Let me know if I can be of any further help.

About the author: 拥抱没勇气
