替换 R 中缺失值的均值或众数

发布于 2024-12-09 07:35:06 字数 1292 浏览 3 评论 0原文

我有一个由混合数据类型（数字、字符、因子、序数因子）缺少值，而我是尝试创建一个 for 循环来替换缺失的值如果是数值，则使用相应列的平均值；如果是字符/因子，则使用众数。

这就是我到目前为止所拥有的：

#fake array:
age<- c(5,8,10,12,NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]) <- mean(df_test[,var], na.rm = TRUE)
} else if (class(df_test[,var]=="character") {
        Mode(df_test$var[is.na(df_test$var)], na.rm = TRUE)
} 
}

其中“Mode”是函数：

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1)
        xmode <- ">1 mode"
    return(xmode)
}

看起来它只是忽略了这些语句，而不给出任何错误... 我还尝试使用索引来处理第一部分：

## create an index of missing values
index <- which(is.na(df_test)[,1], arr.ind = TRUE)
## calculate the row means and "duplicate" them to assign to appropriate cells
df_test[index] <- colMeans(df_test, na.rm = TRUE) [index["column",]]

但我收到此错误：“colMeans(df_test, na.rm = TRUE) 中的错误：‘x’必须是数字”

有人知道如何解决这个问题吗？

非常感谢大家的大力帮助！ -f

原文

I have a large database made up of mixed data types (numeric,
character, factor, ordinal factor) with missing values, and I am
trying to create a for loop to substitute the missing values
using either the mean of the respective column if numerical or the mode if character/factor.

This is what I have until now:

#fake array:
age<- c(5,8,10,12,NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]) <- mean(df_test[,var], na.rm = TRUE)
} else if (class(df_test[,var]=="character") {
        Mode(df_test$var[is.na(df_test$var)], na.rm = TRUE)
} 
}

Where 'Mode' is the function:

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1)
        xmode <- ">1 mode"
    return(xmode)
}

It seems as it is just ignoring the statements though, without giving
any error…
I have also tried to work the first part out with indexes:

## create an index of missing values
index <- which(is.na(df_test)[,1], arr.ind = TRUE)
## calculate the row means and "duplicate" them to assign to appropriate cells
df_test[index] <- colMeans(df_test, na.rm = TRUE) [index["column",]]

But I get this error: "Error in colMeans(df_test, na.rm = TRUE) : 'x' must be numeric"

Does anybody have any idea how to solve this?

Thank you very much for all the great help!
-f

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

感情洁癖 2024-12-16 07:35:06

如果您只是删除明显的错误，那么它就会按预期工作：

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"
    return(xmode)
}

# fake array:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

print(df_test)

#   age    a      b
# 1   5   aa banana
# 2   8   bb  apple
# 3  10 <NA>   pear
# 4  12   cc  grape
# 5  NA   cc   <NA>

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]),var] <- mean(df_test[,var], na.rm = TRUE)
    } else if (class(df_test[,var]) %in% c("character", "factor")) {
        df_test[is.na(df_test[,var]),var] <- Mode(df_test[,var], na.rm = TRUE)
    }
}

print(df_test)

#     age  a       b
# 1  5.00 aa  banana
# 2  8.00 bb   apple
# 3 10.00 cc    pear
# 4 12.00 cc   grape
# 5  8.75 cc >1 mode

我建议您使用具有语法突出显示和括号匹配功能的编辑器，这将使您更容易找到此类语法错误。

If you simply remove the obvious bugs then it works as intended:

Mode <- function (x, na.rm) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"
    return(xmode)
}

# fake array:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)

print(df_test)

#   age    a      b
# 1   5   aa banana
# 2   8   bb  apple
# 3  10 <NA>   pear
# 4  12   cc  grape
# 5  NA   cc   <NA>

for (var in 1:ncol(df_test)) {
    if (class(df_test[,var])=="numeric") {
        df_test[is.na(df_test[,var]),var] <- mean(df_test[,var], na.rm = TRUE)
    } else if (class(df_test[,var]) %in% c("character", "factor")) {
        df_test[is.na(df_test[,var]),var] <- Mode(df_test[,var], na.rm = TRUE)
    }
}

print(df_test)

#     age  a       b
# 1  5.00 aa  banana
# 2  8.00 bb   apple
# 3 10.00 cc    pear
# 4 12.00 cc   grape
# 5  8.75 cc >1 mode

I recommend that you use an editor with syntax highlighting and bracket matching, which would make it easier to find these sorts of syntax errors.

回复收藏 0 原文

无语# 2024-12-16 07:35:06

首先，您需要编写众数函数，考虑到分类数据的缺失值，其长度<1。
众数函数：

getmode <- function(v){
  v=v[nchar(as.character(v))>0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

然后您可以迭代列，如果列是数字，则用均值填充缺失值，否则用众数填充缺失值。

下面的循环语句：

for (cols in colnames(df)) {
  if (cols %in% names(df[,sapply(df, is.numeric)])) {
    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm=TRUE)))

  }
  else {

    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols)=="", getmode(!!rlang::sym(cols))))

  }
}

让我们提供一个示例：

library(tidyverse)

df<-tibble(id=seq(1,10), ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA), 
           ColumnB=factor(c("A","B","A","A","","B","A","B","","A")),
           ColumnC=factor(c("","BB","CC","BB","BB","CC","AA","BB","","AA")),
           ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
           )

df

带有缺失值的初始 df：

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <int>   <dbl> <fct>   <fct>     <dbl>
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23

通过运行上面的 for 循环，我们得到：

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <dbl>   <dbl> <fct>   <fct>     <dbl>
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20  
 3     3     8   A       CC         18  
 4     4     7   A       BB         22  
 5     5    11.6 A       BB         18  
 6     6    11.6 B       CC         17  
 7     7    20   A       AA         19  
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17  
10    10    11.6 A       AA         23

正如我们所看到的，缺失值已被估算。您可以在此处查看示例

First, you need to write the mode function taking into consideration the missing values of the Categorical data, which are of length<1.
The mode function:

getmode <- function(v){
  v=v[nchar(as.character(v))>0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

Then you can iterate of columns and if the column is numeric to fill the missing values with the mean otherwise with the mode.

The loop statement below:

for (cols in colnames(df)) {
  if (cols %in% names(df[,sapply(df, is.numeric)])) {
    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm=TRUE)))

  }
  else {

    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols)=="", getmode(!!rlang::sym(cols))))

  }
}

Let's provide an example:

library(tidyverse)

df<-tibble(id=seq(1,10), ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA), 
           ColumnB=factor(c("A","B","A","A","","B","A","B","","A")),
           ColumnC=factor(c("","BB","CC","BB","BB","CC","AA","BB","","AA")),
           ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
           )

df

The initial df with the missing values:

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <int>   <dbl> <fct>   <fct>     <dbl>
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23

By running the for loop above, we get:

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <dbl>   <dbl> <fct>   <fct>     <dbl>
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20  
 3     3     8   A       CC         18  
 4     4     7   A       BB         22  
 5     5    11.6 A       BB         18  
 6     6    11.6 B       CC         17  
 7     7    20   A       AA         19  
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17  
10    10    11.6 A       AA         23

As we can see, the missing values have been imputed. You can see an example here

回复收藏 0 原文

~没有更多了~