Replace missing values with the mean or mode in R
I have a large data frame made up of mixed data types (numeric,
character, factor, ordered factor) with missing values. I am
trying to write a for loop that replaces each missing value with
the mean of its column if the column is numeric, or with the mode if it is character/factor.
This is what I have so far:
#fake array:
age <- c(5,8,10,12,NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)
for (var in 1:ncol(df_test)) {
  if (class(df_test[,var])=="numeric") {
    df_test[is.na(df_test[,var]) <- mean(df_test[,var], na.rm = TRUE)
  } else if (class(df_test[,var]=="character") {
    Mode(df_test$var[is.na(df_test$var)], na.rm = TRUE)
  }
}
Where 'Mode' is the function:
Mode <- function (x, na.rm) {
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  if (length(xmode) > 1)
    xmode <- ">1 mode"
  return(xmode)
}
It seems to be simply ignoring the statements, though, without giving
any error.
I have also tried to work out the first part with indexes:
## create an index of missing values
index <- which(is.na(df_test)[,1], arr.ind = TRUE)
## calculate the row means and "duplicate" them to assign to appropriate cells
df_test[index] <- colMeans(df_test, na.rm = TRUE) [index["column",]]
But I get this error: "Error in colMeans(df_test, na.rm = TRUE) : 'x' must be numeric"
Does anybody have any idea how to solve this?
Thank you very much for all the great help!
-f
Comments (2)
If you simply remove the obvious bugs then it works as intended:
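The answer's own code is not reproduced here, so the following is only a sketch of what a corrected loop might look like, not the answerer's exact code. It assumes the intent is to write the mean or mode back into the missing cells. The fixes are: the missing `]` and `)` are closed, the column index is added to the assignment, the mode branch actually assigns its result, and `df_test$var` (which looks for a column literally named `var`) is replaced by indexing with the loop variable. The use of `is.numeric()` and `is.factor()` in place of the `class()` comparison, and the default for `na.rm`, are my own choices.

```r
# Mode helper from the question. na.rm is unused (table() already drops NA
# by default); it gets a default value here so the argument is optional.
Mode <- function(x, na.rm = TRUE) {
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  if (length(xmode) > 1) xmode <- ">1 mode"
  xmode
}

age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
# stringsAsFactors = FALSE keeps b as character on every R version.
df_test <- data.frame(age = age, a = a, b = b, stringsAsFactors = FALSE)

for (var in seq_len(ncol(df_test))) {
  col <- df_test[[var]]
  if (is.numeric(col)) {
    # Rows where the column is NA, and only this column, get the mean.
    df_test[is.na(col), var] <- mean(col, na.rm = TRUE)
  } else if (is.character(col) || is.factor(col)) {
    df_test[is.na(col), var] <- Mode(col, na.rm = TRUE)
  }
}

print(df_test)
```

With this data the missing `age` becomes 8.75 and the missing `a` becomes "cc". Every value of `b` occurs exactly once, so `Mode` returns its ">1 mode" placeholder and that string is what fills the gap in `b`.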
I recommend that you use an editor with syntax highlighting and bracket matching, which would make it easier to find these sorts of syntax errors.
First, you need to write a mode function that takes into account the missing values of the categorical data, which have length < 1.
The mode function:
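The function itself is not shown here, so this is a minimal sketch of one way to write it, under the assumption that "length < 1" refers to empty strings. The name `get_mode` is hypothetical, and `NA` is treated as missing as well.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# Both NA and empty strings ("") are treated as missing.
get_mode <- function(v) {
  v <- v[!is.na(v) & nzchar(as.character(v))]
  uniq <- unique(v)
  # match() maps each value to its position in uniq and tabulate() counts
  # those positions; on ties, which.max() picks the value seen first.
  uniq[which.max(tabulate(match(v, uniq)))]
}

get_mode(c("cc", "", "aa", NA, "cc"))  # "cc"
get_mode(c(3, 1, 3, NA))               # 3
```

If every value is missing, the function returns a zero-length vector, so a caller may want to check for that case before assigning the result.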
Then you can iterate over the columns: if a column is numeric, fill its missing values with the mean; otherwise, fill them with the mode.
The loop statement below:
Let's provide an example:
The initial df with the missing values:
By running the for loop above, we get:
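The loop, the example data and the output are not reproduced here. As a stand-in, here is a self-contained base R sketch of the procedure described above. The data frame is adapted from the question, with one empty string added to `b`; the helper `get_mode` is hypothetical and is defined inside the block so that it runs on its own. It assumes the data contain only numeric, character and factor columns.

```r
# Hypothetical mode helper: NA and empty strings both count as missing.
get_mode <- function(v) {
  v <- v[!is.na(v) & nzchar(as.character(v))]
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Example data, adapted from the data frame in the question.
df <- data.frame(
  age = c(5, 8, 10, 12, NA),
  a   = factor(c("aa", "bb", NA, "cc", "cc")),
  b   = c("apple", "apple", "pear", "", NA),
  stringsAsFactors = FALSE
)

for (col in names(df)) {
  x <- df[[col]]
  if (is.numeric(x)) {
    # Numeric column: replace NA with the column mean.
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    # Character or factor column: replace NA and "" with the mode.
    missing <- is.na(x) | !nzchar(as.character(x))
    x[missing] <- as.character(get_mode(x))
  }
  df[[col]] <- x
}

print(df)
```

After the loop, the missing `age` is 8.75 (the mean of 5, 8, 10 and 12), the missing `a` is "cc", and both the empty and the `NA` entry of `b` are "apple". The factor column stays a factor because the mode is always one of its existing levels.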
As we can see, the missing values have been imputed.