An elegant way to report missing values in a data.frame

Posted on 2024-12-18 18:37:01

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

for (Var in names(airquality)) {
    missing <- sum(is.na(airquality[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}

Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.

Comments (15)

遮了一弯 2024-12-25 18:37:01

Just use sapply

> sapply(airquality, function(x) sum(is.na(x)))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

You could also use apply or colSums on the logical matrix created by is.na():

> apply(is.na(airquality),2,sum)
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0
> colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 
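
Since the question asks for only the variables that actually have missing values, ideally as a data.frame, the counts above can be filtered and reshaped. A minimal sketch in base R; the names na_counts and na_report are illustrative and not part of the answer:

```r
# Count NAs per column, then keep only the columns that have at least one.
na_counts <- colSums(is.na(airquality))
na_counts <- na_counts[na_counts > 0]

# Reshape the named vector into a data.frame, one row per variable.
na_report <- data.frame(
  variable  = names(na_counts),
  n_missing = unname(na_counts),
  row.names = NULL
)
na_report
#   variable n_missing
# 1    Ozone        37
# 2  Solar.R         7
```

With hundreds of columns this keeps the report short, and the result can be sorted or joined like any other data.frame.
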
妄断弥空 2024-12-25 18:37:01

My new favourite for (not too wide) data is the set of methods from the excellent naniar package. You get not only frequencies but also patterns of missingness:

library(naniar)
library(UpSetR)

riskfactors %>%
  as_shadow_upset() %>%
  upset()

It's often useful to see where the missing values are in relation to the non-missing ones, which can be achieved with a scatter plot that includes the missing values:

library(ggplot2)

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point()

Or for categorical variables:

gg_miss_fct(x = riskfactors, fct = marital)

These examples are from the package vignette, which lists other interesting visualizations.

野鹿林 2024-12-25 18:37:01

We can use map_df with purrr.

library(purrr)

# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R  Wind  Temp Month   Day
# <int>   <int> <int> <int> <int> <int>
# 1    37       7     0     0     0     0
鹿港小镇 2024-12-25 18:37:01

summary(airquality) already gives you this information.

The VIM package also offers some nice missing-data plots for data frames:

library("VIM")
aggr(airquality)

向地狱狂奔 2024-12-25 18:37:01

Another graphical alternative is the plot_missing function from the excellent DataExplorer package.

The docs also point out that you can save the results for additional analysis with missing_data <- plot_missing(data).

梦里的微风 2024-12-25 18:37:01

More succinct: sum(is.na(x[1]))

That is:

  1. x[1] Look at the first column

  2. is.na() true if it's NA

  3. sum() TRUE is 1, FALSE is 0
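
The three steps can be sketched one at a time. This assumes x is a data frame, with airquality standing in for it here; the intermediate names first_col and na_flags are illustrative only:

```r
x <- airquality  # stand-in for your own data frame

first_col <- x[1]              # one-column data.frame (here: Ozone)
na_flags  <- is.na(first_col)  # logical matrix, TRUE where the value is NA
sum(na_flags)                  # TRUE counts as 1, FALSE as 0
# [1] 37

# Equivalent spellings that select the column by name:
sum(is.na(x$Ozone))
sum(is.na(x[["Ozone"]]))
```
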

看透却不说透 2024-12-25 18:37:01

Another function that would help you look at missing data is df_status from the funModeling library.

library(funModeling)

iris.2 is the iris dataset with some added NAs. You can replace this with your dataset.

df_status(iris.2)

This will give you the number and percentage of NAs in each column.

毁梦 2024-12-25 18:37:01

For one more graphical solution, the visdat package offers vis_miss.

library(visdat)
vis_miss(airquality)

The output is very similar to Amelia's, with the small difference that it gives the percentages of missing values out of the box.

陈年往事 2024-12-25 18:37:01

I think the Amelia library does a nice job of handling missing data; it also includes a map for visualizing the missing rows.

install.packages("Amelia")
library(Amelia)
missmap(airquality)

You can also run the following code, which returns a logical value for each row indicating whether it contains any NA (here training is your own data frame):

row.has.na <- apply(training, 1, function(x){any(is.na(x))})
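
As a rough illustration of how that row flag can be used, here is the same idea applied to airquality instead of training. The name row_has_na is illustrative, and complete.cases() is a base R alternative that the answer does not mention:

```r
# Flag each row that contains at least one NA.
row_has_na <- apply(airquality, 1, function(x) any(is.na(x)))

# complete.cases() gives the opposite flag without a row-wise apply().
stopifnot(all(unname(row_has_na) == !complete.cases(airquality)))

sum(row_has_na)                 # number of incomplete rows
head(airquality[row_has_na, ])  # inspect the first few of them
```
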
残龙傲雪 2024-12-25 18:37:01

Another graphical and interactive way is to use the is.na10 function from the heatmaply library:

library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1,
          showticklabels = c(TRUE, FALSE),
          k_col = 3, k_row = 3,
          margins = c(55, 30),
          colors = c("grey80", "grey20"))

Probably won't work well with large datasets.

傲娇萝莉攻 2024-12-25 18:37:01

A dplyr solution to get the count could be:

library(dplyr)
summarise_all(df, ~sum(is.na(.)))

Or to get a percentage (expressed as a fraction between 0 and 1; this uses the is_missing() helper defined below):

summarise_all(df, ~(sum(is_missing(.) / nrow(df))))

Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:

is_missing <- function(x){
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}

# sample ugly data
df <- data.frame(a = c(NA, '1', '  ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))

# counts:
> summarise_all(df, ~sum(is_missing(.)))
  a b c d e
1 3 1 3 0 2

# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
     a    b    c d   e
1 0.75 0.25 0.75 0 0.5
东走西顾 2024-12-25 18:37:01

If you want to do it for a particular column, then you can also use this:

length(which(is.na(airquality[1])==T))
薄荷→糖丶微凉 2024-12-25 18:37:01

The ExPanDaR package's function prepare_missing_values_graph can be used to explore panel data.

掩于岁月 2024-12-25 18:37:01

For piping you could write:

library(dplyr)  # provides %>% and summarise_all()

# Counts
df %>% is.na() %>% colSums()

# % of missing rounded to 2 decimals 
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2)) 
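
If dplyr is not available, roughly the same percentages can be obtained in base R, because the mean of a logical vector is the share of TRUE values. A sketch under that assumption, with airquality standing in for df; the names pct_missing and only_missing are illustrative:

```r
df <- airquality  # stand-in for your own data frame

# colMeans() of a logical matrix is the share of TRUEs per column,
# so this is the percentage of missing values, rounded to 2 decimals.
pct_missing <- round(100 * colMeans(is.na(df)), 2)

# Keep only the variables that actually have missing values.
only_missing <- pct_missing[colSums(is.na(df)) > 0]
only_missing
#   Ozone Solar.R 
#   24.18    4.58 
```

The filter uses the raw counts rather than the rounded percentages, so a column with very few NAs is not dropped just because its percentage rounds to zero.
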
妞丶爷亲个 2024-12-25 18:37:01

summary(airquality) shows NAs by default, unlike table() for vectors, which requires useNA = "ifany". (Warning: don't call table() on a whole data frame; it cross-tabulates all the columns against each other and can exhaust memory.)

My new favorite way to summarize data frame values, with n_missing and complete_rate for all column types, is with skimr (https://docs.ropensci.org/skimr/index.html):

> skim(airquality)
── Data Summary ────────────────────────
                           Values    
Name                       airquality
Number of rows             153       
Number of columns          6         
_______________________              
Column type frequency:               
  numeric                  6         
________________________             
Group variables            None      

── Variable type: numeric ────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate   mean    sd   p0   p25   p50   p75  p100 hist 
1 Ozone                37         0.758  42.1  33.0   1    18    31.5  63.2 168   ▇▃▂▁▁
2 Solar.R               7         0.954 186.   90.1   7   116.  205   259.  334   ▅▃▅▇▅
3 Wind                  0         1       9.96  3.52  1.7   7.4   9.7  11.5  20.7 ▂▇▇▃▁
4 Temp                  0         1      77.9   9.47 56    72    79    85    97   ▂▃▇▇▃
5 Month                 0         1       6.99  1.42  5     6     7     8     9   ▇▇▇▇▇
6 Day                   0         1      15.8   8.86  1     8    16    23    31   ▇▇▇▇▆

Aside from the printed summary, you can also get summary statistics as a dataframe returned from skim(). You can also customize the statistics reported with sfl().
