Elegant way to report missing values in a data.frame
Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:
for (Var in names(airquality)) {
  missing <- sum(is.na(airquality[, Var]))
  if (missing > 0) {
    print(c(Var, missing))
  }
}
Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.
Just use sapply.
You could also use apply or colSums on the matrix created by is.na().
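The code that accompanied this answer is not preserved here. As a minimal sketch of the two approaches it names, using the built-in airquality data from the question (the names na_count, na_count2, and missing_report are illustrative, not from the original answer):

```r
# Count NAs per column; sapply returns a named integer vector.
na_count <- sapply(airquality, function(x) sum(is.na(x)))

# Equivalent: column sums of the logical matrix produced by is.na().
na_count2 <- colSums(is.na(airquality))

# Keep only the variables that have missing values, as a data.frame.
has_na <- na_count > 0
missing_report <- data.frame(
  variable  = names(na_count)[has_na],
  n_missing = unname(na_count[has_na])
)
print(missing_report)
```

For airquality this leaves two rows, Ozone (37) and Solar.R (7), which matches the requirement to report only variables with missing values.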
My new favourite for (not too wide) data is the set of methods from the excellent naniar package. You get not only frequencies but also patterns of missingness.
It is often useful to see where the missing values sit in relation to the non-missing ones, which can be done with a scatter plot that includes the missing values.
The same can be done for categorical variables.
These examples are from the package vignette, which lists other interesting visualizations.
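The original naniar snippets are not preserved here. As a hedged sketch of the frequency part only, assuming naniar is installed, miss_var_summary() returns one row per variable with n_miss and pct_miss columns; the name miss_tbl is illustrative, and the guard simply skips the example when the package is unavailable:

```r
if (requireNamespace("naniar", quietly = TRUE)) {
  # One row per variable: variable, n_miss, pct_miss.
  miss_tbl <- naniar::miss_var_summary(airquality)

  # Report only the variables that actually have missing values.
  print(miss_tbl[miss_tbl$n_miss > 0, ])
}
```

The plotting helpers mentioned in the answer are covered in the package vignette.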
We can use map_df with purrr.
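The purrr snippet itself is missing from this copy. One way it might look, as a sketch rather than the answerer's exact code (na_counts is an illustrative name, and keep() is used here only to drop columns without missing values):

```r
library(purrr)

# map_df applies the function to every column and binds the results
# into a one-row data frame with one column per variable.
na_counts <- map_df(airquality, ~ sum(is.na(.x)))

# Keep only the columns that actually contain missing values.
keep(na_counts, ~ .x > 0)
```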
already gives you this information
The VIM package also offers some nice missing-data plots for data.frames.
Another graphical alternative is the plot_missing function from the excellent DataExplorer package. The docs also point out that you can save the results for additional analysis with missing_data <- plot_missing(data).
More succinct:
sum(is.na(x[1]))
That is:
x[1] looks at the first column
is.na() is TRUE if the value is NA
sum() counts TRUE as 1 and FALSE as 0
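To make the three steps concrete, here is a small sketch that assumes x is the airquality data frame from the question (the intermediate names first_col and na_flags are only for illustration):

```r
x <- airquality

first_col <- x[1]              # one-column data.frame holding Ozone
na_flags  <- is.na(first_col)  # logical matrix: TRUE where the value is NA
n_missing <- sum(na_flags)     # TRUE counts as 1, FALSE as 0

n_missing
```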
Another function that helps you look at missing data is df_status from the funModeling library.
iris.2 is the iris dataset with some added NAs. You can replace it with your own dataset.
This gives you the number and percentage of NAs in each column.
For one more graphical solution, the visdat package offers vis_miss. Its output is very similar to Amelia's, with the small difference that it shows percentages of missing values out of the box.
I think the Amelia library does a nice job of handling missing data, and it also includes a map for visualizing the missing rows.
You can also run code that returns logical values flagging the NAs.
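The code for the logical check is not preserved in this copy. A base R sketch of what such a check could look like, under the assumption that one logical value per column is wanted (any_na is an illustrative name):

```r
# TRUE for each column that contains at least one NA.
any_na <- sapply(airquality, anyNA)
any_na

# Names of the columns with missing values.
names(any_na)[any_na]
```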
Another graphical and interactive way is to use the is.na10 function from the heatmaply library. It probably won't work well with large datasets.
A dplyr solution can return the count of missing values per column, or the percentage.
It is also worth noting that missing data can be ugly, inconsistent, and not always coded as NA, depending on the source or how it was handled on import. A small helper function can be tweaked depending on your data and what you want to treat as missing.
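The dplyr code from this answer is missing here. The sketch below shows one way to get counts and percentages with summarise() and across() (available from dplyr 1.0.0). The is_missing() helper, its list of placeholder codes, and the messy example data are hypothetical and should be adapted to your own data:

```r
library(dplyr)

# Count of NA values per column (one-row data frame).
na_counts <- airquality %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Percentage of NA values per column.
na_percent <- airquality %>%
  summarise(across(everything(), ~ 100 * mean(is.na(.x))))

# Hypothetical helper: also treat blank strings and a few common
# placeholder codes as missing. Adjust `codes` to your data.
is_missing <- function(x, codes = c("", "NA", "N/A", "-", "?")) {
  is.na(x) | trimws(as.character(x)) %in% codes
}

# Made-up data where missingness is coded inconsistently.
messy <- data.frame(id = 1:4, city = c("Oslo", "", "N/A", NA))
messy_counts <- messy %>%
  summarise(across(everything(), ~ sum(is_missing(.x))))
messy_counts
```

In the made-up messy data, city has three values treated as missing (the empty string, "N/A", and the real NA) while id has none.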
If you want to do it for a particular column, then you can also use this
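The snippet this answer refers to is not preserved. A minimal single-column sketch, assuming the column of interest is Ozone in airquality (variable names are illustrative):

```r
# Number of missing values in one column.
n_missing_ozone <- sum(is.na(airquality$Ozone))

# Share of missing values in that column, as a percentage.
pct_missing_ozone <- 100 * mean(is.na(airquality$Ozone))

c(count = n_missing_ozone, percent = pct_missing_ozone)
```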
The ExPanDaR package function prepare_missing_values_graph can be used to explore panel data.
For piping you could write:
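The piped expression itself is missing from this copy. One possible form, as a sketch that assumes R 4.1 or later for the native pipe (with magrittr, %>% would work the same way):

```r
# Pipe the data frame through is.na() and sum each column.
na_counts <- airquality |> is.na() |> colSums()

# Report only the variables with missing values.
na_counts[na_counts > 0]
```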
summary(airquality) shows NAs by default, unlike table() for vectors, which requires useNA = "ifany". (Warning: don't try table() on a whole data frame; it cross-tabulates every column and can exhaust memory.)
My new favorite way to summarize data frame values, with n_missing and complete_rate for all column types, is with skimr (https://docs.ropensci.org/skimr/index.html).
Aside from the printed summary, you can also get the summary statistics as a data frame returned from skim(). You can also customize the reported statistics with sfl().
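A short sketch of both points, since the original snippets are not preserved. The vector x and the names tab_default, tab_with_na, and skimmed are illustrative, and the skimr part is guarded so it is skipped when the package is not installed:

```r
# table() drops NA by default; useNA = "ifany" adds it as its own level.
x <- c("a", "b", NA, "a")
tab_default <- table(x)
tab_with_na <- table(x, useNA = "ifany")
tab_default
tab_with_na

# summary() on a column lists an "NA's" entry when values are missing.
summary(airquality$Ozone)

if (requireNamespace("skimr", quietly = TRUE)) {
  # skim() returns a data frame with one row per variable.
  skimmed <- as.data.frame(skimr::skim(airquality))
  print(skimmed[, c("skim_variable", "n_missing", "complete_rate")])
}
```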