Fastest way to replace NAs in a large data.table
I have a large data.table with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.
I see two options:
1: Convert to a data.frame, and use something like this
2: Some kind of cool data.table subsetting command
I'll be happy with a fairly efficient solution of type 1. Converting to a data.frame and then back to a data.table won't take too long.
Here's a solution using data.table's := operator, building on Andrie's and Ramnath's answers.
Note that f_dowle updated dt1 by reference. If a local copy is required, then an explicit call to the copy function is needed to make a local copy of the whole dataset. data.table's setkey, key<- and := do not copy-on-write.
Next, let's see where f_dowle is spending its time.
There, I would focus on na.replace and is.na, where there are a few vector copies and vector scans. Those can fairly easily be eliminated by writing a small na.replace C function that updates NA by reference in the vector. That would at least halve the 20 seconds, I think. Does such a function exist in any R package?
The reason f_andrie fails may be that it copies the whole of dt1, or creates a logical matrix as big as the whole of dt1, a few times. The other two methods work on one column at a time (although I only briefly looked at NAToUnknown).
EDIT (more elegant solution as requested by Ramnath in comments):
I wish I did it that way to start with!
EDIT2 (over 1 year later, now)
There is also set(). This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling [,:=,] in a loop. set is a loopable :=. See ?set.
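The code blocks of the answer above did not survive on this page. As a minimal sketch of the set() idea it describes (assuming the data.table package is installed; the helper name na_to_zero and the tiny sample table are illustrative, not taken from the answer), a loop over columns that updates by reference could look like this:

```r
library(data.table)

# Hypothetical helper (name is illustrative): replace NA with 0 in every
# column of DT, updating DT by reference through set().
na_to_zero <- function(DT) {
  for (j in seq_len(ncol(DT))) {
    # which() gives the row numbers holding NA in column j
    set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
  }
  invisible(DT)
}

dt1 <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6))
backup <- copy(dt1)  # explicit copy, made before the by-reference update
na_to_zero(dt1)      # dt1 itself is modified; backup still holds the NAs
```

Because the update happens in place, the copy() call is the only point where the whole table is duplicated, which matches the answer's remark about needing an explicit copy when the original must be kept.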
Here's the simplest one I could come up with:
dt[is.na(dt)] <- 0
It's efficient, and there is no need to write functions or other glue code.
Dedicated functions (nafill and setnafill) for this purpose are available in the data.table package (version >= 1.12.4):
They process columns in parallel, so they handle the previously posted benchmarks well. The answer timed them against the fastest approach so far, and also scaled the benchmark up on a 40-core machine.
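As a small sketch of how these two functions are typically called (assuming data.table >= 1.12.4 and numeric columns; the sample table is made up for illustration):

```r
library(data.table)

dt <- data.table(a = c(1, NA, 3), b = c(NA, 2.5, NA))

# nafill() returns a filled copy of the vector and leaves dt untouched
a_filled <- nafill(dt$a, type = "const", fill = 0)

# setnafill() fills the columns of dt in place (by reference)
setnafill(dt, type = "const", fill = 0)
```

The setnafill() form is the one that fits the question, since it avoids copying the table.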
Just for reference: this is slower than the gdata or data.matrix approaches, but it uses only the data.table package and can deal with non-numerical entries.
Here is a solution using NAToUnknown in the gdata package. I have used Andrie's solution to create a huge data table and have also included time comparisons with Andrie's solution.
My understanding is that the secret to fast operations in R is to utilise vectors (or arrays, which are vectors under the hood).
In this solution I make use of a data.matrix, which is an array but behaves a bit like a data.frame. Because it is an array, you can use a very simple vector substitution to replace the NAs:
A little helper function to remove the NAs. The essence is a single line of code. I only do this to measure execution time.
A little helper function to create a data.table of a given size.
Demonstration on a tiny sample:
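The answer's code is missing from this page. The following is a sketch of the three pieces it describes, assuming data.table is installed; the function names, argument names, and sizes are illustrative and may differ from what the answer used:

```r
library(data.table)

# Helper: build a data.table of the given size with a share of NAs
# (illustrative name and arguments).
create_dt <- function(nrow = 5, ncol = 5, prop_na = 0.5) {
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow * ncol), prop_na * nrow * ncol)] <- NA
  data.table(matrix(v, ncol = ncol))
}

# Helper: convert to a numeric matrix, replace the NAs with one vectorised
# assignment, and convert back to a data.table.
remove_na <- function(x) {
  dm <- data.matrix(x)
  dm[is.na(dm)] <- 0  # the essential line
  data.table(dm)
}

set.seed(1)
dt1 <- create_dt(4, 3, 0.5)
dt2 <- remove_na(dt1)
```

Note that data.matrix() coerces every column to numeric, so this route only suits tables whose columns are all numeric, and it creates new objects rather than updating dt1 in place.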
For the sake of completeness, another way to replace NAs with 0 is to use
To compare results and times, I have incorporated all approaches mentioned so far.
So the new approach is slightly slower than f_dowle3 but faster than all the other approaches. But to be honest, this is against my intuition of the data.table syntax, and I have no idea why this works. Can anybody enlighten me?
Using the fifelse function from the newest data.table version, 1.12.6, it is even 10 times faster than NAToUnknown in the gdata package:
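A minimal single-column sketch of the fifelse idea, assuming a data.table version that provides fifelse; the column name and values are made up. fifelse requires its yes and no arguments to have the same type, so 0 is used here with a double column (an integer column would need 0L):

```r
library(data.table)

dt <- data.table(x = c(1.5, NA, 3, NA))

# Replace NA in column x by reference; yes = 0 and no = x are both double
dt[, x := fifelse(is.na(x), 0, x)]
```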
To generalize to many columns, you could use this approach (using the previous sample data but adding a column):
I didn't test the speed, though.
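One way such a generalisation might look, as a sketch under the assumption that all target columns are double (the sample data and column names are illustrative, not the answer's): apply fifelse to each column through .SD and assign the results back by reference.

```r
library(data.table)

dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6), c = c(7, 8, NA))

cols <- names(dt)  # columns to clean; restrict this vector as needed

# Apply fifelse column by column and assign back by reference
dt[, (cols) := lapply(.SD, function(x) fifelse(is.na(x), 0, x)), .SDcols = cols]
```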
A fast alternative is collapse::replace_NA, which by default replaces NAs with 0.
Microbenchmark on a data.frame with 10 columns of 1M rows and 10% NAs.
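As a small usage sketch (assuming the collapse package is installed; the data here is a scaled-down stand-in with 1,000 rows rather than the 1M rows of the benchmark, and it is not the answer's benchmark code):

```r
library(collapse)

set.seed(42)
n <- 1000

# 10 numeric columns, each with 10% of its values set to NA
df <- as.data.frame(replicate(10, {
  x <- rnorm(n)
  x[sample(n, n / 10)] <- NA
  x
}))

# With no value supplied, replace_NA() fills NAs with 0
filled <- replace_NA(df)
```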
Code: