Why does "unique" run faster on a data frame than on a matrix in R?
I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time({
u1 = unique(a)
})
user system elapsed
1.840 0.000 1.846
system.time({
u2 = unique(b)
})
user system elapsed
0.380 0.000 0.379
The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.
1. Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.
2. Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).
Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea; see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users would be well served by adopting data tables, for pedagogical / community reasons I'll leave the question open for now regarding why this takes longer on matrix objects. The answers below address where the time goes and how else we can get better performance (i.e. data tables). The answer to why is close at hand: the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing and why is all that is lacking.
In this implementation, unique.matrix is the same as unique.array:
> identical(unique.array, unique.matrix)
[1] TRUE
unique.array has to handle multi-dimensional arrays, which requires additional processing to 'collapse' the extra dimensions (those extra calls to paste()); that processing is not needed in the 2-dimensional case. The key section of code is:
collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
temp <- if (collapse)
    apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
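To make the quoted fragment concrete, here is a small self-contained sketch of the same idea: build one string key per row, then drop rows whose key has already been seen. The else branch and the final subsetting step are assumptions added so the snippet runs on its own; they are not quoted from the R sources.

```r
x <- matrix(c(1L, 1L, 2L,
              3L, 3L, 4L), ncol = 2)   # rows: (1,3), (1,3), (2,4)
MARGIN <- 1L
dx <- dim(x)
ndim <- length(dx)

# TRUE here: more than one dimension and more than one column per row.
collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

# One paste() call per row; "\r" separates the fields inside each key.
temp <- if (collapse) {
  apply(x, MARGIN, function(r) paste(r, collapse = "\r"))
} else {
  x   # assumed fallback for the single-column case
}

# Keep the first row for every distinct key.
u <- x[!duplicated(temp), , drop = FALSE]
u
```

The per-row function call inside apply() is where the extra cost comes from when the matrix has many rows.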
unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation.
Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits, so
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))
is 1, while
NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))
and
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))
are both 2. Are you sure unique is what you want?
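The 15-digit effect can be seen without calling unique at all. The sketch below uses sprintf("%.15g", ...) as a stand-in for the 15-significant-digit string conversion; that substitution is an assumption made for illustration, and the results of the unique calls above may differ in R versions that no longer compare pasted strings.

```r
x4 <- 1 + 4e-15
x5 <- 1 + 5e-15

# As doubles, both values are distinguishable from 1.
c(x4 == 1, x5 == 1)            # FALSE FALSE

# Rendered with 15 significant digits, x4 collapses onto "1" but x5 does not.
key1 <- sprintf("%.15g", 1)
key4 <- sprintf("%.15g", x4)
key5 <- sprintf("%.15g", x5)
c(key4 == key1, key5 == key1)  # TRUE FALSE
```

So two rows that differ numerically can still share a string key, which is why the first example reports a single unique row.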
Not sure, but I guess that because a matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste needs a list of vectors. Note that both are slow because both use paste.
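The two ways of building row keys with paste can be compared directly. This is a sketch under the assumption that the matrix method pastes row by row through apply() and the data frame method pastes whole columns in one vectorised call; the variable names are made up for the example.

```r
set.seed(1)
a <- matrix(sample(2, 10^5, replace = TRUE), ncol = 10)
b <- as.data.frame(a)

# Matrix style: one paste() call per row.
keys_rowwise <- apply(a, 1, function(r) paste(r, collapse = "\r"))

# Data frame style: a single paste() call over the list of column vectors.
keys_colwise <- do.call(paste, c(as.list(b), sep = "\r"))

# Both constructions yield the same keys; only the cost differs.
identical(keys_rowwise, keys_colwise)

system.time(apply(a, 1, function(r) paste(r, collapse = "\r")))
system.time(do.call(paste, c(as.list(b), sep = "\r")))
```

On most machines the vectorised call should be noticeably faster, which would be consistent with the timings in the question, though the exact ratio depends on the R version.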
.Perhaps because
unique.data.table
is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository because that has the fix tounique
you raised in this question.data.table
doesn't usepaste
to dounique
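For comparison, a hedged sketch of the data.table route. It assumes the data.table package is installed and that the table is unkeyed, so that unique considers all columns; the guard simply skips the example otherwise.

```r
set.seed(1)
a <- matrix(sample(2, 10^5, replace = TRUE), ncol = 10)

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(a)   # one column per matrix column
  u_dt <- unique(dt)                   # dispatches to the data.table method
  n_dt <- nrow(u_dt)
} else {
  n_dt <- NA_integer_                  # package not available
}
n_dt
```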
In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
Here are the results for the first unique operation (for the matrix):
And for the data frame:
What we notice is that the matrix version spends a lot of time in apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame, and most of the time is spent in paste, presumably aggregating results.
Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.
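A profile like the ones described can be collected along the following lines. This is a sketch, not the exact commands used for the results above: the matrix size and file name are placeholders, and the tryCatch is there because summaryRprof fails when the call finishes too quickly to be sampled.

```r
set.seed(1)
a <- matrix(sample(2, 10^6, replace = TRUE), ncol = 10)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)        # start sampling the call stack
u1 <- unique(a)
Rprof(NULL)             # stop profiling

# Time attributed to each function itself; NULL if no samples were recorded.
by_self <- tryCatch(summaryRprof(prof_file)$by.self,
                    error = function(e) NULL)
head(by_self)
```

Repeating the same steps with as.data.frame(a) in place of a gives the data frame profile for comparison.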