Why does unique run faster on a data frame than on a matrix in R?

Posted on 2024-12-10 17:22:45

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.

a   = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b   = as.data.frame(a)

system.time({
    u1 = unique(a)
})
 user  system elapsed
1.840   0.000   1.846


system.time({
    u2 = unique(b)
})
 user  system elapsed
0.380   0.000   0.379

The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.

  1. Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.

  2. Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
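
A minimal sketch of such a wrapper might look like the following (the exact details, including the unname() check, are illustrative; assume a plain numeric or integer matrix as input):

myUnique <- function(m) {
    u <- unique(as.data.frame(m))  # dispatches to the faster unique.data.frame
    as.matrix(u)                   # back to a matrix; note this adds V1, V2, ... column names
}

# Rough check that the values agree with unique() on the matrix itself;
# dimnames are stripped because the round trip introduces column names:
# identical(unname(unique(a)), unname(myUnique(a)))  # should be TRUE for the example above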


Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).

Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users would be well served to adopt data tables, for pedagogical / community reasons I'll leave the question open for now regarding why this takes longer on the matrix objects. The answers below address where the time goes and how else we can get better performance (i.e. data tables). The answer to why is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing & why is all that is lacking.


Comments (3)

柏拉图鍀咏恒 2024-12-17 17:22:45

  1. In this implementation, unique.matrix is the same as unique.array

    > identical(unique.array, unique.matrix)

    [1] TRUE

  2. unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()); that step is not needed in the 2-dimensional case. The key section of code is:

    collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

    temp <- if (collapse)
    apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

  3. unique.data.frame is optimised for the 2D case, unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation.

Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits so

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1 while

NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique is what you want?
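
To make the string comparison concrete: paste() coerces the numbers to character (via as.character()), and at that precision 1 and 1 + 4e-15 should yield the same key string while 1 + 5e-15 should not:

as.character(1)          # "1"
as.character(1 + 4e-15)  # also "1", so the pasted row keys collide
as.character(1 + 5e-15)  # no longer "1", so those rows remain distinct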

2024-12-17 17:22:45
  1. Not sure, but I guess that because a matrix is one contiguous vector, R first copies it into column vectors (like a data.frame), since paste needs a list of vectors. Note that both are slow because both use paste.

  2. Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.

a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
   user  system elapsed 
   2.98    0.00    2.99 
system.time(u2<-unique(b))
   user  system elapsed 
   0.99    0.00    0.99 
c = as.data.table(b)
system.time(u3<-unique(c))
   user  system elapsed 
   0.03    0.02    0.05  # 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE
携余温的黄昏 2024-12-17 17:22:45


In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
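
For reference, a profile like this can be generated with Rprof() roughly as follows (the file names match the summaryRprof() calls below; the exact 5M-element setup is a guess at what was actually run):

a <- matrix(sample(2, 5*10^6, replace = TRUE), ncol = 10)  # ~5M elements
b <- as.data.frame(a)

Rprof("u1.txt")   # start the sampling profiler, writing samples to u1.txt
u1 <- unique(a)   # the matrix case
Rprof(NULL)       # stop profiling

Rprof("u2.txt")
u2 <- unique(b)   # the data frame case
Rprof(NULL)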

Here are the results for the first unique operation (for the matrix):

> summaryRprof("u1.txt")
$by.self
                     self.time self.pct total.time total.pct
"paste"                   5.70    52.58       5.96     54.98
"apply"                   2.70    24.91      10.68     98.52
"FUN"                     0.86     7.93       6.82     62.92
"lapply"                  0.82     7.56       1.00      9.23
"list"                    0.30     2.77       0.30      2.77
"!"                       0.14     1.29       0.14      1.29
"c"                       0.10     0.92       0.10      0.92
"unlist"                  0.08     0.74       1.08      9.96
"aperm.default"           0.06     0.55       0.06      0.55
"is.null"                 0.06     0.55       0.06      0.55
"duplicated.default"      0.02     0.18       0.02      0.18

$by.total
                     total.time total.pct self.time self.pct
"unique"                  10.84    100.00      0.00     0.00
"unique.matrix"           10.84    100.00      0.00     0.00
"apply"                   10.68     98.52      2.70    24.91
"FUN"                      6.82     62.92      0.86     7.93
"paste"                    5.96     54.98      5.70    52.58
"unlist"                   1.08      9.96      0.08     0.74
"lapply"                   1.00      9.23      0.82     7.56
"list"                     0.30      2.77      0.30     2.77
"!"                        0.14      1.29      0.14     1.29
"do.call"                  0.14      1.29      0.00     0.00
"c"                        0.10      0.92      0.10     0.92
"aperm.default"            0.06      0.55      0.06     0.55
"is.null"                  0.06      0.55      0.06     0.55
"aperm"                    0.06      0.55      0.00     0.00
"duplicated.default"       0.02      0.18      0.02     0.18

$sample.interval
[1] 0.02

$sampling.time
[1] 10.84

And for the data frame:

> summaryRprof("u2.txt")
$by.self
                     self.time self.pct total.time total.pct
"paste"                   1.72    94.51       1.72     94.51
"[.data.frame"            0.06     3.30       1.82    100.00
"duplicated.default"      0.04     2.20       0.04      2.20

$by.total
                        total.time total.pct self.time self.pct
"[.data.frame"                1.82    100.00      0.06     3.30
"["                           1.82    100.00      0.00     0.00
"unique"                      1.82    100.00      0.00     0.00
"unique.data.frame"           1.82    100.00      0.00     0.00
"duplicated"                  1.76     96.70      0.00     0.00
"duplicated.data.frame"       1.76     96.70      0.00     0.00
"paste"                       1.72     94.51      1.72    94.51
"do.call"                     1.72     94.51      0.00     0.00
"duplicated.default"          0.04      2.20      0.04     2.20

$sample.interval
[1] 0.02

$sampling.time
[1] 1.82

What we notice is that the matrix version spends a lot of time on apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame, and most of the time is spent in paste, presumably aggregating results.
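
A rough sketch of the two strategies visible in these profiles (simplified, not the actual base-R source): unique.matrix builds one key string per row with a paste() call inside apply(), while unique.data.frame (of this vintage) builds all the row keys in a single vectorised paste() over the columns via do.call().

# unique.matrix / unique.array style: one paste() call per row, inside apply()
row_keys_matrix <- function(m)
    apply(m, 1L, function(r) paste(r, collapse = "\r"))

# unique.data.frame style (as profiled above): one vectorised paste() over the columns
row_keys_df <- function(d)
    do.call("paste", c(d, sep = "\r"))

# Either way, duplicated() then runs on the character keys, e.g.
# u2 <- b[!duplicated(row_keys_df(b)), , drop = FALSE]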

Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.
