Why does converting to list improve the performance of lapply?

Posted 2025-01-20 05:08:18

I am surprised that the first line runs much slower than the second one, whose performance is suspiciously close to that of the vectorized version. If processing a list is so much faster than processing a numeric(n) vector, why doesn't R convert its input to a list automatically?

> system.time(lapply(1:10^7, sqrt))
   user  system elapsed
  4.445   0.204   4.692
> system.time(lapply(list(1:10^7), sqrt))
   user  system elapsed
  0.048   0.015   0.062
> system.time(sqrt(1:10^7))
   user  system elapsed
   0.04    0.00    0.04

Here is the version information:

$ R --version
R version 4.1.3 (2022-03-10) -- "One Push-Up"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin21.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

$ sw_vers
ProductName:    macOS
ProductVersion: 12.3.1
BuildVersion:   21E258

Comments (1)

满身野味 2025-01-27 05:08:18

The reason is that the second expression is just a list of length 1:

> length(list(1:10^7))
[1] 1

This is essentially the same as applying sqrt directly to the whole vector. To apply it to each element separately, we need as.list instead of list:

> length(as.list(1:10^7))
[1] 10000000

Converting a vector to a list is unnecessary if the intention is to loop over each element of the vector. In a vector, each element is a unit (the same holds for a matrix, which is a vector with a dim attribute), but in a data.frame/tibble/data.table, each unit is a column. Thus, lapply loops over the columns of a data.frame, whereas it loops over the single elements of a vector. When we wrap a vector with list, the whole vector is encapsulated as a single list element:

> list(1:3)
[[1]]
[1] 1 2 3

> as.list(1:3)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

As sqrt is a vectorized function, looping over the first list (built with list) calls sqrt only once on the whole vector, whereas looping over the second list (built with as.list) calls it once per element.
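To make the number of calls visible, here is a minimal sketch that counts how often lapply invokes its function for each kind of input. The helper count_calls is hypothetical and written only for this illustration; the small vector v stands in for 1:10^7.

```r
# count_calls() is a hypothetical helper: it returns how many times
# lapply() calls its function for the given input.
count_calls <- function(x) {
  n_calls <- 0L
  lapply(x, function(el) {
    n_calls <<- n_calls + 1L  # update the counter in count_calls()
    sqrt(el)
  })
  n_calls
}

v <- 1:5
count_calls(v)                         # one call per vector element
count_calls(list(v))                   # a single call on the whole vector
count_calls(as.list(v))                # one call per list element
count_calls(data.frame(a = v, b = v))  # one call per column
```

With 10^7 elements, the per-element cases pay the function-call overhead ten million times, while the list(v) case pays it once.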


Thus, lapply over the vector and lapply over its as.list conversion have similar timings (the extra time in the second case goes to converting the vector to a list with as.list):

> system.time(lapply(1:10^7, sqrt))
   user  system elapsed 
  4.364   0.220   4.748 
> system.time(lapply(as.list(1:10^7), sqrt))
   user  system elapsed 
  4.882   0.367   5.518 
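The three approaches compute the same numbers and differ only in how the result is packaged. A small sketch, using a short vector x as a stand-in for 1:10^7, illustrates this:

```r
x <- 1:10

per_element <- lapply(x, sqrt)        # list of length-1 results
wrapped     <- lapply(list(x), sqrt)  # list holding one full-length result

length(per_element)  # 10
length(wrapped)      # 1

# Both contain the same values as the vectorized call.
identical(unlist(per_element), sqrt(x))
identical(wrapped[[1]], sqrt(x))
```

So lapply(list(x), sqrt) is fast only because it is the vectorized call wrapped in a one-element list, not because lists are processed faster.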

A faster option is vapply (useful when the function being applied is not vectorized and has to be called element by element in a loop):

> system.time(vapply(1:10^7, sqrt, numeric(1)))
   user  system elapsed 
  2.464   0.172   2.633 
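Besides timing, vapply differs from lapply in what it returns and in checking each result against a template. The following is a rough sketch on a small vector; the use of as.character to trigger a mismatch is just an illustration.

```r
x <- 1:5

res_list <- lapply(x, sqrt)              # returns a list
res_vec  <- vapply(x, sqrt, numeric(1))  # returns a numeric vector

is.list(res_list)
is.numeric(res_vec)

# vapply() raises an error when a result does not match the template,
# here because as.character() returns character rather than numeric values.
bad <- tryCatch(
  vapply(x, as.character, numeric(1)),
  error = function(e) "type mismatch caught"
)
bad
```

For a function that is already vectorized, such as sqrt, calling it directly on the vector remains the fastest choice, as the timings in the question show.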