Fast vector math in Clojure / Incanter
I'm currently looking into Clojure and Incanter as an alternative to R. (Not that I dislike R, but it's just interesting to try out new languages.) I like Incanter and find the syntax appealing, but vectorized operations are quite slow compared to, e.g., R or Python.
As an example, I wanted to get the first-order difference of a vector using Incanter vector operations, Clojure map, and R. Below is the code and timing for all versions. As you can see, R is clearly faster.
Incanter and Clojure:
(use '(incanter core stats))
(def x (doall (sample-normal 1e7)))
(time (def y (doall (minus (rest x) (butlast x)))))
"Elapsed time: 16481.337 msecs"
(time (def y (doall (map - (rest x) (butlast x)))))
"Elapsed time: 16457.850 msecs"
R:
rdiff <- function(x){
n = length(x)
x[2:n] - x[1:(n-1)]}
x = rnorm(1e7)
system.time(rdiff(x))
user system elapsed
1.504 0.900 2.561
So I was wondering: is there a way to speed up the vector operations in Incanter/Clojure? Solutions involving the use of loops, Java arrays and/or libraries from Clojure are also welcome.
I have also posted this question to the Incanter Google group, with no responses so far.
UPDATE: I have marked Jouni's answer as accepted; see below for my own answer, where I have cleaned up his code a bit and added some benchmarks.
My final solutions
After all the testing I found two slightly different ways to do the calculation with sufficient speed.
First, I used the function diff with different types of return values. Below is the code returning a vector, but I have also timed a version returning a double-array (replace (vec y) with y) and one returning an Incanter.matrix (replace (vec y) with matrix y). This function is based only on Java arrays; it is Jouni's code with some extra type hints removed.
Another approach is to do the calculations with Java arrays and store the values in a transient vector. As you see from the timings, this is slightly faster than approach 1 if you want the function to return an array. This is implemented in the function difft.
So the choice really depends on what you want to do with the data. I guess a good option would be to overload the function so that it returns the same type that was used in the call. Actually, passing a Java array to diff instead of a vector makes it ~1s faster.
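The second approach described above (read from a Java array, collect into a transient vector) could look roughly like the following. This is only a sketch under my own assumptions, not the answer's actual listing: the name difft-sketch is hypothetical, and it uses conj! on an empty transient, which may differ from how the original difft fills its result.

```clojure
;; Hypothetical sketch: Java array for the reads, transient vector for the writes.
(defn difft-sketch
  "First-order difference of a numeric collection, returned as a persistent vector."
  [xs]
  (let [^doubles x (double-array xs)   ; copy the input into a primitive double[]
        n (dec (alength x))]           ; number of differences to compute
    (loop [i 0
           acc (transient [])]
      (if (< i n)
        ;; aget on a hinted double[] avoids reflection and boxing on the read side
        (recur (inc i) (conj! acc (- (aget x (inc i)) (aget x i))))
        (persistent! acc)))))

(difft-sketch [1.0 4.0 9.0 16.0])
;; => [3.0 5.0 7.0]
```

Returning the double[] directly instead of a vector would skip the final conversion, which is the trade-off between return types discussed above.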
Timings for the different functions:
diff returning vector:
diff returning Incanter.matrix:
diff returning double-array:
difft:
The functions
Here's a Java arrays implementation that is faster on my system than your R code (YMMV). Note the enabling of reflection warnings, which is essential when optimizing for performance, the repeated type hint on y (the one on the def didn't seem to help for the aset), and the casting of everything to primitive double values (the dotimes makes sure that i is a primitive int).
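As a rough illustration of the techniques named here (reflection warnings, a type-hinted double array, and a dotimes loop over primitives), a minimal sketch might look like this. The function name diff-array is my own, and the code takes a plain double array instead of Incanter's sample-normal output so that it runs with clojure.core alone.

```clojure
;; Sketch only: diff-array is an illustrative name, not taken from the answer.
(set! *warn-on-reflection* true)   ; print a warning wherever a call needs reflection

(defn diff-array
  "First-order difference of a double array, returned as a new double array."
  [^doubles x]
  (let [n (dec (alength x))
        y (double-array (max n 0))]   ; guard against an empty input array
    (dotimes [i n]                    ; i is a primitive loop counter
      ;; with x and y known to be double[], aget/aset compile to direct array access
      (aset y i (- (aget x (inc i)) (aget x i))))
    y))

(vec (diff-array (double-array [1.0 4.0 9.0 16.0])))
;; => [3.0 5.0 7.0]
```

If the ^doubles hint is removed, the compiler should print reflection warnings for the alength, aget and aset calls, which is the signal to look for when tuning code like this.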
Bradford Cross's blog has a bunch of posts about this (he uses this stuff for the startup he works on). In general, using transients in inner loops, type hinting (checked via *warn-on-reflection*), etc. are all good for speed increases. The Joy of Clojure has a great section on performance tuning, which you should read.
Here's a solution with transients - appealing but slow.
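A transient-only version, without Java arrays, might be sketched as below. The name diff-transient is hypothetical and this is not necessarily the code the answer had in mind. It walks the input as a sequence, so every element is still a boxed number reached through first/next, which is a plausible reason such a version stays slow.

```clojure
;; Hypothetical sketch: transient vector as accumulator, input walked as a seq.
(defn diff-transient
  "First-order difference of a sequence of numbers, returned as a persistent vector."
  [xs]
  (loop [prev (first xs)
         more (next xs)            ; nil once the input is exhausted
         acc  (transient [])]
    (if more
      (recur (first more)
             (next more)
             (conj! acc (- (double (first more)) (double prev))))
      (persistent! acc))))

(diff-transient [1 4 9 16])
;; => [3.0 5.0 7.0]
```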
All the comments thus far are by people who don't seem to have much experience speeding up Clojure code. If you want Clojure code to perform identically to Java, the facilities are available to do so. It may make more sense, however, to defer to mature Java libraries like Colt or Parallel Colt for vector math. It may also make sense to use Java arrays for the absolute highest-performance iteration.
@Shane's link is so full of outdated information as to be hardly worth looking at. Also, @Shane's comment that the code is slower by a factor of 10 is simply inaccurate (and unsupported; see http://shootout.alioth.debian.org/u32q/compare.php?lang=clojure, and note that these benchmarks don't account for the kinds of optimization possible in 1.2.0 or 1.3.0-alpha1). With a little bit of work it's usually easy to get Clojure code within 4X-5X. Going beyond that usually requires a deeper knowledge of Clojure's fast paths, something that isn't widely disseminated, as Clojure is a fairly young language.
Clojure is plenty fast. But learning how to make it fast is going to take a bit of work/research, as Clojure discourages mutable operations and mutable data structures.