Performance of Java matrix math libraries?


We are computing something whose runtime is bound by matrix operations. (Some details below if interested.) This experience prompted the following question:

Do folks have experience with the performance of Java libraries for matrix math (e.g., multiply, inverse, etc.)? For example:

I searched and found nothing.


Details of our speed comparison:

We are using Intel FORTRAN (ifort (IFORT) 10.1 20070913). We have reimplemented the computation in Java (1.6) using Apache commons math 1.2 matrix ops, and the Java version agrees with the Fortran one to all digits of accuracy. (We have reasons for wanting it in Java.) (Java doubles, Fortran real*8.) Fortran: 6 minutes; Java: 33 minutes; same machine. jvisualvm profiling shows much time spent in RealMatrixImpl.{getEntry,isValidCoordinate} (these appear to be gone in the unreleased Apache commons math 2.0, but 2.0 is no faster). Fortran is using Atlas BLAS routines (dpotrf, etc.).

Obviously this could depend on our code in each language, but we believe most of the time is in equivalent matrix operations.

In several other computations that do not involve libraries, Java has not been much slower, and sometimes much faster.
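
A minimal sketch (the class and method names here are mine, not Apache's) of the difference the profile points at: an accessor-based multiply pays a method call plus an explicit coordinate check per element, while a raw-array loop lets the JIT hoist bounds checks and keep the inner loop tight.

public final class AccessorOverheadSketch {

    // Stand-in for a commons-math-1.2-style matrix: every access is a method
    // call that validates its coordinates before touching the array.
    static final class CheckedMatrix {
        private final double[][] data;
        CheckedMatrix(double[][] data) { this.data = data; }
        double getEntry(int row, int col) {
            if (row < 0 || row >= data.length || col < 0 || col >= data[row].length)
                throw new IndexOutOfBoundsException(row + "," + col);
            return data[row][col];
        }
    }

    // Triple loop through the accessor: a call plus a check on every element read.
    static double[][] multiplyChecked(CheckedMatrix a, CheckedMatrix b, int n) {
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a.getEntry(i, k) * b.getEntry(k, j);
                c[i][j] = sum;
            }
        return c;
    }

    // The same multiply on raw arrays, reordered so the inner loop walks rows
    // sequentially; the JIT can hoist the bounds checks out of it.
    static double[][] multiplyRaw(double[][] a, double[][] b, int n) {
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            double[] ci = c[i];
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];
                double[] bk = b[k];
                for (int j = 0; j < n; j++)
                    ci[j] += aik * bk[j];
            }
        }
        return c;
    }
}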

墟烟 2024-07-20 05:41:58

Jeigen https://github.com/hughperkins/jeigen

  • wraps Eigen C++ library http://eigen.tuxfamily.org , which is one of the fastest free C++ libraries available
  • relatively terse syntax, eg 'mmul', 'sub'
  • handles both dense and sparse matrices

A quick test, by multiplying two dense matrices, ie:

import static jeigen.MatrixUtil.*;

int K = 100;
int N = 100000;
DenseMatrix A = rand(N, K);          // N x K matrix of uniform random values
DenseMatrix B = rand(K, N);          // K x N
Timer timer = new Timer();
DenseMatrix C = B.mmul(A);           // (K x N) * (N x K) -> K x K product
timer.printTimeCheckMilliseconds();  // report elapsed time for the multiply only

Results:

Jama: 4090 ms
Jblas: 1594 ms
Ojalgo: 2381 ms (using two threads)
Jeigen: 2514 ms
  • Compared to jama, everything is faster :-P
  • Compared to jblas, Jeigen is not quite as fast, but it handles sparse matrices.
  • Compared to ojalgo, Jeigen takes about the same amount of elapsed time, but only using one core, so Jeigen uses half the total cpu. Jeigen has a terser syntax, ie 'mmul' versus 'multiplyRight'
2024-07-20 05:41:58

There's a benchmark of various matrix packages available in java up on
http://code.google.com/p/java-matrix-benchmark/ for a few different hardware configurations. But it's no substitute for doing your own benchmark.

Performance is going to vary with the type of hardware you've got (cpu, cores, memory, L1-3 cache, bus speed), the size of the matrices and the algorithms you intend to use. Different libraries have different takes on concurrency for different algorithms, so there's no single answer. You may also find that the overhead of translating to the form expected by a native library negates the performance advantage for your use case (some of the java libraries have more flexible options regarding matrix storage, which can be used for further performance optimizations).

Generally though, JAMA, Jampack and COLT are getting old and do not represent the current state of performance available in Java for linear algebra. More modern libraries make more effective use of multiple cores and CPU caches. JAMA was a reference implementation, and pretty much implements textbook algorithms with little regard for performance. COLT and IBM Ninja were the first Java libraries to show that performance was possible in Java, even if they lagged 50% behind native libraries.

沫雨熙 2024-07-20 05:41:58

I'm the author of the la4j (Linear Algebra for Java) library and here is my point. I've been working on la4j for 3 years (the latest release is 0.4.0 [01 Jun 2013]), and only now can I start doing performance analysis and optimizations, since I've just covered the minimal required functionality. So la4j isn't as fast as I want it to be, but I'm spending loads of my time changing that.

I'm currently in the middle of porting a new version of la4j to the JMatBench platform. I hope the new version will show better performance than the previous one, since there are several improvements I've made in la4j, such as a much faster internal matrix format, unsafe accessors and a fast blocking algorithm for matrix multiplications.
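
For readers unfamiliar with the term, here is a sketch of the general cache-blocking idea (my own illustration, not la4j's actual code): the multiply is organised around BLOCK x BLOCK tiles so that each tile of A, B and C stays resident in cache while it is being reused, instead of streaming whole rows and columns through memory.

public final class BlockedMultiply {
    private static final int BLOCK = 64; // tune to the target L1/L2 cache size

    // C += A * B for square n x n matrices stored as row-major double[n*n].
    static void multiply(double[] a, double[] b, double[] c, int n) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK) {
                    int iMax = Math.min(ii + BLOCK, n);
                    int kMax = Math.min(kk + BLOCK, n);
                    int jMax = Math.min(jj + BLOCK, n);
                    for (int i = ii; i < iMax; i++)
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jMax; j++)
                                c[i * n + j] += aik * b[k * n + j]; // innermost loop is sequential
                        }
                }
    }
}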

昔日梦未散 2024-07-20 05:41:58

Have you taken a look at the Intel Math Kernel Library? It claims to outperform even ATLAS. MKL can be used in Java through JNI wrappers.
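
For the curious, the JNI pattern such a wrapper uses looks roughly like this. The library and method names below are hypothetical placeholders (an existing wrapper, or netlib-java configured with MKL, defines its own), and the companion C code would simply forward the arrays to MKL's cblas_dgemm.

public final class MklBridge {
    static {
        // Loads libmklbridge.so / mklbridge.dll (hypothetical name) from java.library.path.
        System.loadLibrary("mklbridge");
    }

    // C = A * B for row-major n x n matrices flattened to double[n*n];
    // the native implementation would delegate to MKL's cblas_dgemm.
    public static native void dgemm(double[] a, double[] b, double[] c, int n);
}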

比忠 2024-07-20 05:41:58

Linalg code that relies heavily on Pentium and later processors' vector computing capabilities (starting with the MMX extensions, like LAPACK and now Atlas BLAS) is not "fantastically optimized", but simply industry-standard. To replicate that performance in Java you are going to need native libraries. I have had the same performance problem as you describe (mainly, being able to compute Cholesky decompositions) and have found nothing really efficient: Jama is pure Java, since it is supposed to be just a template and reference kit for implementers to follow... which never happened. You know Apache math commons... As for COLT, I have still to test it, but it seems to rely heavily on the Ninja improvements, most of which were reached by building an ad-hoc Java compiler, so I doubt it's going to help.
At that point, I think we "just" need a collective effort to build a native Jama implementation...

恋竹姑娘 2024-07-20 05:41:58

Building on Varkhan's post that Pentium-specific native code would do better:

归途 2024-07-20 05:41:58

We have used COLT for some pretty large serious financial calculations and have been very happy with it. In our heavily profiled code we have almost never had to replace a COLT implementation with one of our own.

In their own testing (obviously not independent) I think they claim within a factor of 2 of the Intel hand-optimised assembler routines. The trick to using it well is making sure that you understand their design philosophy, and avoid extraneous object allocation.
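
As a sketch of what "avoid extraneous object allocation" means in practice with COLT (my example, not theirs): preallocate the result matrix once and let zMult write into it, so a multiply inside a hot loop creates no temporaries.

import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;

public final class ColtNoAlloc {
    public static void main(String[] args) {
        int n = 1000;
        DoubleMatrix2D a = new DenseDoubleMatrix2D(n, n);
        DoubleMatrix2D b = new DenseDoubleMatrix2D(n, n);
        DoubleMatrix2D c = new DenseDoubleMatrix2D(n, n); // preallocated result buffer

        a.assign(1.0);
        b.assign(2.0);
        for (int iter = 0; iter < 100; iter++) {
            a.zMult(b, c); // writes a*b into c; no temporary matrix per iteration
        }
        System.out.println(c.getQuick(0, 0));
    }
}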

☆獨立☆ 2024-07-20 05:41:58

I have found that if you are creating a lot of high-dimensional matrices, you can make Jama about 20% faster by changing it to use a single-dimensional array instead of a two-dimensional array. This is because Java doesn't support multi-dimensional arrays efficiently: it creates an array of arrays.

Colt does this already, but I have found it is more complicated and more powerful than Jama, which may explain why simple functions are slower with Colt.

The answer really depends on what you are doing. Jama doesn't support a fraction of the things Colt can do, which may make more of a difference.
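
The change being described, in miniature (a sketch, not Jama's actual code): a double[rows][cols] in Java is an array of separately allocated row objects, which costs an extra pointer hop per access, whereas a flat row-major double[rows*cols] is one contiguous block.

public final class FlatMatrix {
    final int rows, cols;
    final double[] data; // row-major: element (i, j) lives at data[i*cols + j]

    FlatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    double get(int i, int j)           { return data[i * cols + j]; }
    void   set(int i, int j, double v) { data[i * cols + j] = v; }
}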

陌路终见情 2024-07-20 05:41:58

You may want to check out the jblas project. It's a relatively new Java library that uses BLAS, LAPACK and ATLAS for high-performance matrix operations.

The developer has posted some benchmarks in which jblas comes off favourably against MTJ and Colt.

森罗 2024-07-20 05:41:58

For 3D graphics applications, the lwjgl.util vector implementation outperformed the above-mentioned jblas by a factor of about 3.

I have done 1 million matrix multiplications of a vec4 with a 4x4 matrix.

lwjgl finished in about 18 ms; jblas required about 60 ms.

(I assume that the JNI approach is not very suitable for fast, successive application of relatively small multiplications, since the translation/mapping may take more time than the actual execution of the multiplication.)
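
Roughly the shape of that test, sketched against LWJGL 2's org.lwjgl.util.vector classes (the matrix contents and loop count are illustrative):

import org.lwjgl.util.vector.Matrix4f;
import org.lwjgl.util.vector.Vector4f;

public final class Vec4Bench {
    public static void main(String[] args) {
        Matrix4f m = new Matrix4f();        // identity matrix
        Vector4f v = new Vector4f(1, 2, 3, 1);
        Vector4f dest = new Vector4f();     // reused result: no allocation in the loop

        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            Matrix4f.transform(m, v, dest); // dest = m * v, all in pure Java, no JNI crossing
        }
        System.out.printf("%d ms%n", (System.nanoTime() - t0) / 1_000_000);
    }
}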

尘世孤行 2024-07-20 05:41:58

There's also UJMP

蓝眸 2024-07-20 05:41:58

There are many different freely available Java linear algebra libraries. http://www.ujmp.org/java-matrix/benchmark/
Unfortunately, that benchmark only gives you info about matrix multiplication (with transposing, the test does not allow the different libraries to exploit their respective design features).

What you should look at is how these linear algebra libraries perform when asked to compute various matrix decompositions.
http://ojalgo.org/matrix_compare.html

庆幸我还是我 2024-07-20 05:41:58

Matrix Toolkits Java (MTJ) was already mentioned before, but perhaps it's worth mentioning again for anyone else stumbling onto this thread. For those interested, it seems there's also talk of having MTJ replace the linalg library in apache commons math 2.0, though I'm not sure how that's been progressing lately.

一萌ing 2024-07-20 05:41:58

You should add Apache Mahout to your shopping list.

夏了南城 2024-07-20 05:41:57

I'm the author of Java Matrix Benchmark (JMatBench) and I'll give my thoughts on this discussion.

There are significant differences between Java libraries, and while there is no clear winner across the whole range of operations, there are a few clear leaders, as can be seen in the latest performance results (October 2013).

If you are working with "large" matrices and can use native libraries, then the clear winner (about 3.5x faster) is MTJ with system optimised netlib. If you need a pure Java solution then MTJ, OjAlgo, EJML and Parallel Colt are good choices. For small matrices EJML is the clear winner.

The libraries I did not mention showed significant performance issues or were missing key features.
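
Minimal MTJ usage, as a sketch: MTJ routes its BLAS/LAPACK work through netlib-java, which picks up a system-optimised native BLAS when one is installed and falls back to pure Java otherwise; that is why MTJ shows up both as a native-backed and a pure-Java option above.

import java.util.Random;

import no.uib.cipr.matrix.DenseMatrix;

public final class MtjExample {
    public static void main(String[] args) {
        int n = 1000;
        DenseMatrix a = new DenseMatrix(n, n);
        DenseMatrix b = new DenseMatrix(n, n);
        DenseMatrix c = new DenseMatrix(n, n); // result buffer

        Random rnd = new Random(42);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a.set(i, j, rnd.nextDouble());
                b.set(i, j, rnd.nextDouble());
            }

        a.mult(b, c); // c = a * b, routed to dgemm when a native BLAS is present
    }
}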

肥爪爪 2024-07-20 05:41:57

Just to add my 2 cents. I've compared some of these libraries. I attempted to matrix multiply a 3000 by 3000 matrix of doubles with itself. The results are as follows.

Using multithreaded ATLAS with C/C++, Octave, Python and R, the time taken was around 4 seconds.

Using Jama with Java, the time taken was 50 seconds.

Using Colt and Parallel Colt with Java, the time taken was 150 seconds!

Using JBLAS with Java, the time taken was again around 4 seconds as JBLAS uses multithreaded ATLAS.

So for me it was clear that the Java libraries didn't perform too well. However if someone has to code in Java, then the best option is JBLAS. Jama, Colt and Parallel Colt are not fast.
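
For anyone who wants to reproduce this, the comparison is roughly the following (a sketch; both calls shown are the libraries' real entry points, and timings will of course vary by machine):

import Jama.Matrix;
import org.jblas.DoubleMatrix;

public final class MultiplyBench {
    public static void main(String[] args) {
        int n = 3000;

        Matrix ja = Matrix.random(n, n);           // pure-Java Jama
        long t0 = System.nanoTime();
        ja.times(ja);                              // A * A
        System.out.printf("Jama:  %.1f s%n", (System.nanoTime() - t0) / 1e9);

        DoubleMatrix jb = DoubleMatrix.rand(n, n); // jblas, backed by native ATLAS
        t0 = System.nanoTime();
        jb.mmul(jb);                               // A * A
        System.out.printf("jblas: %.1f s%n", (System.nanoTime() - t0) / 1e9);
    }
}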

半步萧音过轻尘 2024-07-20 05:41:57

I'm the main author of jblas and wanted to point out that I've released Version 1.0 in late December 2009. I worked a lot on the packaging, meaning that you can now just download a "fat jar" with ATLAS and JNI libraries for Windows, Linux, Mac OS X, 32 and 64 bit (except for Windows). This way you will get the native performance just by adding the jar file to your classpath. Check it out at http://jblas.org!

书间行客 2024-07-20 05:41:57

I just compared Apache Commons Math with jlapack.

Test: singular value decomposition of a random 1024x1024 matrix.

Machine: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, linux x64

Octave code: A=rand(1024); tic;[U,S,V]=svd(A);toc

results                                execution time
---------------------------------------------------------
Octave                                 36.34 sec

JDK 1.7u2 64bit
    jlapack dgesvd                     37.78 sec
    apache commons math SVD            42.24 sec


JDK 1.6u30 64bit
    jlapack dgesvd                     48.68 sec
    apache commons math SVD            50.59 sec

Native routines
Lapack* invoked from C:                37.64 sec
Intel MKL                               6.89 sec(!)

My conclusion is that jlapack called from JDK 1.7 is very close to the native
binary performance of lapack. I used the lapack binary library that comes with the Linux distro and invoked the dgesvd routine to get the U, S and VT matrices as well. All tests were done using double precision on exactly the same matrix each run (except Octave).

Disclaimer - I'm not an expert in linear algebra, I'm not affiliated with any of the libraries above, and this is not a rigorous benchmark.
It's a 'home-made' test, as I was interested in comparing the performance increase of JDK 1.7 over 1.6, as well as commons math SVD versus jlapack.
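
For reference, the Commons Math side of such a test looks roughly like this; it is sketched here with the current 3.x API (Array2DRowRealMatrix / SingularValueDecomposition), whereas the run above used the 2.x-era classes.

import java.util.Random;

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public final class SvdBench {
    public static void main(String[] args) {
        int n = 1024;
        Random rnd = new Random(1);
        double[][] data = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                data[i][j] = rnd.nextDouble();

        RealMatrix a = new Array2DRowRealMatrix(data, false); // no defensive copy
        long t0 = System.nanoTime();
        SingularValueDecomposition svd = new SingularValueDecomposition(a);
        svd.getU(); svd.getS(); svd.getV();                   // materialise U, S, V
        System.out.printf("SVD: %.2f s%n", (System.nanoTime() - t0) / 1e9);
    }
}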

半世蒼涼 2024-07-20 05:41:57

I can't really comment on specific libraries, but in principle there's little reason for such operations to be slower in Java. Hotspot generally does the kinds of things you'd expect a compiler to do: it compiles basic math operations on Java variables to corresponding machine instructions (it uses SSE instructions, but only one per operation); accesses to elements of an array are compiled to use "raw" MOV instructions as you'd expect; it makes decisions on how to allocate variables to registers when it can; it re-orders instructions to take advantage of processor architecture... A possible exception is that as I mentioned, Hotspot will only perform one operation per SSE instruction; in principle you could have a fantastically optimised matrix library that performed multiple operations per instruction, although I don't know if, say, your particular FORTRAN library does so or if such a library even exists. If it does, there's currently no way for Java (or at least, Hotspot) to compete with that (though you could of course write your own native library with those optimisations to call from Java).

So what does all this mean? Well:

  • in principle, it is worth hunting around for a better-performing library, though unfortunately I can't recommend one
  • if performance is really critical to you, I would consider just coding your own matrix operations, because you may then be able to perform certain optimisations that a library generally can't, or that a particular library you're using doesn't (if you have a multiprocessor machine, find out if the library is actually multithreaded)

A hindrance to matrix operations is often the data locality issues that arise when you need to traverse both row by row and column by column, e.g. in matrix multiplication, since you have to store the data in an order that optimises one or the other. But if you hand-write the code, you can sometimes combine operations to optimise data locality (e.g. if you're multiplying a matrix by its transpose, you can turn a column traversal into a row traversal if you write a dedicated function instead of combining two library functions). As usual in life, a library will give you non-optimal performance in exchange for faster development; you need to decide just how important performance is to you.
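
The transpose example above, sketched: because (A * A^T)[i][j] is just the dot product of rows i and j of A, a dedicated routine never traverses a column. Every inner-loop access is sequential in memory, and symmetry halves the work.

public final class MultiplyByTranspose {
    // Returns C = A * A^T for an m x n matrix A.
    static double[][] timesTranspose(double[][] a) {
        int m = a.length;
        double[][] c = new double[m][m];
        for (int i = 0; i < m; i++) {
            double[] ri = a[i];
            for (int j = i; j < m; j++) {   // C is symmetric: compute only half
                double[] rj = a[j];
                double sum = 0.0;
                for (int k = 0; k < ri.length; k++)
                    sum += ri[k] * rj[k];   // row-by-row access, cache-friendly
                c[i][j] = sum;
                c[j][i] = sum;
            }
        }
        return c;
    }
}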
