矩阵实现基准测试,我应该鞭打自己吗?
我正在尝试在线查找一些矩阵乘法/求逆基准。 我的 C++ 实现目前可以在 38 秒内反转 100 x 100 矩阵,但与 这个我发现的基准,我的实施的表现确实很糟糕。 我不知道这是否是一个超级优化的东西,或者是否真的可以在大约 0.11 秒内轻松反转 200 x 200 矩阵,所以我正在寻找更多基准来比较结果。 大神有好的链接吗?
更新 我在乘法代码中发现了一个错误,该错误不会影响结果,但会导致无用的循环浪费。 现在我的反转在 20 秒内执行。 还有很多时间,欢迎提出任何想法。
谢谢各位
I'm trying to find out some matrix multiplication/inversion benchmarks online. My C++ implementation can currently invert a 100 x 100 matrix in 38 seconds, but compared to this benchmark I found, my implementation's performances really suck. I don't know if it's a super-optimized something or if really you can easily invert a 200 x 200 matrix in about 0.11 seconds, so I'm looking for more benchmarks to compare the results. Have you god some good link?
UPDATE
I spotted a bug in my multiplication code, that didn't affect the result but was causing useless cycle waste. Now my inversion executes in 20 seconds. It's still a lot of time, and any idea is welcome.
Thank you folks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
此类操作对缓存极其敏感。 您希望对 L1 和 L1 中的变量进行大部分工作。 二级缓存。 查看本文档的第 6 节:
http://people.redhat.com/drepper/cpumemory。 pdf
他引导您以缓存优化的方式优化矩阵乘法,并获得一些重大的性能改进。
This sort of operation is extremely cache sensitive. You want to be doing most of your work on variables that are in your L1 & L2 cache. Check out section 6 of this doc:
http://people.redhat.com/drepper/cpumemory.pdf
He walks you through optimizing a matrix multiply in a cache-optimized way and gets some big perf improvements.
检查您是否按值传递巨大的矩阵对象(因为如果复制整个矩阵,这可能会代价高昂)。
如果可能的话通过参考。
关于矩阵和 C++ 的问题是您希望尽可能避免复制。
因此,您的主要对象可能不应该包含“矩阵数据”,而是包含有关矩阵的元数据和指向数据部分的指针(由智能的东西包裹)。 因此,在复制对象时,您仅复制一小部分数据,而不是整个数据(请参阅字符串实现的示例)。
Check if you are passing huge matrix objects by value (As this could be costly if copying the whole matrix).
If possable pass by reference.
The thing about matricies and C++ is that you want to avoid copying as much as possable.
So your main object should probably not conatain the "matrix data" but rather contain meta data about the matrix and a pointer (wrapped in by somthing smart) to the data portion. Thus when copying an object you only copy a small chunk of data not the whole thing (see string implementation for an example).
为什么您首先需要实现自己的矩阵库? 正如您已经发现的,已经有非常高效的库可以做同样的事情。 尽管人们喜欢将 C++ 视为一种性能语言,但只有当您真的很擅长该语言时,这种情况才是正确的。 用 C++ 编写非常慢的代码是非常容易的。
Why do you need to implement your own matrix library in the first place? As you've already discovered, there are already extremely efficient libraries available doing the same thing. And as much as people like to think of C++ as a performance language, that's only true if you're really good at the language. It is extremely easy to write terribly slow code in C++.
MATLAB 也不费吹灰之力就能做到这一点。 您是否正在实施用于矩阵求逆(例如 LU 分解)的 LAPACK 例程?
MATLAB does that without breaking a sweat either. Are you implementing the LAPACK routines for matrix inversion (e.g. LU decomposition)?
您尝试过对其进行分析吗?
按照这篇论文 (pdf),计算 100x100进行 LU 分解的矩阵将需要 1348250(浮点运算)。 核心 2 可以执行大约 20 Gflops(处理器指标) 。 所以理论上来说你可以在 1 毫秒内完成一次反转。
如果没有代码,很难断言造成巨大差距的原因是什么。 根据我尝试微优化(如循环展开、缓存值、SEE、线程等)的经验,您只会获得加速,这充其量只是您当前的一个常数因子(这可能对您来说足够了)。
但是如果你想要一个数量级的速度提升,你应该看看你的算法,也许你的 LU 分解的实现有一个错误。 另一个需要注意的地方是数据的组织,尝试不同的组织,将行/列元素放在一起。
Have you tried profiling it?
Following this paper (pdf), the calculation for a 100x100 matrix with LU decomposition will need 1348250 (floating point operations). A core 2 can do around 20 Gflops (processor metrics). So theoretically speaking you can do an inversion in 1 ms.
Without the code is pretty difficult to assert what is the cause of the large gap. From my experience trying micro-optimization like loop unrolling, caching values, SEE, threading, etc, you only will get a speed up, which at best is only a constant factor of you current (which maybe enough for you).
But if you want an order of magnitude speed increase you should take a look at your algorithm, perhaps your implementation of LU decomposition have a bug. Another place to take a look is the organization of your data, try different organization, put row/columns elements together.
LINPACK 基准测试基于解决线性代数问题。 它们可用于不同的机器和语言。 也许他们也能帮助你。
LINPACK C++ 库也可在此处获取。
The LINPACK benchmarks are based on solving linear algebra problems. They're available for different machines and languages. Maybe they can help you, too.
LINPACK C++ libraries available here, too.
实际上,我使用 **
double
**s 而不是 **long double
**s 获得了大约 7 秒的时间,但这并不是什么大问题,因为我损失了一半的精度。I actually gained about 7 seconds using **
double
**s instead of **long double
**s, but that's not a great deal since I lost half of my precision.