How can I speed up my sparse matrix solver?
I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
    for (size_t x = 1; x < d_nx - 1; ++x) {
        d_x[ic] = d_b[ic]
            - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
            - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
        ++ic; ++iw; ++ie; ++is; ++in;
    }
    ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away; it is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bit system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
- Compiling with g++ -O3. (Well, I'd been doing this from the beginning.)
- Parallelizing over 4 cores using OpenMP pragmas. I had to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
- Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative because NaN and Inf values may slow things down. The example code now clears the memory. It makes no difference in execution speed for me, though.
7 Answers
Couple of ideas:
Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale", so your convergence rate will suffer (but not as badly as a Jacobi iteration); it's hard to say whether the speedup offsets it.
Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using an FFT-based method.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
I think I've managed to optimize it; here's the code. Create a new project in VC++, add this code, and simply compile under "Release".
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind it:
You've been using index counters to access/manipulate.
I use pointers.
While running, set a breakpoint at one of the calculation lines in VC++'s debugger and press F8. You'll get the disassembly window, where you'll see the produced opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the PC to put the pointer x into a register (say EAX), then add 3 * sizeof(int) to it, and only then store the value 123. The pointer approach is much better, as you can see, because we cut out that addressing arithmetic: we handle it ourselves, and can thus optimize it as needed.
I hope this helps.
Sidenote to stackoverflow.com's staff:
Great website, I wish I'd heard of it long ago!
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast.

It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b, w, e, s, n arrays are all at separate locations in memory. If you could not fit the problem in the L3 cache (mostly in L2), that would be bad, and a solution of this sort would be helpful. For example, at 1280x1280 this solution is a little less than 2x faster than Poni's (13s vs. 23s in my test; your original implementation takes 22s), while at 128x128 it's 30% slower (7s vs. 10s; your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix, so I'm guessing that there are a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
I suggest putting in some prefetch statements and also researching "data oriented design":
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.