Loop optimization with the IBM xlC compiler and AltiVec
I was just playing around with the AltiVec extension on a POWER6 cluster we have. When I compiled the code below without any optimization, my speedup was 4, as I expected. However, when I compiled it again with the -O3 flag, I got a speedup of 60!
Does anyone have more experience with this and can offer some insight into how the compiler is rearranging my code to produce such a speedup? Is the only possible optimization here through assembly and instruction pipelining, or is there something else I am missing that I could use in future work?
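// Built with AltiVec enabled (the question used xlC);
// GCC additionally needs #include <altivec.h>.
#include <iostream>

int main(void) {
    const int m = 1000;
    __vector signed int va;
    __vector signed int vb;
    __vector signed int vc;
    __vector signed int vd;
    // vec_ld/vec_st ignore the low 4 address bits, so keep the arrays 16-byte aligned.
    int a[m] __attribute__((aligned(16)));
    int b[m] __attribute__((aligned(16)));
    int c[m] __attribute__((aligned(16)));

    for (int i = 0; i < m; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    for (int cnt = 0; cnt < 10000000; cnt++) {
        // Broadcast cnt into all four lanes.
        vd = (__vector signed int){cnt, cnt, cnt, cnt};
        // As written, i takes the values 0, 4, ..., 248: 63 iterations that
        // touch only elements 0..251, not the whole array.
        for (int i = 0; i < m/4; i += 4) {
            va = vec_ld(0, &a[i]);
            vb = vec_ld(0, &b[i]);
            vc = vec_add(vd, vec_add(va, vb));
            vec_st(vc, 0, &c[i]);
        }
    }

    std::cout << c[0] << ", " << c[1] << ", " << c[2] << ", " << c[3] << "\n";
    return 0;
}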
Comments (1)
I've done some stuff on Power 7, and I have seen very odd things with the XLC compiler. But not as odd as this! (not 60x at least...)
One thing to note about the PowerPC series (at least for Power6 and Power7) is that instruction latencies are very long and out-of-order execution is very weak compared to x86/x64.
Therefore, the inner loop (as written in your code) will get extremely low IPC.
Now, the only way I can imagine you getting 60x speedup is that the inner loop is completely unrolled under -O3. This is possible since the trip count of the inner loop can be statically determined to be 63.
Unrolling that inner loop will basically allow the entire pipeline to be filled.
Of course I'm just guessing. Your best bet is to look at the assembly.
Also, how are you timing this? A lot of the weird behavior I've seen on PowerPC is from the timers themselves...
EDIT:
Since your sample code is fairly simple, it should be very easy to spot (in the assembly) whether or not that inner loop is partially or completely unrolled.