IBM xlC compiler and loop optimization with Altivec

Posted on 2024-12-05 01:19:57

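I was experimenting with the Altivec extension on a POWER6 cluster we have. When I compile the code below without any optimization, I get a speedup of 4, as I expected. However, when I compile it again with the -O3 flag, I get a speedup of 60!

I am wondering whether anyone has more experience with this and can offer some insight into how the compiler is rearranging my code to achieve such a speedup. Is the only possible optimization here at the level of assembly and instruction pipelining, or is there something else I am missing that I could use in future work?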

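#include <altivec.h>   // AltiVec intrinsics: vec_ld, vec_add, vec_st
#include <iostream>

int main(void) {
        const int m = 1000;

        __vector signed int va;
        __vector signed int vb;
        __vector signed int vc;
        __vector signed int vd;

        // vec_ld and vec_st ignore the low four address bits, so the
        // arrays are aligned to 16 bytes.
        int a[m] __attribute__((aligned(16)));
        int b[m] __attribute__((aligned(16)));
        int c[m] __attribute__((aligned(16)));

        for( int i=0 ; i < m ; i++ ) {
                a[i] = i;
                b[i] = i;
                c[i] = 0;
        }

        // Repeat the vector kernel many times so that it can be timed.
        for( int cnt = 0 ; cnt < 10000000 ; cnt++ ) {
                // Four copies of the current repeat counter.
                vd = (__vector signed int){cnt,cnt,cnt,cnt};

                // NOTE: with the bound m/4 and a step of 4, this loop only
                // touches elements 0..251; a full pass would use i < m.
                for( int i = 0 ; i < m/4 ; i+=4 ) {
                        va = vec_ld(0, &a[i]);
                        vb = vec_ld(0, &b[i]);
                        vc = vec_add(vd, vec_add(va,vb));   // c = cnt + (a + b)
                        vec_st(vc, 0, &c[i]);
                }
        }

        // Only the values left by the last pass are ever read.
        std::cout << c[0] << ", " << c[1] << ", " << c[2] << ", " << c[3] << "\n";

        return 0;
}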
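
One thing that may be worth checking, independent of xlC: in the code above every pass of the outer loop overwrites c with a[i] + b[i] + cnt, and only the values left by the last pass are printed. The sketch below is a scalar illustration, not the posted Altivec code. The names run_all_reps and run_last_rep_only and the parameters n and reps are hypothetical and introduced here only for the comparison. It shows that the full loop nest and a single final pass produce the same array. Whether xlC at -O3 actually exploits this is an assumption that would have to be confirmed by inspecting the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference: the same arithmetic as the Altivec loop,
// one int at a time. `n` is the number of elements touched and `reps` is
// the outer repeat count (10000000 in the post).
std::vector<int> run_all_reps(const std::vector<int>& a,
                              const std::vector<int>& b,
                              std::size_t n, int reps) {
    assert(n <= a.size() && a.size() == b.size());
    std::vector<int> c(a.size(), 0);
    for (int cnt = 0; cnt < reps; ++cnt) {
        for (std::size_t i = 0; i < n; ++i) {
            c[i] = cnt + (a[i] + b[i]);  // overwrites the previous pass
        }
    }
    return c;
}

// Same observable result with a single pass: a and b never change, so
// only the final value of cnt can show up in c.
std::vector<int> run_last_rep_only(const std::vector<int>& a,
                                   const std::vector<int>& b,
                                   std::size_t n, int reps) {
    assert(n <= a.size() && a.size() == b.size());
    std::vector<int> c(a.size(), 0);
    if (reps > 0) {
        const int last = reps - 1;
        for (std::size_t i = 0; i < n; ++i) {
            c[i] = last + (a[i] + b[i]);
        }
    }
    return c;
}
```

Calling both functions with n = 252 (the elements the posted inner loop touches) and the same positive reps returns equal vectors. A benchmark whose output depends only on the last pass therefore leaves an optimizer room well beyond instruction scheduling, which is one possible reason a measured speedup could exceed the vector width of 4.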
