SSE instructions inside nested for loops



I have several nested for loops in my code, and I'm trying to use Intel SSE instructions on an Intel i7 core to speed up the application.
The code structure is as follows (val is set in a higher for loop):

__m128 in1,in2,tmp1,tmp2,out;
float arr[4] __attribute__ ((aligned(16)));
val = ...;

... several higher for loops ...
for(f=0; f<=fend; f=f+4){
    index2 = ...;
    for(i=0; i<iend; i++){
        for(j=0; j<jend; j++){
            inputval = ...;
            index = ...;
            if(f<fend-4){
                arr[0] = array[index];
                arr[1] = array[index+val];
                arr[2] = array[index+2*val];
                arr[3] = array[index+3*val];
                in1  = _mm_load_ps(arr);
                in2  = _mm_set_ps1(inputval);
                tmp1 = _mm_mul_ps(in1, in2);
                tmp2 = _mm_loadu_ps(&array2[index2]);
                out  = _mm_add_ps(tmp1,tmp2);
                _mm_storeu_ps(&array2[index2], out);
            } else {
                // if fewer than 4 values remain for the SSE path, fall back to scalar code
                for(int u = 0; u < fend-f; u++ ) array2[index2+u] += array[index+u*val] * inputval;
            }
        }
    }
}

I think there are two main problems: the buffer used for aligning the values from 'array', and the leftover handling when fewer than 4 values remain (e.g. when fend = 6, two values are left over and have to be handled by the sequential code). Is there any other way of loading the values into in1 and/or executing SSE instructions with only 3 or 2 values?


Thanks for the answers so far. The loading is as good as it gets, I think, but is there any workaround for the 'leftover' part within the else branch that could be handled using SSE instructions?


Comments (3)

陌伤ぢ 2024-12-26 08:21:12


I think the bigger problem is that there is so little computation for such a massive amount of data movement:

arr[0] = array[index];                   //  Data Movement
arr[1] = array[index+val];               //  Data Movement
arr[2] = array[index+2*val];             //  Data Movement
arr[3] = array[index+3*val];             //  Data Movement
in1  = _mm_load_ps(arr);                 //  Data Movement
in2  = _mm_set_ps1(inputval);            //  Data Movement
tmp1 = _mm_mul_ps(in1, in2);             //  Computation
tmp2 = _mm_loadu_ps(&array2[index2]);    //  Data Movement
out  = _mm_add_ps(tmp1,tmp2);            //  Computation
_mm_storeu_ps(&array2[index2], out);     //  Data Movement

While it "might" be possible to simplify this. I'm not at all convinced that vectorization is going to be beneficial at all in this situation.

You'll have to change your data layout to make avoid the strided access index + n*val.

Or you can wait until AVX2 gather/scatter instructions become available in 2013?
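
For what it's worth, once AVX2 hardware is available, the four strided scalar loads could in principle collapse into a single gather. A minimal sketch under that assumption; the helper name and argument list are just for illustration, with the indices and stride as in the question:

#include <immintrin.h>

// Hypothetical helper: the body of the vectorised branch, with the four
// strided loads replaced by one AVX2 gather (requires an AVX2-capable CPU).
static inline void fma_strided_gather(const float *array, float *array2,
                                      int index, int index2, int val,
                                      float inputval)
{
    // Element offsets 0, val, 2*val, 3*val; scale 4 converts them to bytes.
    __m128i offsets = _mm_setr_epi32(0, val, 2 * val, 3 * val);
    __m128  in1     = _mm_i32gather_ps(&array[index], offsets, 4);
    __m128  in2     = _mm_set_ps1(inputval);
    __m128  tmp2    = _mm_loadu_ps(&array2[index2]);
    _mm_storeu_ps(&array2[index2], _mm_add_ps(_mm_mul_ps(in1, in2), tmp2));
}

This would need to be compiled with AVX2 enabled (e.g. -mavx2 on GCC), and it only removes the staging buffer: the memory accesses themselves remain strided, which is the underlying problem noted above.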

毁虫ゝ 2024-12-26 08:21:12


You can express this:

            arr[0] = array[index];
            arr[1] = array[index+val];
            arr[2] = array[index+2*val];
            arr[3] = array[index+3*val];
            in1  = _mm_load_ps(arr);

more succinctly as:

            in1  = _mm_set_ps(array[index+3*val], array[index+2*val], array[index+val], array[index]);

and get rid of arr, which might give the compiler some opportunity to optimise away some redundant loads/stores.

However, your data organisation is the main problem, compounded by the fact that you are doing almost no computation relative to the number of loads and stores, two of which are unaligned. If possible, you need to re-organise your data structures so that you can load and store 4 elements at a time from aligned, contiguous memory in all cases; otherwise any computational benefits will tend to be outweighed by inefficient memory access patterns.
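
As one illustration of what such a reorganisation could look like, the strided elements could be packed once into a contiguous, 16-byte-aligned buffer outside the hot loops, so that the inner loop only issues aligned full-width loads. This is only a sketch under that assumption; the helper name, bounds, and ownership convention are hypothetical:

#include <stdlib.h>
#include <string.h>

// Hypothetical packing step: copy array[base + n*val] for n = 0..count-1 into
// a contiguous, 16-byte-aligned buffer (zero-padded up to a multiple of 4
// floats) so the inner loop can use _mm_load_ps on it directly.
float *pack_strided(const float *array, int base, int val, int count)
{
    size_t padded = (size_t)(count + 3) & ~(size_t)3;          // round up to 4
    float *packed = aligned_alloc(16, padded * sizeof(float)); // C11
    memset(packed, 0, padded * sizeof(float));                 // neutral padding
    for (int n = 0; n < count; n++)
        packed[n] = array[base + n * val];
    return packed; // caller frees
}

The copy itself is extra data movement, so it only pays off if the packed buffer is reused across the inner i/j loops, or better still if the data is produced in this layout in the first place.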

停滞 2024-12-26 08:21:12


If you want the full benefit of SSE (a factor of 4 or more over the best optimised code without explicit use of SSE), you must ensure that your data layout is such that you only ever need aligned loads and stores. Though using _mm_set_ps(w,z,y,x) in your code snippet may help, you should avoid the need for this, i.e. avoid strided accesses (they are less efficient than a single _mm_load_ps).

As for the problem of the last few (< 4) elements, I usually ensure that all my data is not only 16-byte aligned, but also that array sizes are multiples of 16 bytes, so that I never have such spare remaining elements. Of course, the real problem may have spare elements, but that data can usually be set up so that it doesn't cause a problem (set to the neutral element, i.e. zero for additive operations). In rare cases you only want to work on a subset of the array which starts and/or ends at an unaligned position. In that case one may use bitwise operations (_mm_and_ps, _mm_or_ps) to suppress operations on the unwanted elements.
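
To make the masking idea concrete for the leftover case in the question, here is a minimal sketch. It assumes array and array2 are padded so that a full 4-float read and write at these positions never runs past the allocation; the helper and mask-table names are just for illustration:

#include <immintrin.h>

/* Sketch of the masking idea for the leftover case (1..3 valid lanes).
 * Assumes 'array' and 'array2' are padded so that touching a full 4-float
 * vector here never runs off the end of the allocation. */
static inline void tail_masked(const float *array, float *array2,
                               int index, int index2, int val,
                               float inputval, int remaining /* 1..3 */)
{
    /* Per-lane masks: all bits set for valid lanes, zero elsewhere. */
    static const unsigned int masks[4][4] __attribute__((aligned(16))) = {
        { 0,   0,   0,   0 },
        { ~0u, 0,   0,   0 },
        { ~0u, ~0u, 0,   0 },
        { ~0u, ~0u, ~0u, 0 },
    };
    __m128 mask = _mm_castsi128_ps(
                      _mm_load_si128((const __m128i *)masks[remaining]));

    __m128 in1  = _mm_set_ps(array[index + 3 * val], array[index + 2 * val],
                             array[index + val],     array[index]);
    /* Zero the products in the unwanted lanes, so the add contributes
     * nothing to the padding elements of array2. */
    __m128 prod = _mm_and_ps(_mm_mul_ps(in1, _mm_set_ps1(inputval)), mask);
    __m128 out  = _mm_add_ps(_mm_loadu_ps(&array2[index2]), prod);
    _mm_storeu_ps(&array2[index2], out);
}

Because the masked lanes contribute a product of zero, the full-width read-modify-write leaves the padding elements of array2 effectively untouched.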
