How can I make the code below faster?

Posted on 2024-10-08 09:19:09

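int i, k, l, u1, u2;
__m128i simda, simdb;
/* unsigned long is 64 bits long here; the aligned loads and stores below
   assume that all arrays are 16-byte aligned */
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40];
/* res1 and res2 are initialized to zero */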

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);  
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res1[i + k], simdb);  

            simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);  
            simdb = _mm_load_si128 ((__m128i *)&res2[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res2[i + k], simdb);  
        } 
    }
    l -= 4;
    /* All res1, res2 values are shifted left by 4 bits here. */
}
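
The shift step in the loop above is only described in words. A minimal scalar sketch of it follows, assuming each array is treated as one multiword value with index 0 as the least significant word (the question does not state the word order). The name shift_left4 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift an n-word value left by 4 bits in place.
   Assumption: res[0] is the least significant 64-bit word. */
static void shift_left4(uint64_t *res, size_t n)
{
    assert(res != NULL || n == 0);
    /* Walk from the top word down so each word still sees the
       unshifted word below it when pulling in the carried nibble. */
    for (size_t j = n; j-- > 1; )
        res[j] = (res[j] << 4) | (res[j - 1] >> 60);
    if (n > 0)
        res[0] <<= 4;
}
```

In the question's loop this would be called as shift_left4(res1, 40) and shift_left4(res2, 40) after each pass. The four bits shifted out of the top word are discarded.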

The code above is called many times in my program (the profiler shows 98%).

EDIT: In the inner loop, the res1[i + k] values are loaded many times for the same (i + k). I tried the following inside the while loop: I loaded all the res1 values into an array of SIMD registers and updated the array elements inside the innermost for loop. Once both for loops were done, I stored the array values back to res1 and res2. But the computation time increased with this change. Any idea where I went wrong? The idea seemed correct.

Any suggestions to make it faster are welcome.

Comments (4)

萌逼全场 2024-10-15 09:19:09

Unfortunately the most obvious optimisations are probably already being done by the compiler:

  • You can pull &_mulpre[u1] and &_mulpre[u2] out of the inner loop.
  • You can pull &res1[i] out of the inner loop.
  • Using different variables for the two inner operations, and reordering them, might allow for better pipelining.

Possibly swapping the outer loops would improve cache locality on elm1.
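
The first three suggestions can be sketched as follows. This is an illustration under stated assumptions, not a measured improvement: the function name comb_pass is hypothetical, the table is passed as a flat row-major pointer (named mulpre here) rather than as the question's global _mulpre, and all arrays are assumed to be 16-byte aligned.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* One pass of the question's two for loops for a fixed shift l.
   mulpre points to a 16 x 20 table stored row-major (an assumption). */
static void comb_pass(uint64_t res1[40], uint64_t res2[40],
                      const uint64_t elm1[20],
                      const uint64_t *mulpre, int l)
{
    assert(((uintptr_t)res1 & 15) == 0 && ((uintptr_t)res2 & 15) == 0);
    assert(((uintptr_t)mulpre & 15) == 0);

    for (int i = 0; i < 20; i += 2) {
        /* Row and output addresses are computed once per i. */
        const uint64_t *row1 = mulpre + 20 * ((elm1[i] >> l) & 15);
        const uint64_t *row2 = mulpre + 20 * ((elm1[i + 1] >> l) & 15);
        uint64_t *out1 = res1 + i;
        uint64_t *out2 = res2 + i;

        for (int k = 0; k < 20; k += 2) {
            /* Separate variables keep the two XOR chains independent. */
            __m128i a1 = _mm_load_si128((const __m128i *)(row1 + k));
            __m128i a2 = _mm_load_si128((const __m128i *)(row2 + k));
            __m128i b1 = _mm_load_si128((const __m128i *)(out1 + k));
            __m128i b2 = _mm_load_si128((const __m128i *)(out2 + k));
            _mm_store_si128((__m128i *)(out1 + k), _mm_xor_si128(a1, b1));
            _mm_store_si128((__m128i *)(out2 + k), _mm_xor_si128(a2, b2));
        }
    }
}
```

As the answer notes, an optimizing compiler probably performs the same hoisting on its own, so comparing the generated assembly or timing both versions is the only way to tell whether this helps.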

岁月无声 2024-10-15 09:19:09

Well, you could always call it fewer times :-)

The total input and output data looks relatively small. Depending on your design and expected input, it might be feasible to cache the computations, or to evaluate them lazily instead of up front.

怪异←思 2024-10-15 09:19:09

There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).
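
One restructuring that targets this memory traffic, offered as a sketch rather than a verified speedup, is to interchange the loops so that each output pair is loaded once, accumulated in a register across all contributing table rows, and stored once. For one pass over one array this replaces 100 load/load/store triples with 100 table loads plus 19 loads and 19 stores of the result. The name comb_pass_gather, the odd parameter, and the flat row-major table pointer mulpre are assumptions made for the example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* One pass for a fixed shift l over one result array.
   odd = 0 uses elm1[i] (res1 in the question), odd = 1 uses elm1[i + 1] (res2).
   mulpre is a 16 x 20 row-major table; res and mulpre must be 16-byte aligned. */
static void comb_pass_gather(uint64_t res[40], const uint64_t elm1[20],
                             const uint64_t *mulpre, int l, int odd)
{
    const uint64_t *row[10];

    assert(((uintptr_t)res & 15) == 0 && ((uintptr_t)mulpre & 15) == 0);

    /* Look up the ten table rows selected by this pass's nibbles. */
    for (int t = 0; t < 10; t++)
        row[t] = mulpre + 20 * ((elm1[2 * t + odd] >> l) & 15);

    /* Only words 0..37 receive contributions (largest i + k is 18 + 18). */
    for (int j = 0; j < 38; j += 2) {
        __m128i acc = _mm_load_si128((const __m128i *)(res + j));
        int t_lo = j > 18 ? (j - 18) / 2 : 0;  /* keeps k = j - 2t <= 18 */
        int t_hi = j < 18 ? j / 2 : 9;         /* keeps k = j - 2t >= 0  */
        for (int t = t_lo; t <= t_hi; t++)
            acc = _mm_xor_si128(acc,
                    _mm_load_si128((const __m128i *)(row[t] + (j - 2 * t))));
        _mm_store_si128((__m128i *)(res + j), acc);
    }
}
```

Because XOR is associative and commutative, the result matches the original loop order. This differs from the approach in the question's EDIT: an array of __m128i indexed at run time generally lives in memory rather than in registers, and x86-64 without AVX-512 has only 16 XMM registers, fewer than the 20 needed to hold one 40-word result array. Here a single accumulator is live at a time.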

明媚如初 2024-10-15 09:19:09
l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            _mm_stream_si128 ((__m128i *)&res1[i + k],
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
                                    _mm_load_si128 ((__m128i *) &res1[i + k])
                                   ));  

            _mm_stream_si128 ((__m128i *)&res2[i + k],    
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *)&_mulpre[u2][k]), 
                                    _mm_load_si128 ((__m128i *)&res2[i + k])
                                   ));  
        } 
    }
    l -= 4;
    /* All res1, res2 values are shifted left by 4 bits here. */
}
  1. Remember that you are using intrinsics; using fewer __m128i/__m128 temporaries may speed up your program.
  2. Try _mm_stream_si128(); it might speed up the stores.
  3. Try prefetching.