How to make the following code faster
int u1, u2;
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40];  /* each element is 64 bits long */
/* res1, res2 initialized to zero. */
l = 60;
while (l)
{
    for (i = 0; i < 20; i += 2)
    {
        u1 = (elm1[i] >> l) & 15;
        u2 = (elm1[i + 1] >> l) & 15;
        for (k = 0; k < 20; k += 2)
        {
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);
            simdb = _mm_xor_si128 (simda, simdb);
            _mm_store_si128 ((__m128i *) &res1[i + k], simdb);
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u2][k]);
            simdb = _mm_load_si128 ((__m128i *) &res2[i + k]);
            simdb = _mm_xor_si128 (simda, simdb);
            _mm_store_si128 ((__m128i *) &res2[i + k], simdb);
        }
    }
    l -= 4;
    /* All res1, res2 values are left shifted by 4 bits. */
}
The code above is called many times in my program (the profiler shows 98%).
EDIT: In the inner loop, res1[i + k] is loaded many times for the same value of (i + k). To avoid this, inside the while loop I loaded all the res1 values into an array of SIMD registers and updated those array elements in the innermost for loop. Once both for loops were done, I stored the array values back to res1 and res2. But the computation time increased with this change. Any idea where I went wrong? The idea seemed correct.
Any suggestions to make it faster are welcome.
Comments (4)
Unfortunately the most obvious optimisations are probably already being done by the compiler:
- Pulling &_mulpre[u1] and &_mulpre[u2] out of the inner loop.
- Pulling &res1[i] out of the inner loop.
Possibly swapping the outer loops would improve cache locality on elm1.
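As an illustration of pulling the address computations out of the inner loop, here is a minimal sketch of one pass of the two for loops for a fixed shift l. It is written in portable scalar C rather than with the SSE intrinsics from the question, and the function name xor_pass_hoisted, the constant N, and the table parameter name mulpre are assumptions made for the example. As the answer says, an optimising compiler may already do this on its own.

```c
#include <stdint.h>

enum { N = 20 };  /* number of 64-bit words in elm1 (from the question) */

/* One pass of the two for loops for a fixed shift l.
   The table rows and the output base pointers are computed once per i,
   so the inner loop only indexes with k.
   Hypothetical helper: the name and signature are not from the question. */
void xor_pass_hoisted(const uint64_t elm1[N], uint64_t mulpre[16][N],
                      uint64_t res1[2 * N], uint64_t res2[2 * N], int l)
{
    for (int i = 0; i < N; i += 2) {
        /* Hoisted: &_mulpre[u1] and &_mulpre[u2] */
        const uint64_t *row1 = mulpre[(elm1[i] >> l) & 15];
        const uint64_t *row2 = mulpre[(elm1[i + 1] >> l) & 15];
        /* Hoisted: &res1[i] and &res2[i] */
        uint64_t *out1 = &res1[i];
        uint64_t *out2 = &res2[i];

        for (int k = 0; k < N; k++) {
            out1[k] ^= row1[k];
            out2[k] ^= row2[k];
        }
    }
}
```

The caller would still run this once per value of l and shift res1 and res2 left by 4 bits between passes, as in the question.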
Well, you could always call it fewer times :-)
The total input and output data looks relatively small. Depending on your design and expected input, it might be feasible to just cache computations, or to evaluate lazily instead of up front.
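To make the caching suggestion concrete, here is a minimal sketch of a single-entry cache that remembers the most recent input and its results. Everything in it is hypothetical: the names result_cache, compute_fn and compute_cached are invented for the example, toy_compute is only a stand-in for the real routine, and the sketch assumes the _mulpre table does not change between calls. Whether it helps at all depends on how often the same elm1 actually recurs.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { N = 20 };

/* Hypothetical signature for the expensive routine from the question. */
typedef void (*compute_fn)(const uint64_t elm1[N],
                           uint64_t res1[2 * N], uint64_t res2[2 * N]);

/* Single-entry cache: the last input and the results computed for it. */
typedef struct {
    bool valid;
    uint64_t key[N];
    uint64_t res1[2 * N];
    uint64_t res2[2 * N];
} result_cache;

/* Recompute only when the input differs from the cached one. */
void compute_cached(result_cache *c, compute_fn compute,
                    const uint64_t elm1[N],
                    uint64_t res1[2 * N], uint64_t res2[2 * N])
{
    if (!c->valid || memcmp(c->key, elm1, sizeof c->key) != 0) {
        compute(elm1, c->res1, c->res2);
        memcpy(c->key, elm1, sizeof c->key);
        c->valid = true;
    }
    memcpy(res1, c->res1, sizeof c->res1);
    memcpy(res2, c->res2, sizeof c->res2);
}

/* Stand-in for the real routine, present only so the sketch runs:
   it counts its calls and copies the input into the low half of res1. */
int toy_calls = 0;
void toy_compute(const uint64_t elm1[N],
                 uint64_t res1[2 * N], uint64_t res2[2 * N])
{
    toy_calls++;
    memset(res1, 0, 2 * N * sizeof(uint64_t));
    memset(res2, 0, 2 * N * sizeof(uint64_t));
    memcpy(res1, elm1, N * sizeof(uint64_t));
}
```

A cache hit still costs a 160-byte compare and two 320-byte copies, so this only pays off if repeated inputs are common.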
There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).
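One way to trim the load and store traffic on res1 and res2, offered as an unmeasured sketch rather than a guaranteed speedup, is to reorganise a pass by output index: each output word is read once, XORed with every table entry that contributes to it, and written once. This is a smaller version of the idea in the question's EDIT. Keeping all 40 words of both arrays in 128-bit registers at once would need 40 registers, more than the 16 XMM registers available in 64-bit mode without AVX-512, which may be why that attempt got slower. The function name xor_pass_by_output and the scalar formulation are assumptions made for the example.

```c
#include <stdint.h>

enum { N = 20 };

/* One pass for a fixed shift l, organised by output index j.
   Produces the same result as the i/k loops in the question, but each
   res1[j] and res2[j] is loaded once and stored once per pass.
   Hypothetical helper: the name and signature are not from the question. */
void xor_pass_by_output(const uint64_t elm1[N], uint64_t mulpre[16][N],
                        uint64_t res1[2 * N], uint64_t res2[2 * N], int l)
{
    /* Look up the table rows once: row1[p] belongs to i = 2 * p. */
    const uint64_t *row1[N / 2], *row2[N / 2];
    for (int p = 0; p < N / 2; p++) {
        row1[p] = mulpre[(elm1[2 * p] >> l) & 15];
        row2[p] = mulpre[(elm1[2 * p + 1] >> l) & 15];
    }

    /* The highest index the original loops touch is 18 + 19 = 37. */
    for (int j = 0; j < 2 * N - 2; j++) {
        uint64_t acc1 = res1[j];
        uint64_t acc2 = res2[j];

        /* Contributing i are the even values with 0 <= j - i < N. */
        int lo = j - (N - 1);
        if (lo < 0) lo = 0;
        lo += lo & 1;                      /* round up to an even i */
        int hi = (j < N - 2) ? j : N - 2;  /* i never exceeds 18 */

        for (int i = lo; i <= hi; i += 2) {
            acc1 ^= row1[i / 2][j - i];
            acc2 ^= row2[i / 2][j - i];
        }
        res1[j] = acc1;
        res2[j] = acc2;
    }
}
```

The table loads remain, so memory access still dominates as the answer says; the sketch only removes the repeated reads and writes of the result arrays.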