SIMD optimization puzzle
I want to optimize the following function using SIMD (SSE2 etc.):
#include <stdint.h>

/* Sum of (N / i) * p[i] for i = 1 .. size - 1. */
int64_t fun(int64_t N, int size, int* p)
{
    int64_t sum = 0;
    for (int i = 1; i < size; i++)
        sum += (N / i) * p[i];
    return sum;
}
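This seems like a very vectorizable task, except that the instructions it needs do not exist...
We can assume that N is very large (10^12 to 10^18) and that size ~ sqrt(N). We can also assume that p only takes the values -1, 0, and 1, so we don't need a real multiplication: if we could somehow compute N/i, then (N/i)*p[i] can be done with four instructions (pcmpgt, pxor, psub, pand).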
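The multiplication-free step can be sketched as follows. This is only an illustration under stated assumptions: the helper name signed_pair_sum is hypothetical, it handles two quotients at a time, and it needs a few more operations than the four named in the question, because the 32-bit comparison masks have to be widened to 64 bits and a separate "non-zero" mask has to be built. A scalar fallback with the same mask arithmetic is included for targets without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define HAVE_SSE2_SKETCH 1
#endif

/* Hypothetical helper: returns q[0]*p[0] + q[1]*p[1] without multiplying,
   assuming each p[k] is -1, 0, or 1. */
static int64_t signed_pair_sum(const int64_t q[2], const int p[2])
{
#ifdef HAVE_SSE2_SKETCH
    __m128i vq   = _mm_loadu_si128((const __m128i *)q);
    __m128i vp   = _mm_loadl_epi64((const __m128i *)p); /* two 32-bit signs */
    __m128i zero = _mm_setzero_si128();

    /* pcmpgtd: all-ones lanes where p < 0, and where p != 0. */
    __m128i neg32 = _mm_cmpgt_epi32(zero, vp);
    __m128i nz32  = _mm_or_si128(neg32, _mm_cmpgt_epi32(vp, zero));

    /* Widen each 32-bit mask to 64 bits by interleaving it with itself. */
    __m128i neg = _mm_unpacklo_epi32(neg32, neg32);
    __m128i nz  = _mm_unpacklo_epi32(nz32, nz32);

    /* (q ^ neg) - neg negates q where p < 0 (pxor, psubq);
       the final pand zeroes the lanes where p == 0. */
    __m128i r = _mm_sub_epi64(_mm_xor_si128(vq, neg), neg);
    r = _mm_and_si128(r, nz);

    int64_t out[2];
    _mm_storeu_si128((__m128i *)out, r);
    return out[0] + out[1];
#else
    /* Scalar fallback using the same mask arithmetic. */
    int64_t sum = 0;
    for (int k = 0; k < 2; k++) {
        int64_t neg = -(int64_t)(p[k] < 0);
        int64_t nz  = -(int64_t)(p[k] != 0);
        sum += ((q[k] ^ neg) - neg) & nz;
    }
    return sum;
#endif
}
```

The quotients themselves still have to come from somewhere; this sketch covers only the sign handling.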
This is as close as I could get to vectorizing that code. I don't really expect it to be faster; I was just trying my hand at writing SIMD code.
The derivative of 1/x is -1/x^2, which means that as x gets bigger, N/x changes more and more slowly, so under integer division N/x == N/(x + 1) for more and more consecutive values of x.
For a known value of N/x (call it r), we can determine the next value of x (call it x') such that N/x' < r. Since we are dealing with integers, x' is an integer too, so every i from x to x' - 1 shares the same quotient r, and the loop needs only one division per run of identical quotients.
For sufficiently large N, you will have many, many runs of identical values for N/i. Granted, you will hit a divide by zero if you aren't careful.
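One way to turn the run-of-identical-quotients idea into code is sketched below. It is not the answer's original loop, which did not survive on this page. It assumes N >= 0 and uses the fact that, for r = N/i with r > 0, the first index with a smaller quotient is N/r + 1. The function name fun_runs is hypothetical.

```c
#include <stdint.h>

/* Same result as fun(), assuming N >= 0: the quotient is computed once per
   run of equal values of N / i, and the p[j] in that run are summed first. */
int64_t fun_runs(int64_t N, int size, const int *p)
{
    int64_t sum = 0;
    int64_t i = 1;
    while (i < size) {
        int64_t r = N / i;
        if (r == 0)
            break;                 /* all remaining terms are zero; this also
                                      avoids dividing by r == 0 below */
        int64_t next = N / r + 1;  /* first index whose quotient is below r */
        if (next > size)
            next = size;
        int64_t run = 0;
        for (int64_t j = i; j < next; j++)
            run += p[j];           /* additions only inside a run */
        sum += r * run;
        i = next;
    }
    return sum;
}
```

Note that for i(i + 1) <= N, consecutive quotients always differ, so with size ~ sqrt(N) every run has length one. The saving only appears once size extends well past sqrt(N).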
I suggest you do this with floating point SIMD operations - either single or double precision depending on your accuracy requirements. Conversion from int to float or double is relatively fast using SSE.
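A scalar sketch of the arithmetic each lane would need is given below. It is an illustration, not a SIMD implementation. It assumes 0 <= N <= 10^18 and 1 <= i < 2^31, and the name div_via_double is hypothetical. A double has only 53 bits of mantissa, so the floating-point quotient can be off for large N; the sketch corrects it with an exact integer remainder. Also note that SSE2 has no packed conversion between 64-bit integers and doubles, so those conversions would have to be done per element.

```c
#include <stdint.h>

/* Hypothetical helper: N / i computed through a double-precision division,
   then corrected with an exact integer remainder.
   Assumes 0 <= N <= 10^18 and 1 <= i < 2^31. */
static int64_t div_via_double(int64_t N, int64_t i)
{
    /* Approximate quotient; may be slightly off when N / i needs
       more than 53 bits. */
    int64_t q = (int64_t)((double)N / (double)i);

    /* Exact remainder in 64-bit integer arithmetic. */
    int64_t r = N - q * i;

    /* Nudge q until 0 <= r < i. For quotients below 2^51 this takes at most
       one step; for very small i it can take a few hundred. */
    while (r < 0)  { q--; r += i; }
    while (r >= i) { q++; r -= i; }
    return q;
}
```

Whether this beats a plain integer division depends on the hardware and on how often the correction loops run, so it would need measuring.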
The cost is concentrated in computing the divisions. SSE2 has no opcode for integer division, so you would have to implement a division algorithm yourself, bit by bit. I do not think it would be worth the effort: SSE2 lets you perform two instances in parallel (you use 64-bit numbers, and SSE2 registers are 128 bits wide), but I find it likely that a handmade division algorithm would be at least twice as slow as the CPU idiv opcode.
(By the way, do you compile in 32-bit or 64-bit mode? The latter will be more comfortable with 64-bit integers.)
Reducing the overall number of divisions looks like a more promising approach. Note that for positive integers x and y, floor(x/(2y)) = floor(floor(x/y)/2). In C terms, once you have computed N/i (truncated division), you just have to shift it right by one bit to obtain N/(2*i). Used properly, this makes half of your divisions almost free ("properly" also includes accessing the billions of p[i] values in a way that does not wreak havoc with the caches, so it does not seem very easy).
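The shift identity can be sketched as follows, assuming N >= 0 (a right shift of a negative value does not match truncated division). The function name fun_shift is hypothetical. Every index can be written uniquely as an odd number times a power of two, so the sketch divides only for odd i and derives the quotients for i*2, i*4, ... by shifting.

```c
#include <stdint.h>

/* Same result as fun(), assuming N >= 0: one division per odd index,
   with the quotients for its even multiples obtained by right shifts. */
int64_t fun_shift(int64_t N, int size, const int *p)
{
    int64_t sum = 0;
    for (int64_t i = 1; i < size; i += 2) {
        int64_t q = N / i;              /* the only real division */
        for (int64_t j = i; j < size; j *= 2) {
            sum += q * p[j];
            q >>= 1;                    /* N / (2*j) == (N / j) >> 1 */
        }
    }
    return sum;
}
```

The inner loop strides through p in doubling steps, which is exactly the cache-unfriendly access pattern the answer warns about; a practical version would need to block or reorder these accesses.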