Optimizing x64 assembler MUL loop
I'm writing math code that needs to multiply large numbers fast. The work breaks down to multiplications of an array of integers by a single integer. In C++ this looks like the following (for the unsigned case):
void muladd(unsigned* r, const unsigned* a, unsigned len, unsigned b) {
    unsigned __int64 of = 0;  // overflow
    unsigned i = 0;           // loop variable
    while (i < len) {
        of += (unsigned __int64)a[i] * b + r[i];
        r[i] = (unsigned)of;
        of >>= 32;
        ++i;
    }
    r[i] = (unsigned)of;  // save overflow
}
I unrolled this loop manually, converted it to 64-bit, and worked on the compiler's .asm output to optimize it further. The main .asm loop now looks like this:
mov rax, rdi                       ; rdi = b
mul QWORD PTR [rbx+r10*8-64]       ; rdx:rax = a[i] * b; r10 = i
mov rsi, QWORD PTR [r14+r10*8-64]  ; r14 = r; rsi = r[i]
add rax, rsi                       ; add r[i]
adc rdx, 0
add rax, r11                       ; r11 = of (low part)
adc rdx, 0
mov QWORD PTR [r14+r10*8-64], rax  ; save result
mov r11, rdx                       ; of for the next limb
; this block repeats 8 times with different offsets
When I benchmark this, I find that it takes about 6.3 cycles per multiplication on average on my Core2 Quad.
My question is: can I speed this up somehow? Unfortunately, I see no way to avoid one of the additions, and the multiplication always needs RDX:RAX, so I need to move the data around and cannot somehow "multiply in parallel".
Any ideas anyone?
Update:
After some more testing, I've managed to bring the speed up to about 5.4 cycles per 64-bit MUL (that includes all add, move, and loop overhead). I guess this is about the best you can get on a Core2, since the Core2 does not have a very fast MUL instruction: it has a reciprocal throughput of 3 and a latency of 6 (or 7) cycles. Sandy Bridge will be much better, with a reciprocal throughput of 1 and a latency of 3 (or 4) cycles.
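To put those numbers in perspective: the unrolled block issues 8 MULs, so at a reciprocal throughput of 3 cycles the MULs alone occupy at least 8 × 3 = 24 cycles per block, i.e. 3 cycles per limb is the hard floor on Core2:

    lower bound per 64-bit MUL:  3.0 cycles  (MUL throughput alone)
    measured per 64-bit MUL:     5.4 cycles  (all overhead included)
    difference:                 ~2.4 cycles  (adds, moves, loop control)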
Regarding the much lower number for GMP: I got it from their source code, and it seems to me that it is a theoretical number. But what's certain is that it was calculated for an AMD K9 CPU, and from what I have read, AMD's MUL unit is faster than the one in (older) Intel chips.
Answers (4)
I once wrote a loop rather like this, with a minimal amount of processing on a lot of data, and the result was that the loop was limited by memory speed.

I'd try prefetching a[i] and r[i]: if you're using gcc, use the function __builtin_prefetch(); in assembler, use the PREFETCHT0 instruction.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

When this works, the results can be dramatic. As long as the loop is a thousand or so iterations long, I'd prefetch a[i+64] and r[i+64] as a starting point and see how much difference it makes on your CPU. You may need to try larger prefetch distances.
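A sketch of what this could look like in the C++ version (the +64 distance is just the starting point suggested above, and prefetching is only a hint, so running past the end of the arrays is harmless; __builtin_prefetch's second argument selects a read (0) or write (1) prefetch):

void muladd_prefetch(unsigned* r, const unsigned* a, unsigned len, unsigned b) {
    unsigned long long of = 0;  // overflow
    for (unsigned i = 0; i < len; ++i) {
        __builtin_prefetch(&a[i + 64]);     // read prefetch (PREFETCHT0)
        __builtin_prefetch(&r[i + 64], 1);  // prefetch for write
        of += (unsigned long long)a[i] * b + r[i];
        r[i] = (unsigned)of;
        of >>= 32;
    }
    r[len] = (unsigned)of;
}

A tuned version would issue one prefetch per 64-byte cache line (every 16 limbs here) rather than one per iteration.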
I'd just like to point out that cycle counting is rather useless, as your instructions will be converted to micro-ops that are executed out of order or stalled depending on everything else the CPU is doing. If you have a fast routine, which you do, it's not really fruitful to try to shave off a theoretical cycle unless you know your routine will always run in complete isolation.
Does r contain anything important before the call?

If it does, and you're accumulating onto it, then stop reading now.

If it doesn't (i.e. you're always accumulating onto zeros), and assuming you're invoking this function on arrays significantly larger than cache sizes, then I'd look for a way to eliminate the need to read from r, and convert the "save result" MOV into a MOVNT (_mm_stream_ps in intrinsics). This can significantly improve performance. How? Currently your caches are fetching cache lines from a, fetching cache lines from r, and writing cache lines back to r. With so-called streaming stores you'd just fetch cache lines from a and write straight through to r: much less bus traffic. If you look at any modern CRT's memcpy implementation, it will switch to using streaming stores above some cache-size-related threshold (and run almost twice as fast as a memcpy using conventional moves).
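A sketch under those assumptions. Since the store in the asm loop above is a 64-bit GPR store, the matching non-temporal intrinsic would be _mm_stream_si64 (MOVNTI) rather than the four-float _mm_stream_ps, and the __int128 multiply is a gcc/clang stand-in for the MUL instruction:

#include <emmintrin.h>  // _mm_stream_si64, _mm_sfence (x64 only)

// r is never read here (it is assumed to be all zeros), so its cache
// lines are never fetched: results go straight to memory via MOVNTI.
void mul_stream(unsigned long long* r, const unsigned long long* a,
                unsigned len, unsigned long long b) {
    unsigned long long of = 0;  // carry
    for (unsigned i = 0; i < len; ++i) {
        unsigned __int128 p = (unsigned __int128)a[i] * b + of;
        _mm_stream_si64((long long*)&r[i], (long long)p);  // non-temporal store
        of = (unsigned long long)(p >> 64);
    }
    _mm_stream_si64((long long*)&r[len], (long long)of);   // final carry
    _mm_sfence();  // make the streaming stores visible before r is read
}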
It looks like your routine could benefit from SSE. PMULLD and PADDD seem like the relevant instructions. Not sure why your compiler does not produce SSE from that.
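For reference, a sketch of those two instructions via intrinsics. Note it is a fragment, not a drop-in replacement: PMULLD keeps only the low 32 bits of each product and PADDD has no carry out, so the carry chain that muladd needs would still require scalar handling around it:

#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32 (PMULLD)

// Lane-wise low-32-bit multiply-add of four limbs at once; no carries.
void muladd4_low(unsigned* r, const unsigned* a, unsigned b) {
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vr = _mm_loadu_si128((const __m128i*)r);
    __m128i vb = _mm_set1_epi32((int)b);
    __m128i lo = _mm_mullo_epi32(va, vb);                  // PMULLD
    _mm_storeu_si128((__m128i*)r, _mm_add_epi32(lo, vr));  // PADDD
}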