Optimizing x64 assembler MUL loop
I'm writing math code that needs to multiply large numbers fast. The work breaks down to multiplications of an array of integers by a single integer. In C++ this looks like the following (for the unsigned case):
void muladd(unsigned* r, const unsigned* a, unsigned len, unsigned b) {
    unsigned __int64 of = 0;  // overflow
    unsigned i = 0;           // loop variable
    while (i < len) {
        of += (unsigned __int64)a[i] * b + r[i];
        r[i] = (unsigned)of;
        of >>= 32;
        ++i;
    }
    r[i] = (unsigned)of;  // save overflow
}
I unrolled this loop manually, converted it to 64-bit, and worked on the compiler's .asm output to optimize it further. The main .asm loop now looks like this:
mov rax, rdi                       ; rdi = b
mul QWORD PTR [rbx+r10*8-64]       ; rdx:rax = a[i] * b; r10 = i
mov rsi, QWORD PTR [r14+r10*8-64]  ; r14 = r; rsi = r[i]
add rax, rsi                       ; add r[i]
adc rdx, 0
add rax, r11                       ; r11 = of (low part)
adc rdx, 0
mov QWORD PTR [r14+r10*8-64], rax  ; save result
mov r11, rdx                       ; of for the next limb
; this block repeats 8 times with different offsets
When I benchmark this, I find that it takes about 6.3 cycles per multiplication on average on my Core2 Quad.
My question is: can I speed this up somehow? Unfortunately, I see no way to avoid one of the additions, and the multiplication always needs RDX:RAX, so I need to move the data around and cannot somehow "multiply in parallel".
Any ideas anyone?
Update:
After some more testing, I've managed to bring the speed up to about 5.4 cycles per 64-bit MUL (that includes all add, move, and loop overhead). I guess this is about the best you can get on a Core2, since the Core2 does not have a very fast MUL instruction: it has a reciprocal throughput of 3 and a latency of 6 (or 7) cycles. Sandy Bridge will be much better, with a reciprocal throughput of 1 and a latency of 3 (or 4) cycles.
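To put those numbers in perspective: the unrolled block issues 8 MULs, so at a reciprocal throughput of 3 cycles the MULs alone occupy at least 8 × 3 = 24 cycles per block, i.e. 3 cycles per limb is the hard floor on Core2:

    lower bound per 64-bit MUL:  3.0 cycles  (MUL throughput alone)
    measured per 64-bit MUL:     5.4 cycles  (all overhead included)
    difference:                 ~2.4 cycles  (adds, moves, loop control)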
Regarding the much lower number for GMP: I got it from their source code, and it seems to me that it is a theoretical number. But what's certain is that it was calculated for an AMD K9 CPU, and from what I have read, AMD's MUL unit is faster than the one in (older) Intel chips.
Answers (4)
I once wrote a loop rather like this, with a minimal amount of processing on a lot of data, and the result was that the loop was limited by memory speed.

I'd try prefetching a[i] and r[i]: if you're using gcc, use the function __builtin_prefetch(); in assembler, use the PREFETCHT0 instruction.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

When this works, the results can be dramatic. As long as the loop is a thousand or so iterations long, I'd prefetch a[i+64] and r[i+64] as a starting point and see how much difference it makes on your CPU. You may need to try larger prefetch distances.
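A sketch of what this could look like in the C++ version (the +64 distance is just the starting point suggested above, and prefetching is only a hint, so running past the end of the arrays is harmless; __builtin_prefetch's second argument selects a read (0) or write (1) prefetch):

void muladd_prefetch(unsigned* r, const unsigned* a, unsigned len, unsigned b) {
    unsigned long long of = 0;  // overflow
    for (unsigned i = 0; i < len; ++i) {
        __builtin_prefetch(&a[i + 64]);     // read prefetch (PREFETCHT0)
        __builtin_prefetch(&r[i + 64], 1);  // prefetch for write
        of += (unsigned long long)a[i] * b + r[i];
        r[i] = (unsigned)of;
        of >>= 32;
    }
    r[len] = (unsigned)of;
}

A tuned version would issue one prefetch per 64-byte cache line (every 16 limbs here) rather than one per iteration.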
I'd just like to point out that cycle counting is rather useless, as your instructions will be converted to micro-ops that are executed out of order or stalled depending on everything else the CPU is doing. If you have a fast routine, which you do, it's not really fruitful to try to shave off a theoretical cycle unless you know your routine will always run in complete isolation.
Does r contain anything important before the call?

If it does, and you're accumulating onto it, then stop reading now.

If it doesn't (i.e. you're always accumulating onto zeros), and assuming you're invoking this function on arrays significantly larger than cache sizes, then I'd look for a way to eliminate the need to read from r, and convert the "save result" MOV into a MOVNT (_mm_stream_ps in intrinsics). This can significantly improve performance. How? Currently your caches are fetching cache lines from a, fetching cache lines from r, and writing cache lines back to r. With so-called streaming stores you'd just fetch cache lines from a and write straight through to r: much less bus traffic. If you look at any modern CRT's memcpy implementation, it will switch to using streaming stores above some cache-size-related threshold (and run almost twice as fast as a memcpy using conventional moves).
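A sketch under those assumptions. Since the store in the asm loop above is a 64-bit GPR store, the matching non-temporal intrinsic would be _mm_stream_si64 (MOVNTI) rather than the four-float _mm_stream_ps, and the __int128 multiply is a gcc/clang stand-in for the MUL instruction:

#include <emmintrin.h>  // _mm_stream_si64, _mm_sfence (x64 only)

// r is never read here (it is assumed to be all zeros), so its cache
// lines are never fetched: results go straight to memory via MOVNTI.
void mul_stream(unsigned long long* r, const unsigned long long* a,
                unsigned len, unsigned long long b) {
    unsigned long long of = 0;  // carry
    for (unsigned i = 0; i < len; ++i) {
        unsigned __int128 p = (unsigned __int128)a[i] * b + of;
        _mm_stream_si64((long long*)&r[i], (long long)p);  // non-temporal store
        of = (unsigned long long)(p >> 64);
    }
    _mm_stream_si64((long long*)&r[len], (long long)of);   // final carry
    _mm_sfence();  // make the streaming stores visible before r is read
}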
It looks like your routine could benefit from SSE. PMULLD and PADDD seem like the relevant instructions. Not sure why your compiler does not produce SSE from that.
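For reference, a sketch of those two instructions via intrinsics. Note it is a fragment, not a drop-in replacement: PMULLD keeps only the low 32 bits of each product and PADDD has no carry out, so the carry chain that muladd needs would still require scalar handling around it:

#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32 (PMULLD)

// Lane-wise low-32-bit multiply-add of four limbs at once; no carries.
void muladd4_low(unsigned* r, const unsigned* a, unsigned b) {
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vr = _mm_loadu_si128((const __m128i*)r);
    __m128i vb = _mm_set1_epi32((int)b);
    __m128i lo = _mm_mullo_epi32(va, vb);                  // PMULLD
    _mm_storeu_si128((__m128i*)r, _mm_add_epi32(lo, vr));  // PADDD
}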