Why doesn't my processor have built-in BigInt support?
As far as I understand it, BigInts are usually implemented in most programming languages as arrays containing digits, where, e.g., when adding two of them, each digit is added one after another, like we know it from school, e.g.:
246
816
* *
----
1062
Where * marks that there was an overflow. I learned it this way at school, and all the BigInt adding functions I've implemented work similarly to the example above.
So we all know that our processors can only natively manage ints from 0 to 2^32 − 1 / 2^64 − 1.
That means that most scripting languages, in order to be high-level and offer arithmetic with big integers, have to implement/use BigInt libraries that work with integers as arrays, like above.
But of course this means that they'll be far slower than the processor.
So what I've asked myself is:
- Why doesn't my processor have a built-in BigInt function?
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Seems like a fine idea to me, so why isn't there something like that?
Comments (8)
BigInt: the fundamental function required is:
Unsigned integer multiplication, adding in the previous high-order half.
I wrote one in Intel 16-bit assembler, then 32-bit...
C code is usually fast enough... i.e. for BigInt you use a software library.
CPUs (and GPUs) are not designed with unsigned integers as a top priority.
If you want to write your own BigInt...
Division is done via Knuth's Vol. 2 (it's a bunch of multiplies and subtracts, with some tricky add-backs).
Add with carry and subtract are easier, etc.
I just posted this on an Intel forum:
SSE4: is there a BigInt library?
The i5 2410M processor, I suppose, can NOT use AVX [AVX is only on very recent Intel CPUs], but can use SSE4.2.
Is there a BigInt library for SSE?
I guess I am looking for something that implements unsigned integer PMULUDQ (with 128-bit operands):
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
and does the carries.
It's a laptop, so I can't buy an NVIDIA GTX 550, which isn't so grand on unsigned ints, I hear.
There are simply too many issues that require the processor to deal with a ton of stuff which isn't its job.
Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
But what would happen if the result of a BigInt operation exceeded the amount of space reserved?
There are two options:
1) It has to fit the result into the amount of space that was reserved beforehand,
or
2) It has to grow the result, taking more memory.
The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.
If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old location. How would it know which values were pointers to that location, and which were simply integers that happened to contain the same value as the memory address in question?
Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arithmetic operations.
Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)
e.g. if x86 provided rep adc to do src += dst, taking 2 pointers and a length as input (like rep movsd for memcpy), it would still be implemented as a loop in microcode.

It would be possible for a 32-bit x86 CPU to have an internal implementation of rep adc that used 64-bit adds internally, since 32-bit CPUs probably still have a 64-bit adder. However, 64-bit CPUs probably don't have a single-cycle-latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64-bit CPU.

Maybe a special wide-add instruction would be useful on a low-power, low-clock-speed CPU where a really wide adder with single-cycle latency is possible.
The x86 instructions you're looking for are:
- adc: add with carry / sbb: subtract with borrow
- mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
- div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.

Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64-bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb".) (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)

imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essentially unsigned.

ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.

In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See the example below.

Most high-level languages (including C/C++) don't expose a "carry" flag.
Usually there aren't add-with-carry intrinsics directly in C. BigInteger libraries usually have to be written in asm for good performance.
However, Intel actually has defined intrinsics for adc (and adcx / adox).

So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.

IDK what the point of the _addcarryx intrinsics is, instead of just having the compiler use adcx / adox for the existing _addcarry_u64 intrinsic when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.

Here's an example of a BigInteger add function, in NASM syntax:
On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but it's not worth it on SnB-family.

Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, especially if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add / adc to do a uint128_t on x86-64.)

adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory operands wouldn't micro-fuse.)

According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of the ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.

Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.
Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.

Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
It's not really helpful to use SSE for add-with carry, or AFAIK any other BigInteger operations.
If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. That thread has some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
Suppose the result of the multiplication needed 3 times the space (memory) to be stored - where would the processor store that result? How would users of that result, including all pointers to it, know that its size suddenly changed? And changing the size might require relocating it in memory, because extending the current location would clash with another variable.

This would create a lot of interaction between the processor, OS memory management, and the compiler, and that would be hard to make both general and efficient.
Managing the memory of application types is not something the processor should do.
I think the main idea behind not including bigint support in modern processors is the desire to reduce the ISA and leave as few instructions as possible, each fetched, decoded and executed at full throttle.
By the way, in x86-family processors there is a set of instructions that makes writing a big-int library a single day's work.
Another reason, I think, is price. It's much more efficient to save some space on the wafer by dropping redundant operations that can easily be implemented at a higher level.
It seems Intel is adding (or had added, as of the time of this post - 2015) new instruction support for large-integer arithmetic.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
There are so many instructions and functionalities jockeying for area on a CPU chip that, in the end, those that are used more often or deemed more useful push out those that aren't. The instructions necessary for implementing BigInt functionality are already there, and the math is straightforward.