Why doesn't my processor have built-in BigInt support?
As far as I understand it, BigInts are usually implemented in most programming languages as arrays containing digits, where, e.g., when adding two of them, each digit is added one after another, like we know it from school, e.g.:
246
816
* *
----
1062
Where * marks that there was an overflow. I learned it this way at school, and all the BigInt adding functions I've implemented work similarly to the example above.
So we all know that our processors can only natively manage ints from 0 to 2^32 − 1 / 2^64 − 1.
That means that most scripting languages, in order to be high-level and offer arithmetic with big integers, have to implement/use BigInt libraries that work with integers as arrays, like above.
But of course this means that they'll be far slower than the processor.
So what I've asked myself is:
- Why doesn't my processor have a built-in BigInt function?
It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.
Seems like a fine idea to me, so why isn't there something like that?
Comments (8)
BigInt: the fundamental function required is:
Unsigned integer multiplication, adding in the previous high-order half.
I wrote one in Intel 16-bit assembler, then 32-bit...
C code is usually fast enough... i.e. for BigInt you use a software library.
CPUs (and GPUs) are not designed with unsigned integers as a top priority.
If you want to write your own BigInt...
Division is done via Knuth's Vol. 2 (it's a bunch of multiplies and subtracts, with some tricky add-backs).
Add with carry and subtract are easier, etc.
I just posted this on an Intel forum:
SSE4: is there a BigInt library?
The i5 2410M processor, I suppose, can NOT use AVX [AVX is only on very recent Intel CPUs], but can use SSE4.2.
Is there a BigInt library for SSE?
I guess I am looking for something that implements unsigned integer PMULUDQ (with 128-bit operands):
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)
and does the carries.
It's a laptop, so I can't buy an NVIDIA GTX 550, which isn't so grand on unsigned ints, I hear.
There are simply too many issues that require the processor to deal with a ton of stuff which isn't its job.
Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
But what would happen if the result of a BigInt operation exceeded the amount of space reserved?
There are two options:
1) It has to fit the result into the amount of space that was reserved beforehand,
or
2) It has to grow the result, taking more memory.
The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.
If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old location. How would it know which values were pointers to that location, and which were simply integers that happened to contain the same value as the memory address in question?
Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arithmetic operations.
Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)
e.g. if x86 provided rep adc to do src += dst, taking 2 pointers and a length as input (like rep movsd for memcpy), it would still be implemented as a loop in microcode.

It would be possible for a 32-bit x86 CPU to have an internal implementation of rep adc that used 64-bit adds internally, since 32-bit CPUs probably still have a 64-bit adder. However, 64-bit CPUs probably don't have a single-cycle-latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64-bit CPU.

Maybe a special wide-add instruction would be useful on a low-power, low-clock-speed CPU where a really wide adder with single-cycle latency is possible.
The x86 instructions you're looking for are:
- adc: add with carry / sbb: subtract with borrow
- mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
- div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.

Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64-bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb".) (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)

imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essentially unsigned.

ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.

In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See the example below.

Most high-level languages (including C/C++) don't expose a "carry" flag.
Usually there aren't add-with-carry intrinsics directly in C. BigInteger libraries usually have to be written in asm for good performance.
However, Intel actually has defined intrinsics for adc (and adcx / adox).

So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.

IDK what the point of the _addcarryx intrinsics is, instead of just having the compiler use adcx / adox for the existing _addcarry_u64 intrinsic when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.

Here's an example of a BigInteger add function, in NASM syntax:
On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but it's not worth it on SnB-family.

Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, especially if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add / adc to do a uint128_t on x86-64.)

adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory operands wouldn't micro-fuse.)

According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of the ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.

Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.
Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.

Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
It's not really helpful to use SSE for add-with carry, or AFAIK any other BigInteger operations.
If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. That thread has some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
Suppose the result of the multiplication needed 3 times the space (memory) to be stored - where would the processor store that result? How would users of that result, including all pointers to it, know that its size suddenly changed? And changing the size might require relocating it in memory, because extending the current location would clash with another variable.

This would create a lot of interaction between the processor, OS memory management, and the compiler, and that would be hard to make both general and efficient.
Managing the memory of application types is not something the processor should do.
I think the main idea behind not including bigint support in modern processors is the desire to reduce the ISA and leave as few instructions as possible, each fetched, decoded and executed at full throttle.
By the way, in x86-family processors there is a set of instructions that makes writing a big-int library a single day's work.
Another reason, I think, is price. It's much more efficient to save some space on the wafer by dropping redundant operations that can easily be implemented at a higher level.
It seems Intel is adding (or had added, as of the time of this post - 2015) new instruction support for large-integer arithmetic.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
There are so many instructions and functionalities jockeying for area on a CPU chip that, in the end, those that are used more often or deemed more useful push out those that aren't. The instructions necessary for implementing BigInt functionality are already there, and the math is straightforward.