32-bit pointers with the x86-64 ISA: why not?
The x86-64 instruction set adds more registers and other improvements to help streamline executable code. However, in many applications the increased pointer size is a burden. The extra, unused bytes in every pointer clog up the cache and might even overflow RAM. GCC, for example, builds with the -m32 flag, and I assume this is the reason.
It's possible to load a 32-bit value and treat it as a pointer. This doesn't necessitate extra instructions: just load or compute the 32 bits and load from the resulting address. The trick won't be portable, though, as platforms have different memory maps. On Mac OS X, the entire low 4 GiB of address space is reserved. Still, for one program I wrote, hackishly adding 0x100000000L to 32-bit "addresses" before use improved performance greatly over both true 64-bit addresses and compiling with -m32.
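As a rough illustration of the trick (a sketch only, not the program mentioned above): the list node below stores its link as a 32-bit value and widens it by adding a base right before the dereference. The names Node32, g_pool, widen and narrow are invented for this example, and the base is the start of a static pool rather than a hard-coded 0x100000000, so that the sketch runs on any 64-bit platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Conventional node: the link alone costs 8 bytes on x86-64.
struct Node64 {
    Node64*      next;
    std::int32_t value;
};

// Compact node: the link is a 32-bit byte offset from the pool base.
// Offset 0 means "no next node" (slot 0 of the pool is never handed out).
struct Node32 {
    std::uint32_t next;
    std::int32_t  value;
};

// Hypothetical fixed pool standing in for "memory at a known base address".
static Node32      g_pool[1024];
static std::size_t g_used = 1;  // slot 0 is reserved as the null marker

// Widening is one add on top of the 32-bit load.
inline Node32* widen(std::uint32_t off) {
    if (off == 0) return nullptr;
    return reinterpret_cast<Node32*>(reinterpret_cast<char*>(g_pool) + off);
}

// Narrowing is one subtract; the result always fits because the pool is small.
inline std::uint32_t narrow(const Node32* p) {
    if (p == nullptr) return 0;
    return static_cast<std::uint32_t>(reinterpret_cast<const char*>(p) -
                                      reinterpret_cast<const char*>(g_pool));
}

// Push a value on the front of a list and return the new head offset.
inline std::uint32_t push_front(std::uint32_t head, std::int32_t value) {
    assert(g_used < 1024 && "pool exhausted");
    Node32* n = &g_pool[g_used++];
    n->value = value;
    n->next  = head;
    return narrow(n);
}

// Walk the list, widening each 32-bit link just before it is followed.
inline long sum(std::uint32_t head) {
    long total = 0;
    for (Node32* n = widen(head); n != nullptr; n = widen(n->next)) {
        total += n->value;
    }
    return total;
}
```

On a typical x86-64 build Node64 occupies 16 bytes and Node32 occupies 8, so twice as many nodes fit in each cache line.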
Is there any fundamental impediment to having a 32-bit, x86-64 platform? I suppose that supporting such a chimera would add complexity to any operating system, and anyone wanting that last 20% should just Make it Work™, but it still seems that this would be the best fit for a variety of computationally intensive programs.
Comments (4)
There is an ABI called "x32" in development for Linux. It's a mix between x86_64 and ia32, similar to what you describe: a 32-bit address space while using the full 64-bit register set. It needs a custom kernel, binutils and gcc.
Some SPEC runs indicate a performance improvement of about 30% in some benchmarks. See further information at https://sites.google.com/site/x32abi/
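To check which model a given build actually uses, a small sketch like the following may help. It relies on GCC defining both __x86_64__ and __ILP32__ when targeting x32; the names Model, data_model and pointer_bytes are made up for this example.

```cpp
#include <cstddef>

// Hypothetical labels for the build's data model.
enum class Model { X32, X86_64, Other };

constexpr Model data_model() {
#if defined(__x86_64__) && defined(__ILP32__)
    return Model::X32;     // 64-bit registers and instructions, 32-bit pointers
#elif defined(__x86_64__)
    return Model::X86_64;  // ordinary x86-64 build with 64-bit pointers
#else
    return Model::Other;   // plain ia32 or a different architecture
#endif
}

constexpr std::size_t pointer_bytes() { return sizeof(void*); }
```

Built with gcc -mx32 on a toolchain and kernel that support the ABI, data_model() should report Model::X32 and pointer_bytes() should be 4; a normal 64-bit build reports Model::X86_64 and 8.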
Yes, you can limit the program to use only the first 2/4 GB of the address space, or use a 64-bit base with a 32-bit (or smaller) offset.
As Mysticial commented above, ICC can even do that automatically. It has the -auto-ilp32 (or /Qauto-ilp32) option to use 32-bit pointers in 64-bit mode if applicable.
But if you don't have access to ICC, or want more control over the generated code, then on Linux there's the x32 ABI, as others have mentioned.
On Windows there's no x32 ABI like on Linux, but you can still use 32-bit pointers by disabling the /LARGEADDRESSAWARE flag, which is enabled for x86-64 binaries by default.
Of course there's no direct compiler support like the -mx32 option in GCC, so you may need to deal with pointers manually every time you store a pointer to memory or dereference it. The simplest solution is to write a class wrapping a 32-bit pointer to handle that. Luckily, MS also has experience with mixing 32-bit and 64-bit pointers in the same architecture, so they have lots of supporting keywords/macros:
POINTER_32 / __ptr32
POINTER_64 / __ptr64
POINTER_SIGNED / __sptr
POINTER_UNSIGNED / __uptr
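A minimal sketch of such a wrapper class might look like the following. Ptr32 and g_ptr32_base are names invented for this example, not anything provided by MSVC. With the base left at 0 the class is plain truncation, which is only valid when every address the program uses lies below 4 GB; with a non-zero base it stores an offset into a 4 GB window instead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical process-wide base. 0 means "addresses already fit in 32 bits".
static std::uintptr_t g_ptr32_base = 0;

// Stores a T* in 4 bytes and widens it again on every access.
template <typename T>
class Ptr32 {
public:
    Ptr32() = default;
    Ptr32(T* p) : bits_(compress(p)) {}

    T* get() const {
        return bits_ != 0 ? reinterpret_cast<T*>(g_ptr32_base + bits_) : nullptr;
    }
    T& operator*() const { return *get(); }
    T* operator->() const { return get(); }
    explicit operator bool() const { return bits_ != 0; }

private:
    static std::uint32_t compress(T* p) {
        if (p == nullptr) return 0;
        std::uintptr_t delta = reinterpret_cast<std::uintptr_t>(p) - g_ptr32_base;
        // 0 is reserved for null, so an object exactly at the base cannot be stored.
        assert(delta != 0 && delta <= UINT32_MAX && "pointer outside the 32-bit window");
        return static_cast<std::uint32_t>(delta);
    }

    std::uint32_t bits_ = 0;  // compressed representation, 0 == nullptr
};
```

Data structures would then declare their links as Ptr32<Node> instead of Node*, keeping the widening and narrowing in one place rather than scattered through the code.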
You can also force all memory allocations to happen below the 4 GB mark and handle everything manually.
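On Linux, one way to sketch this is mmap with the MAP_32BIT flag, which exists only for 64-bit x86-64 programs and places the mapping in the first 2 GB of the address space. The helper names alloc_low and fits_in_32_bits are invented for this example, and on any other platform the helper simply reports failure.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__linux__) && defined(__x86_64__)
#include <sys/mman.h>
#endif

// Try to get `bytes` of zeroed memory whose address fits in 32 bits.
// Returns nullptr if the platform has no such request or the request fails.
inline void* alloc_low(std::size_t bytes) {
#if defined(__linux__) && defined(__x86_64__) && defined(MAP_32BIT)
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
#else
    (void)bytes;  // no low-memory request available in this sketch
    return nullptr;
#endif
}

// True if the address can be stored in a 32-bit integer without loss.
inline bool fits_in_32_bits(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) <= UINT32_MAX;
}
```

A custom allocator could carve objects out of regions obtained this way, after which truncating a pointer to 32 bits and zero-extending it again round-trips safely.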
Anyway, limiting yourself to the first 2/4 GB of the address space might not be feasible, because of the lack of memory or the reduced effectiveness of ASLR. You can tell the OS to allocate memory around some 64-bit base address instead. This way you can have multiple bases for an address space larger than 4 GB.
Google's V8 engine uses this to compress pointers to 32 bits, both to save memory and to improve performance. See their comparison of the memory and performance improvements. They even discuss a nice optimization: keeping the base in the FS/GS segment register, which frees up another general-purpose register.
Or, if your pointers are always aligned, you can drop the low bits to address a larger amount of memory, as in the JVM's "compressed Oops", which always address 8-byte-aligned objects.
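The shifted variant can be sketched as follows. This is only in the spirit of compressed Oops, not HotSpot's actual implementation: encode, decode, kShift and kReach are invented names, and null handling is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Objects are 8-byte aligned, so the low 3 bits of every offset are zero.
constexpr unsigned kShift = 3;

// Bytes reachable from one base with a 32-bit reference and this shift (32 GiB).
constexpr std::uint64_t kReach = (std::uint64_t{1} << 32) << kShift;

// Compress: subtract the base, then drop the always-zero low bits.
inline std::uint32_t encode(std::uintptr_t base, const void* p) {
    std::uintptr_t delta = reinterpret_cast<std::uintptr_t>(p) - base;
    assert((delta & ((std::uintptr_t{1} << kShift) - 1)) == 0 && "object not 8-byte aligned");
    assert((delta >> kShift) <= UINT32_MAX && "object beyond the reachable window");
    return static_cast<std::uint32_t>(delta >> kShift);
}

// Decompress: shift the low bits back in, then add the base.
inline void* decode(std::uintptr_t base, std::uint32_t ref) {
    return reinterpret_cast<void*>(base + (static_cast<std::uintptr_t>(ref) << kShift));
}
```

On x86-64 the decode step maps onto a single scaled addressing mode (base + index * 8), which is part of why a shift of 3 costs little.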
See also How does the compressed pointer implementation in V8 differ from JVM's compressed Oops?
I don't expect it to be very hard to support such a model in the OS. About the only thing that needs to change for processes in this model is page management: pages must be allocated below the 4 GB point. The kernel, too, should allocate its buffers from the first 4 GB of the virtual address space if it passes them to the application. The same applies to the loader that loads and starts applications. Other than that, a 64-bit kernel should be able to handle such apps without major modifications.
Compiler support shouldn't be a big issue either. It's mostly a matter of generating code that can use the extra CPU registers and their full 64 bits, and adding the proper REX prefixes whenever needed.
It's called "x86-32 emulation", or WOW64 on Windows (presumably something else on other OSes) and it's a hardware flag in the processor. No need for any user-mode tricks here.