What does it mean to align the stack?

Posted 2024-10-02 07:02:49

I have been a high-level coder, and architectures are pretty new to me, so I decided to read the tutorial on Assembly here:

http://en.wikibooks.org/wiki/X86_Assembly/Print_Version

Far down the tutorial, instructions on how to convert the Hello World! program

#include <stdio.h>

int main(void) {
    printf("Hello, world!\n");
    return 0;
}

into equivalent assembly code were given, and the following was generated:

        .text
LC0:
        .ascii "Hello, world!\12\0"
.globl _main
_main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        andl    $-16, %esp
        movl    $0, %eax
        movl    %eax, -4(%ebp)
        movl    -4(%ebp), %eax
        call    __alloca
        call    ___main
        movl    $LC0, (%esp)
        call    _printf
        movl    $0, %eax
        leave
        ret

For one of the lines,

andl    $-16, %esp

the explanation was:

This code "and"s ESP with 0xFFFFFFF0,
aligning the stack with the next
lowest 16-byte boundary. An
examination of Mingw's source code
reveals that this may be for SIMD
instructions appearing in the "_main"
routine, which operate only on aligned
addresses. Since our routine doesn't
contain SIMD instructions, this line
is unnecessary.

I do not understand this point. Can someone give me an explanation of what it means to align the stack with the next 16-byte boundary and why it is required? And how is the andl achieving this?

Comments (6)

故人如初 2024-10-09 07:02:49

Assume the stack looks like this on entry to _main (the address of the stack pointer is just an example):

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230

Push %ebp, and subtract 8 from %esp to reserve some space for local variables:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+-----------------+  <--- 0xbfff1224

Now, the andl instruction zeroes the low 4 bits of %esp, which may decrease it; in this particular example, it has the effect of reserving an additional 4 bytes:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+ - - - - - - - - +  <--- 0xbfff1224
:   extra space   :
+-----------------+  <--- 0xbfff1220

The point of this is that there are some "SIMD" (Single Instruction, Multiple Data) instructions (also known in x86-land as "SSE" for "Streaming SIMD Extensions") which can perform parallel operations on multiple words in memory, but require those multiple words to be a block starting at an address which is a multiple of 16 bytes.

In general, the compiler can't assume that particular offsets from %esp will result in a suitable address (because the state of %esp on entry to the function depends on the calling code). But, by deliberately aligning the stack pointer in this way, the compiler knows that adding any multiple of 16 bytes to the stack pointer will result in a 16-byte aligned address, which is safe for use with these SIMD instructions.
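The three prologue steps in the diagrams can be traced in C. This is only a rough sketch: the function name `trace_prologue` is made up for illustration, and the starting address is the same example value used above, not something a real program is guaranteed to see.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mimics what the prologue does to %esp,
   given the value %esp had on entry to _main. */
uint32_t trace_prologue(uint32_t esp)
{
    esp -= 4;               /* pushl %ebp: one 4-byte slot           */
    esp -= 8;               /* subl $8, %esp: reserved local space   */
    esp &= (uint32_t)-16;   /* andl $-16, %esp: clear the low 4 bits */
    assert(esp % 16 == 0);  /* the result is always 16-byte aligned  */
    return esp;
}
```

Calling `trace_prologue(0xbfff1230u)` walks through 0xbfff122c and 0xbfff1224 and ends at 0xbfff1220, matching the last diagram. If the value after the subtraction already happens to be a multiple of 16, the `andl` step changes nothing.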

贱贱哒 2024-10-09 07:02:49

This does not sound stack-specific; it is about alignment in general. It may help to think in terms of integer multiples.

If the items in memory are one byte in size, every address is aligned. For items two bytes in size, addresses that are integer multiples of 2 (0, 2, 4, 6, 8, and so on) are aligned, and the others (1, 3, 5, 7) are not. For items 4 bytes in size, integer multiples of 4 (0, 4, 8, 12, and so on) are aligned; 1, 2, 3, 5, 6, 7, and so on are not. The same goes for 8 (0, 8, 16, 24) and for 16 (0, 16, 32, 48, 64), and so on.

What this means is that you can look at the base address of an item and determine whether it is aligned.

size in bytes, address in the form of 
1, xxxxxxx
2, xxxxxx0
4, xxxxx00
8, xxxx000
16,xxx0000
32,xx00000
64,x000000
and so on
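The bit patterns in this table translate into a one-line check in C. A minimal sketch, assuming the alignment is a power of two; `is_aligned` is a hypothetical helper name, not a standard function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of n.
   n must be a power of two, so n - 1 is a mask of the low bits
   that the table shows as zeros. */
bool is_aligned(uintptr_t addr, uintptr_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  /* power-of-two check */
    return (addr & (n - 1)) == 0;
}
```

For instance, `is_aligned(0xFF82, 8)` is false because the low three bits of 0xFF82 are 010, while `is_aligned(0xFF82, 2)` is true because the lowest bit is 0.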

In the case of a compiler mixing data in with instructions in the .text segment, it is fairly straightforward to align data as needed (well, it depends on the architecture). But the stack is a runtime thing; the compiler cannot normally determine where the stack will be at run time. So if you have local variables that need to be aligned, the code has to adjust the stack programmatically at run time.

Say, for example, you have two 8-byte items on the stack, 16 bytes in total, and you really want them aligned on 8-byte boundaries. On entry the function would subtract 16 from the stack pointer as usual to make room for these two items. But aligning them takes more code. If the stack pointer after subtracting 16 were 0xFF82, the lower 3 bits would not be 0, so it would not be aligned. The lower three bits are 0b010. In a generic sense we want to subtract 2 from 0xFF82 to get 0xFF80. We find that 2 by ANDing with 0b111 (0x7), and then we subtract that amount. That means two ALU operations, an AND and a subtract. But we can take a shortcut: if we AND with the ones' complement of 0x7 (~0x7 = 0xFFFF...FFF8), we get 0xFF80 using one ALU operation (so long as the compiler and processor have a single-opcode way to do that; if not, it may cost you more than the AND and subtract).

This appears to be what your program was doing. ANDing with -16 is the same as ANDing with 0xFFFF...FFF0, resulting in an address that is aligned on a 16-byte boundary.
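The two routes to 0xFF80 can be compared side by side in C. This is just an illustrative sketch for the 8-byte case; the function names are hypothetical, and a real compiler is free to emit something different.

```c
#include <assert.h>
#include <stdint.h>

/* Two ALU operations: isolate the low 3 bits, then subtract them. */
uint32_t align8_and_sub(uint32_t sp)
{
    uint32_t low = sp & 0x7u;
    return sp - low;
}

/* One ALU operation: mask the low 3 bits away directly. */
uint32_t align8_mask(uint32_t sp)
{
    uint32_t r = sp & ~0x7u;
    assert(r == align8_and_sub(sp));  /* both routes always agree */
    return r;
}
```

Both functions turn 0xFF82 into 0xFF80 and leave an already-aligned value such as 0xFF80 untouched.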

So to wrap this up, if you have something like a typical stack pointer that works its way down memory from higher addresses to lower addresses, then you want to

 
sp = sp & (~(n-1))

where n is the number of bytes to align to (n must be a power of two, but that is okay, since alignment usually involves powers of two). If you have, say, done a malloc (addresses increase from low to high) and want to align the address of something (remember to malloc more than you need by at least the alignment size), then

if (ptr & (n-1)) { ptr = (ptr + n) & ~(n-1); }

Or, if you prefer, take the if out and perform the add and mask every time (an already-aligned pointer then moves up by a full n, which the extra malloc space covers).
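Here is one way the round-up case might look in C. It is a sketch under a few assumptions: it uses the common `(p + n - 1)` form of the add-and-mask, which leaves an already-aligned address where it is, and the names `align_up` and `alloc_aligned` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round p up to the next multiple of n (n must be a power of two).
   An already-aligned p is returned unchanged. */
uintptr_t align_up(uintptr_t p, uintptr_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return (p + n - 1) & ~(n - 1);
}

/* Hypothetical usage: over-allocate by n bytes, then align inside the block.
   The caller must keep *raw and pass it to free(), not the aligned pointer. */
void *alloc_aligned(size_t size, size_t n, void **raw)
{
    *raw = malloc(size + n);
    if (*raw == NULL)
        return NULL;
    return (void *)align_up((uintptr_t)*raw, n);
}
```

The extra `n` bytes guarantee that `size` usable bytes remain after the pointer is moved forward. Where it is available, C11's `aligned_alloc` does this job without the manual bookkeeping.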

Many/most non-x86 architectures have alignment rules and requirements. x86 is overly flexible as far as the instruction set goes, but as far as execution goes you can/will pay a penalty for unaligned accesses on an x86, so even though you can do it, you should strive to stay aligned as you would with any other architecture. Perhaps that is what this code was doing.

命硬 2024-10-09 07:02:49

This has to do with byte alignment. Certain architectures require that addresses used for a specific set of operations be aligned to specific boundaries.

That is, if you wanted 64-bit alignment for a pointer, for example, then you could conceptually divide the entire addressable memory into 64-bit chunks starting at zero. An address would be "aligned" if it fit exactly into one of these chunks, and not aligned if it took part of one chunk and part of another.

A significant feature of byte alignment (assuming the alignment is a power of 2, say 2^X bytes) is that the least-significant X bits of an aligned address are always zero. This allows the processor to represent more addresses with fewer bits by simply not using the bottom X bits.
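The "fewer bits" idea can be illustrated in C by dropping the known-zero bits and shifting them back in later. This is only a sketch of the principle, assuming 8-byte alignment; `ALIGN_BITS`, `pack_aligned` and `unpack_aligned` are hypothetical names, not anything a particular processor exposes.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_BITS 3u   /* assumed: addresses are 8-byte aligned (2^3) */

/* Drop the low bits, which are known to be zero for an aligned address. */
uint32_t pack_aligned(uint32_t addr)
{
    assert((addr & ((1u << ALIGN_BITS) - 1u)) == 0);
    return addr >> ALIGN_BITS;
}

/* Restore the original address by shifting the zeros back in. */
uint32_t unpack_aligned(uint32_t packed)
{
    return packed << ALIGN_BITS;
}
```

With three bits saved per address, a field of a given width can cover eight times as much memory as it could with unaligned byte addresses.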

是你 2024-10-09 07:02:49

When the processor loads data from memory into a register, it accesses memory by a base address and a size. For example, it will fetch 4 bytes from address 10100100. Notice that there are two zeros at the end of that address. That's because the four bytes are stored so that only the leading bits 101001 are significant. (The processor really accesses them as 101001XX, with the low two bits as "don't care".)

So to align something in memory means to rearrange data (usually through padding) so that the desired item's address will have enough zero bits at the end. Continuing the above example, we can't fetch 4 bytes from 10100101 since the last two bits aren't zero; on architectures that require alignment, that would cause a bus error. So we must bump the address up to 10101000 (and waste three address locations in the process).

The compiler does this for you automatically, and the result shows up in the assembly code.

Note that this shows up as an optimization opportunity in C/C++:

#include <iostream>
using namespace std;

struct first {
    char letter1;
    int number;
    char letter2;
};

struct second {
    int number;
    char letter1;
    char letter2;
};

int main ()
{
    cout << "Size of first: " << sizeof(first) << endl;
    cout << "Size of second: " << sizeof(second) << endl;
    return 0;
}

On a typical platform with a 4-byte int, the output is

Size of first: 12
Size of second: 8

Rearranging the two chars means that the int is already properly aligned at the start of the struct, so the compiler doesn't have to insert padding in front of it, and the two chars can sit next to each other after it. That's why the second struct is smaller.
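To see where the padding actually goes, the member offsets can be printed with `offsetof`. A small C sketch (C11, for `_Alignof`); the function name `print_layouts` is made up, and the exact numbers it prints depend on the compiler and target, so none are claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct first  { char letter1; int number; char letter2; };
struct second { int number; char letter1; char letter2; };

/* Print where each member lands; gaps between members are padding. */
void print_layouts(void)
{
    printf("first:  letter1 @%zu, number @%zu, letter2 @%zu, size %zu\n",
           offsetof(struct first, letter1), offsetof(struct first, number),
           offsetof(struct first, letter2), sizeof(struct first));
    printf("second: number @%zu, letter1 @%zu, letter2 @%zu, size %zu\n",
           offsetof(struct second, number), offsetof(struct second, letter1),
           offsetof(struct second, letter2), sizeof(struct second));
    /* Whatever the platform, the int member sits at a multiple of its alignment. */
    assert(offsetof(struct first, number) % _Alignof(int) == 0);
}
```

In `struct first`, any gap between `letter1` and `number` is padding the compiler added to keep the int aligned; in `struct second` the int comes first, so no such gap is needed.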

为你拒绝所有暧昧 2024-10-09 07:02:49

Imagine this "drawing"

addresses
 xxx0123456789abcdef01234567 ...
    [------][------][------] ...
registers

Values at addresses that are multiples of 8 "slide" easily into (64-bit) registers

addresses
         56789abc ...
    [------][------][------] ...
registers

Of course, registers "walk" in steps of 8 bytes.

Now, putting the value at address xxx5 into a register is much more difficult :-)


Edit: andl -16

-16 is 11111111111111111111111111110000 in binary.

When you "and" anything with -16, you get a value with the last 4 bits set to 0, that is, a multiple of 16.
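The same point can be checked in C. A minimal sketch, assuming 32-bit values as in the question; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* -16 converted to an unsigned 32-bit value is the mask 0xFFFFFFF0. */
uint32_t mask_minus_16(void)
{
    return (uint32_t)-16;
}

/* ANDing with that mask clears the last 4 bits, rounding x down
   to a multiple of 16 (by at most 15). */
uint32_t and_minus_16(uint32_t x)
{
    uint32_t r = x & mask_minus_16();
    assert(r % 16 == 0 && r <= x && x - r < 16);
    return r;
}
```

For example, `and_minus_16(0x12345u)` gives 0x12340, and a value that is already a multiple of 16 comes back unchanged.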

梦冥 2024-10-09 07:02:49

It should only be at even addresses, not odd ones, because accessing odd addresses carries a performance penalty.
