I have to do a university project where we have to use cache optimizations to improve the performance of a given code, but we must not use compiler optimizations to achieve it.

One of the ideas I had reading the bibliography is to align the beginning of a basic block to the cache line size. But can you do something like:

asm(".align 64");
for(int i = 0; i < N; i++)
    ... (whole basic block)

in order to achieve what I'm looking for? I have no idea if it's possible to do that in terms of instruction alignment. I've seen some tricks like _mm_malloc to achieve data alignment, but none for instructions. Could anyone please shed some light on the matter?
TL:DR: This might not be very useful (since modern x86 with a uop cache often doesn't care about code alignment; see footnote 1), but it does "work" in front of a do{}while() loop, which can compile directly to asm with the same layout, without any loop setup (prologue) instructions before the actual top of the loop (the target of the backwards branch).

In general, https://gcc.gnu.org/wiki/DontUseInlineAsm, and especially never use GNU C Basic asm("foo"); inside a function. But in debug mode (the -O0 default, aka optimizations disabled), each statement (including asm();) compiles to a separate block of asm in source order. So your case doesn't actually need Extended asm(".p2align 4" ::: "memory") to order the asm statement wrt. memory operations. (Also, in recent GCC, a memory clobber is implicit for Basic asm with a non-empty template string.) At worst, with optimization enabled, the padding could go somewhere useless and hurt performance, but not correctness, unlike most uses of asm().

How this actually compiles
This does not exactly work as written; a C for loop compiles to some asm instructions before the asm loop, especially when using a for(a;b;c) loop with some before-first-iteration initialization in statement a! You can of course pull that out in the source, but GCC's -O0 strategy for compiling while and for loops is to enter the loop with a jmp to the condition at the bottom.

But that jmp alone is only one small (2-byte) instruction, so aligning before it would put the top of the loop near the start of a possible instruction-fetch block, which still gets most of the benefit if that was ever a bottleneck. (Or near the start of a new group of uop-cache lines on Sandybridge-family x86, where 32-byte boundaries are relevant. Or even a 64-byte I-cache line, although that's rarely relevant and could result in a lot of NOPs executed to reach that boundary, and bloated code size.)

It compiles as follows on the Godbolt compiler explorer. Note that the way I used register meant I got not-terrible asm despite the debug build, and didn't have to combine p++ into p++ <= endp or *(p++) += 123; to make store/reload overhead less bad (because there isn't any in the first place for register locals). And I used a pointer increment / compare to keep the asm simple, and harder for debug mode to deoptimize into more wasted asm instructions.
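For concreteness, here is a minimal sketch of the kind of source under discussion, reconstructed from the description above (an assumption, not the answer's exact Godbolt code; the array size is chosen to match the lea rbx, [rax+409600] mentioned below):

static int arr[409600 / sizeof(int)];   /* hypothetical global array, 409600 bytes */

void inc_array(void)
{
    register int *p = arr;              /* register locals: no store/reload even at -O0 */
    register int *endp = arr + 409600 / sizeof(int);
    asm(".p2align 5");                  /* align the loop top by 32 bytes */
    do {                                /* do{}while: the branch target is the real top */
        *(p++) += 123;                  /* one dword load + store per iteration */
    } while (p < endp);
}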
If you use "binary" output mode on Godbolt, you can see that
lea rbx, [rax+409600]
is a 7-byte instruction, whilejmp .L2
is 2 bytes, and that the address of the top of the loop is0x401149
, i.e. 9 bytes into the 16-byte fetch-block, on CPUs that fetch in that size. I aligned by 32, so it's only wasted 2 uops out of the first uop cache line associated with this block, so we're still relatively good in term of 32-byte blocks.(Godbolt "binary" mode compiles and links into an executable, and runs
objdump -d
on that. That also lets us see the.p2align
directive expanded into a NOP instruction of some width, or more than one if it had to skip more than 11 bytes, the default max NOP width for GAS for x86-64. Remember these NOP instructions have to get fetched and go through the pipeline every time control passes over this asm statement, so huge alignment inside a function is a bad thing for that as well as for I-cache footprint.)A fairly obvious transformation gets the LEA before the
.p2align
. (See the asm in the Godbolt link for all of these source versions if you're curious).Or
Or while (p < endp){... ; p++} also does the trick. The top of the asm loop then has only a 2-byte jmp (to the loop condition at the bottom) between the alignment padding and the actual loop body. So this is pretty decent, and gets most of the benefit.

It might be possible to achieve the same thing with for(foo=bar, asm(".p2align 4") ; p<endp ; p++). But if you're declaring a variable in the first part of a for statement, the comma operator won't work to let you sneak in a separate statement, so you'd hoist the initialization out of the for header, as in the sketch below.
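A minimal sketch of that hoisted-initialization variant (names are placeholders; assumes GNU C for the asm() statement):

void bump_for(int *arr, int *endp)
{
    register int *p = arr;      /* loop setup hoisted out of the for() header */
    asm(".p2align 4");          /* padding lands after the setup, just before
                                   the jmp-to-condition that -O0 emits */
    for (; p < endp; p++)
        *p += 123;
}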
To actually align the asm loop, we can write it as a do{}while: no jmp at the start, and no branch-target label inside the loop. (Which is interesting if you wanted to try -falign-labels=32 to get GCC to pad for you without having it put NOPs inside the loop. See below: -falign-loops doesn't work at -O0.)

Since I'm hard-coding a non-zero size, no p == endp check runs before the first iteration. If that length were a runtime variable, e.g. a function arg, you could do if(n==0) return; before the loop (sketched below). Or more generally, put the loop inside an if, like GCC does when compiling a for or while loop with optimization enabled, if it can't prove that it always runs at least one iteration.
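A sketch of that guard pattern with a runtime length (the signature is hypothetical, not from the answer):

void bump_n(int *arr, long n)
{
    if (n == 0) return;         /* guard: the do{}while assumes at least 1 iteration */
    register int *p = arr;
    register int *endp = arr + n;
    asm(".p2align 4");
    do {
        *(p++) += 123;
    } while (p < endp);
}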
Getting GCC to do this for you: -falign-loops=16 doesn't work at -O0
GCC -O2 enables -falign-loops=16:11:8 or something like that (align by 16 if that would skip fewer than 11 bytes, otherwise align by 8). That's why GCC uses a sequence of two .p2align directives, with a padding limit on the first one (see the GAS manual).

But using -falign-loops=16 does nothing at -O0. It seems GCC -O0 doesn't know what a loop is. :P

However, GCC does respect -falign-labels even at -O0. But unfortunately that applies to all labels, including the loop entry point inside the inner loop (Godbolt). Putting a NOP inside the inner-most loop is worse than misaligning its start on modern x86 CPUs.
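If you wanted to mimic that two-step padding by hand, the same directive pair fits in one asm() statement (a sketch using the GAS .p2align max-skip argument; the exact limits GCC picks vary by tuning):

asm(".p2align 4,,10\n\t"    /* align to 16, but only if that skips at most 10 bytes */
    ".p2align 3");          /* otherwise at least make sure we're 8-byte aligned */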
You don't have this problem with a do{}while() loop, but in that case it also seems to work to use asm() to put an alignment directive there.

(I used How to remove "noise" from GCC/clang assembly output? for the compile options to minimize clutter without filtering out directives, which would include .p2align. If I just wanted to see where the inline asm went, I could have used asm("nop #hi mom") to make it visible with directives filtered out.)

If you can use inline asm but must compile with anti-optimized debug mode, there are likely major speedups from rewriting the whole inner loop in inline asm with input/output constraints, as sketched below. (But don't really do that; it's hard to get right, and in real life a normal person would just enable optimizations as a first step.)
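For illustration only, such a rewrite might look like this (Extended asm, AT&T syntax; the function and names are assumptions, and it requires at least one iteration):

void bump_asm(int *p, int *endp)    /* assumes p < endp on entry */
{
    asm volatile(
        "0:\n\t"
        "addl $123, (%0)\n\t"       /* *p += 123 */
        "addq $4, %0\n\t"           /* p++ */
        "cmpq %1, %0\n\t"
        "jb 0b"                     /* loop while p < endp (unsigned compare) */
        : "+r"(p)                   /* p is read and written */
        : "r"(endp)                 /* endp is an input */
        : "memory", "cc");          /* we touch memory and the flags */
}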
Footnote 1: code alignment doesn't help much on modern x86, may help some on others
This is unlikely to be helpful even if you do actually align the target of the backwards branch (rather than just some loop prologue); modern x86 CPUs with uop caches (Sandybridge-family and Zen-family) and loop buffers (Nehalem and later for Intel) don't care very much about loop alignment.
It could help more on an older x86 CPU, or maybe for some other ISAs; only x86 is so hard to decode that uop caches are a thing. (You didn't actually specify x86, but currently most people use x86 CPUs in their desktops/laptops, so I'm assuming that.)

The main reason alignment of branch targets helps (especially at the top of a loop) is that when the CPU fetches a 16-byte-aligned block that includes the target address, most of the machine code in that block will be after it, and thus part of the loop body that's about to run another iteration. (Bytes before the branch target are wasted in that fetch cycle.)
But the worst case of misalignment (barring other weird effects) just costs you 1 extra cycle of front-end fetch to get more instructions in the loop body. (E.g. if the top of the loop had an address ending with 0xf, so it was the last byte of a 16-byte block, the aligned 16-byte block containing that byte would only contain that one useful byte at the end.) That might be a one-byte instruction like cdq, but pipelines are often 4 instructions wide, or more.

(Or 3-wide in the early Intel P6-family days, before there were buffers between fetch, pre-decode (length finding) and decode. Buffering can hide bubbles if the rest of the loop decodes efficiently and the average instruction length is short. But decode was still a significant bottleneck until Nehalem's loop buffer could recycle the decode results (uops) for a small loop (a couple dozen uops). And Sandybridge-family added a uop cache to cache large loops that include multiple functions that get called frequently. David Kanter's deep-dive on SnB has nice block diagrams, and see also https://www.agner.org/optimize/, especially Agner's microarch pdf.)
Even then, it only helps at all when front-end (instruction fetch/decode) bandwidth is a problem, not some back-end bottleneck (actually executing those instructions). Out-of-order exec usually does a pretty good job of letting the CPU run as fast as the slowest bottleneck, not waiting until after a cache-miss load to get later instructions fetched and decoded. (See this, this, and especially Modern Microprocessors A 90-Minute Guide!.)
There are cases where it could help on a Skylake CPU, where a microcode update disabled the loop buffer (LSD), so a tiny loop body split across a 32-byte boundary can run at best 1 iteration per 2 cycles (fetching uops from 2 separate cache lines). Or on Skylake again, tweaking code alignment this way could help avoid the JCC erratum (which can make part of your code run from legacy decode instead of the uop cache) if you can't pass -Wa,-mbranches-within-32B-boundaries to get the assembler to work around it. (How can I mitigate the impact of the Intel jcc erratum on gcc?) These problems are specific to Skylake-derived microarchitectures, and were fixed in Ice Lake.

Of course, anti-optimized debug-mode code is so bloated that even a tight loop is unlikely to be fewer than 8 uops anyway, so the 32-byte-boundary problem probably doesn't hurt much. But if you manage to avoid store/reload latency bottlenecks by using register on local vars (yes, this does something in debug builds only, otherwise it's meaningless; see footnote 2), the front-end bottleneck of getting all those inefficient instructions through the pipeline could well be impacted on a Skylake CPU if an inner loop ends up tripping over the JCC erratum due to where a conditional branch inside or at the bottom of the loop ends up.

Anyway, as Eric commented, your assignment is likely more about data access pattern, and possibly layout and alignment. Presumably it involves a smallish loop over some large amount of memory, since L2 or L3 cache misses are the only thing that would be slow enough to be more of a bottleneck than building with optimization disabled. Maybe L1d in some cases, if you manage to get a compiler to make non-terrible asm for debug mode, or if load-use latency (not just throughput) is part of the critical path.
Footnote 2: -O0 is dumb, but register int i can help

See C loop optimization help for final assignment (with compiler optimization disabled) re: how silly it is to optimize source code for debug mode, or to benchmark that way for normal use-cases. But that answer also mentions some things that are faster for that case (unlike normal builds), like doing more in a single statement or expression, since the compiler doesn't keep things in registers across statements.
(See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details)
Except register variables; that obsolete keyword still does something for unoptimized builds with GCC (but not clang). It's officially deprecated or even removed in recent C++ versions, but not in C as yet.

You definitely want to use register int i to let a debug build keep it in a register, and to write your C like it was hand-written asm. For example, use pointer increments instead of arr[i] where appropriate, especially for ISAs that don't have an indexed addressing mode.

register variables are most important inside your inner loop, and with optimization disabled the compiler probably isn't very smart about deciding which register var actually gets a register if it runs out. (x86-64 has 15 integer regs other than the stack pointer, and a debug build will spend one of them on a frame pointer.) This matters especially for variables that change inside loops, to avoid store/reload latency bottlenecks: e.g. for(register int i=1000000 ; --i ; ); probably runs 1 iteration per clock, vs. 5 or 6 without register, on a modern x86-64 CPU like Skylake.
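As a self-contained sketch of that comparison (a hypothetical delay loop; try it at -O0 with and without the keyword):

void spin(void)
{
    /* with 'register', i stays in a register even at -O0;
       without it, every --i is a store + reload (~5-6 cycles) */
    for (register int i = 1000000; --i; )
        ;
}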
If using an integer variable as an array index, make it intptr_t or uintptr_t (#include <stdint.h>) if possible, so the compiler doesn't have to redo sign-extension from 32-bit int to 64-bit pointer width for use in addressing modes.

(Unless you're compiling for AArch64, which has addressing modes that take a 64-bit register and a 32-bit register, doing sign or zero extension and ignoring high garbage in the narrow integer reg. This matters exactly because it's something compilers can't always optimize away, although often they can, thanks to signed-integer overflow being Undefined Behaviour allowing the compiler to widen an integer loop variable or convert to a pointer increment.)
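A small sketch of that pointer-width indexing suggestion (the function and names are illustrative, not from the answer):

#include <stdint.h>

void scale(int *arr, intptr_t n)
{
    /* pointer-width index: the addressing mode for arr[i] needs no
       repeated sign-extension, even in an unoptimized build */
    for (register intptr_t i = 0; i < n; i++)
        arr[i] *= 2;
}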
Also loosely related: Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs has a section on intentionally making things slow via cache effects, so do the opposite of that. Might not be very applicable, IDK what your problem is like.