Does the L1 cache accept new incoming requests while its Line Fill Buffers (LFBs) are fully exhausted?

I wonder if the L1 cache still receives new requests that hit in L1d, making forward progress for the pipeline, when the Line Fill Buffers (LFBs, or MSHRs) are full?

Or could anybody help me write a microbenchmark that can tell me if it does or not?

I know how to saturate the LFBs (e.g. strided indirect accesses, hash tables, ...), and there are a few useful performance counters available on modern Intel CPUs to measure L1 MLP and to count FB-full events, but I still couldn't figure out whether it does so or not.

夜光 2025-02-04 09:26:52

Loads can hit in L1d regardless of LFBs; no interaction.

I'd be shocked if load execution units couldn't hit in L1d while all LFBs were occupied, even when they're all waiting for other loads. A load has to snoop the LFBs (to see recent NT stores, or regular stores that have committed to an LFB while waiting for an RFO, under the limited conditions where that can happen without violating the memory-ordering rules), but a load doesn't need to allocate one on an L1d hit.

It's always fun to try to test an assumption, though, and in fact a good idea to test theory with experiment. In this case, I think I've found fairly good evidence that L1d load hits can still happen while all LFBs are full (with partial-line NT stores).

Sorry this answer is a bit rambling, with the code at the bottom. I played around with a bunch of iteration counts before choosing which one to actually show, and didn't go back to tidy up the wording. So you can maybe get some experience of the process of thinking up an experiment and refining it, if there's any upside to this. :P


L1d load hits have very high throughput (2/clock from SnB onward, 3/clock in Alder Lake). But it would be hard to distinguish a bottleneck on that from a bottleneck on whatever runs out of LFBs. Perhaps looking at the latency of L1d loads in a pointer-chasing scenario like mov rax, [rax] could more easily detect lost cycles of progress, while staying far away from other throughput limits? (And it makes the limited RS / ROB size "last longer" in terms of cycles, leaving room to sneak some stores in.)

Or maybe we should avoid trying to push close to having all LFBs occupied as a steady state condition, because trying to balance that with a load latency bottleneck would be hard to distinguish from the stores on their own just becoming a real throughput bottleneck.

Instead do a burst of NT stores occasionally, or something else that will occupy all 10, 12, or however many LFBs you have in whatever generation of Intel CPU. With the store buffer as well to absorb that burst of NT stores, we can fill all the LFBs some of the time, without expecting to create an overall throughput bottleneck or any bubbles in the latency dep-chain. So the CPU can absorb the burst and have the front-end get back to issuing uops from our dep chain.

NT stores are a good choice: they need LFBs until being handed off, and partial-line NT stores that we never complete will sit in an LFB until evicted to make room for more. (When NT stores do write all the bytes in a cache line, it flushes itself; that's the normal use-case.)
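For concreteness, here's a minimal sketch of the two cases (my illustration rather than part of the benchmark; rdi and rsi are assumed to point at separate 64-byte-aligned cache lines):

 ; full-line NT store: both 32-byte halves of the 64-byte line written,
 ; so the LFB hands the completed line off to memory on its own
 vmovntdq [rdi   ], ymm0
 vmovntdq [rdi+32], ymm0

 ; partial-line NT store: only the first half written; this LFB sits
 ; waiting for the rest of the line until evicted to make room
 vmovntdq [rsi], ymm0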


perf stat measures the whole program, but with no other code running in user-space, startup overhead is minimal. Only a couple page faults. Letting it run for a good while, close to a second, means the few milliseconds for clock speed to jump to full is negligible.

On i7-6700k Skylake (with DDR4-2666 memory) on Arch GNU/Linux, with energy_performance_preference = balance_performance, it only goes to 3.9GHz, not 4.2 for sustained periods, which keeps the fans near-silent. But it ramps to that speed very quickly, and can maintain it on all cores indefinitely, so interrupts on other cores and stuff don't disturb things.
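For reference, on Linux with the intel_pstate driver that preference lives in sysfs; a sketch (the exact interface depends on your kernel and driver):

 cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
 echo balance_performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference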

Tested with partial-line NT stores to 32 contiguous lines. (As a burst of store activity between 100 iters x 8 reps of the mov/imul dep chain, i.e. 100x 17 uops of that loop.) See the source at the bottom of this answer. I later ended up going with a somewhat shorter dep chain, so bursts of store activity could overlap with more of the total run time without being so long that they stall. So if they were going to have an effect, it would be more visible.

$ asm-link -dn "$t".asm -DNTSTORE_ITERS=18
+ nasm -felf64 -Worphan-labels lfb-test.asm -DNTSTORE_ITERS=18
+ ld -o lfb-test lfb-test.o
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,ld_blocks_partial.address_alias,ld_blocks.store_forward -r3 ./"$t"

 Performance counter stats for './lfb-test' (3 runs):

          1,647.24 msec task-clock                #    1.000 CPUs utilized            ( +-  0.02% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 2      page-faults               #    1.214 /sec                   
     6,421,016,156      cycles                    #    3.897 GHz                      ( +-  0.00% )
     1,895,000,506      instructions              #    0.30  insn per cycle           ( +-  0.00% )
           113,936      exe_activity.bound_on_stores #   69.158 K/sec                    ( +- 50.67% )
           163,512      resource_stalls.sb        #   99.250 K/sec                    ( +- 44.22% )
                 0      ld_blocks_partial.address_alias #    0.000 /sec                   
                 0      ld_blocks.store_forward   #    0.000 /sec                   

          1.647758 +- 0.000279 seconds time elapsed  ( +-  0.02% )

So 6421M cycles instead of 6400M means we're just barely getting to the point where OoO exec starts to lose a bit of progress on the load/imul dep chain, maybe due to limited RS (scheduler) size. (See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for analysis of this kind of impact on a long dep chain.)

The 0 ld_blocks counts show that I successfully avoided 4k aliasing with the way I chose addresses for the pointer-chasing mov rax,[rax] vs. the buffer.


Store part only

We can test the stores alone to make sure they'd take a non-negligible fraction of the total time if there wasn't overlap. We want to verify that the store part of the workload isn't like 100x faster than the ALU dep chain, in which case it might be lost in the noise even if it was stalling the latency dep chain.

I edited the load/imul chain to use mov ecx,1 and %rep 0, so just one not-taken dec/jnz.
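In other words, the top of the loop body becomes just this (a sketch of that edit; compare with the full source at the bottom):

 .loop:
   mov   ecx, 1        ; single inner iteration
   .inner:
   %rep 0              ; emit zero copies of the load/imul pair
    mov  rax, [rax]
    imul rax, rax, 1
   %endrep
    dec  ecx
    jnz  .inner        ; not taken: falls straight through to the store burst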

# no latency dep-chain
# NTSTORE_ITERS=16  (32 NT stores to 32 cache lines)

$ t=lfb-test; asm-link -dn "$t".asm -DNTSTORE_ITERS=16 && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,br_misp_retired.all_branches_pebs,int_misc.recovery_cycles_any -r3 ./"$t"

            411.00 msec task-clock                #    0.999 CPUs utilized            ( +-  0.06% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 2      page-faults               #    4.863 /sec                   
     1,601,892,487      cycles                    #    3.895 GHz                      ( +-  0.00% )
        87,000,133      instructions              #    0.05  insn per cycle           ( +-  0.00% )
     1,567,641,964      exe_activity.bound_on_stores #    3.812 G/sec                    ( +-  0.01% )
     1,567,641,964      resource_stalls.sb        #    3.812 G/sec                    ( +-  0.01% )
               405      br_misp_retired.all_branches_pebs #  984.826 /sec                     ( +- 10.91% )
            16,606      int_misc.recovery_cycles_any #   40.380 K/sec                    ( +-  8.02% )

          0.411499 +- 0.000250 seconds time elapsed  ( +-  0.06% )

The total cycle count scales linearly with -DNTSTORE_ITERS=n from about 9 upward, with exe_activity.bound_on_stores and resource_stalls.sb essentially equal to cycles.

The last two counters are measuring branch misses, and total front-end cycles lost to re-steer and other recovery from things like branch misses.
Branch misses are usually negligible at 19 inner loop iterations or lower, but at 21 or higher we get almost exactly one mispredict per outer-loop iteration, i.e. the last inner iteration every time.

For NTSTORE_ITERS=6 or lower, it's vastly faster (14M cycles for 1M outer iterations = 12M NT stores), which makes sense because Skylake has 12 LFBs. NT stores hit in the same partial LFB, not needing to evict anything, so there's no off-core bottleneck. n=7 (14 lines) takes ~390M cycles, n=8 (16 lines) takes ~600M +- 30M cycles. For n=10 (20 lines) we get 990M cycles.
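To reproduce this scaling and find the knee on your own CPU, a sweep along these lines works (same build and perf commands as earlier; perf stat prints to stderr, hence the redirect):

 for n in 4 5 6 7 8 10 12 16 20; do
   nasm -felf64 lfb-test.asm -DNTSTORE_ITERS=$n && ld -o lfb-test lfb-test.o
   echo "n=$n"
   taskset -c 3 perf stat --all-user -e cycles -r3 ./lfb-test 2>&1 | grep cycles
 done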

This extremely fast speed with n=6 holds up when the load dep chain is running, e.g. with a latency chain of ecx=1 / %rep 2 and store work of n=6, and the outer iteration count bumped up by a factor of 100: total time = 1600M cycles, 2.56 IPC, vs. 1400M cycles with the dep chain shortened even more, bound just on store throughput. I think if loads were disturbing LFBs at all, that would make it much slower. I don't know why it takes 14 cycles for 12 NT stores.

# dep chain: ECX=1 / %rep 2
# stores: NTSTORE_ITERS=6  (12 lines, same numbers of LFBs)
# outer iterations: 100M instead of the 1M in other tests.

 Performance counter stats for './lfb-test' (3 runs):

            410.56 msec task-clock                #    0.999 CPUs utilized            ( +-  0.06% )
                 2      page-faults               #    4.868 /sec                   
     1,600,236,855      cycles                    #    3.895 GHz                      ( +-  0.00% )
     4,100,000,135      instructions              #    2.56  insn per cycle           ( +-  0.00% )
            92,188      exe_activity.bound_on_stores #  224.404 K/sec                    ( +- 54.94% )
       675,039,043      resource_stalls.sb        #    1.643 G/sec                    ( +-  0.01% )

So to occupy all the LFBs for most of the time, we should be using at least 20 cache lines; might as well go for 32 (n=16). That's short enough not to cause branch misses, or to fill up the store buffer or scheduler if we give it time to drain in between, but long enough to be way more than the number of LFBs, so we certainly have lots of cycles where they're all occupied.

IDK if it's just a coincidence of core clock and memory clock, but that n=16 (32 NT stores) case takes almost exactly half the time of the load / ALU dep chain I created. With 1M outer iterations doing 32 NT stores each, that's about 1602 cycles per 32 NT stores, or 50 cycles per partial-line NT store in terms of throughput cost. They execute on the execution units at 1/clock, so a burst of 32 of them can get into the store buffer really quickly compared to how long it takes one to commit.

(Of course, there are buffers at other levels of the cache hierarchy, like the "superqueue" between L2 and the ring bus. So when NT stores come in bursts, the first of them can likely hand off faster than that. Except an LFB won't even try to hand off until it's being evicted as a partial-line write.)

Anyway, n=16 for 32 cache lines touched takes half the time of the ALU dep chain, when doing just the stores. And it's bursty enough that it's almost certainly occupying all the LFBs for a decent fraction of that 50% "duty cycle" of store bursts.

Certainly they'd be occupied for well more than the couple-percent slowdown we see when we do this in parallel with the load/imul chain. That dep chain needs to complete a load every 8 cycles, and can't "catch up" in bursts. Any time a load address is ready but the load doesn't execute that cycle, throughput is lost and can't be recovered, because that's how critical-path latency bottlenecks work.

Unless the CPU reserves an LFB for loads, if they somehow need one. I think that's unlikely.

Reducing the ALU dep chain so it also takes 1600M cycles on its own, the same as the store-throughput bottleneck with n=16, the two still overlap perfectly when combined. This presumably needs all the LFBs to maintain that store throughput, which is pretty solid evidence that they're independent.

Matched bottlenecks: Latency and store-throughput overlap near perfectly

# dep chain iters = 10  x  %rep 20         - alone takes 1.6G cycles
# NTSTORE_ITERS=16                         - alone takes 1.602G cycles
#                                        together taking 1.621G cycles
$ t=lfb-test; asm-link -dn "$t".asm -DNTSTORE_ITERS=16 && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,br_misp_retired.all_branches_pebs,int_misc.recovery_cycles_any -r3 ./"$t"

            416.10 msec task-clock                #    0.997 CPUs utilized            ( +-  0.15% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 2      page-faults               #    4.797 /sec                   
     1,621,634,284      cycles                    #    3.890 GHz                      ( +-  0.02% )
       505,000,135      instructions              #    0.31  insn per cycle           ( +-  0.00% )
           575,600      exe_activity.bound_on_stores #    1.381 M/sec                    ( +- 75.50% )
         1,298,930      resource_stalls.sb        #    3.116 M/sec                    ( +- 47.96% )
             1,376      br_misp_retired.all_branches_pebs #    3.301 K/sec                    ( +-113.51% )
            94,101      int_misc.recovery_cycles_any #  225.719 K/sec                    ( +-256.14% )

          0.417209 +- 0.000704 seconds time elapsed  ( +-  0.17% )

With the inner iterations twice as long, so each part executes about 3200M cycles on its own (just load/imul or just stores), -DNTSTORE_ITERS=29 is fine, still 3289M cycles. And n=31 gives 3565M cycles. But bumping up to n=32 (64 cache lines) makes performance fall off a cliff: 4920M cycles. I don't know what causes this; maybe some kind of ROB-size or store-buffer limit? exe_activity.bound_on_stores and resource_stalls.sb didn't go up dramatically.


NASM source for Linux static executable

Build with nasm -felf64 lfb-test.asm -DNTSTORE_ITERS=16 && ld -o lfb-test lfb-test.o

The count constants in this source are what I used for the final test, which showed near-perfect overlap, with the dep chain and the store throughput each taking 1600 cycles per outer iteration. Earlier perf experiments were from versions with %rep 40 for the dep chain, or with mov ecx,100 and %rep 8 for the first perf output showing 6,421,016,156 cycles.

global _start
_start:
 and     rsp, -4096         ; help avoid 4k aliasing between load chain and stores
 mov     rax, rsp           ; do our pointer chasing far from the buffer, overwriting argc
 mov    [rax], rax
 vpaddd  ymm0, ymm1, ymm2     ; sometimes unwritten registers can be weird

 mov ebp, 1000000    ; outer repeat count

.loop:
  mov   ecx, 10       ; low iter count to avoid a mispredict
  .inner:
  %rep 20             ; unroll 20x (5+3 cycles) = 160 cycle dep chain
   mov  rax, [rax]
   imul rax, rax, 1   ; lengthen the dep chain without memory access.  And defeat the special case load latency thing in some Intel CPUs so it's always 5 cycles
  %endrep
   dec  ecx
   jnz  .inner

%ifndef NTSTORE_ITERS
%define NTSTORE_ITERS 16
%endif
  mov  ecx, NTSTORE_ITERS
  lea  rdi, [rel buf+64]            ; start at 2nd cache line of the page to avoid 4k aliasing unless we go really far
  .store_burst:                     ; 16 x2 cache lines of NT stores
   vmovntdq [rdi+ 0], ymm0        
   ;vmovntdq [rdi+32], ymm0          ; partial line NT stores take much longer to release their LFB, so we get more consumption for fewer uops
   vmovntdq [rdi+64], ymm0
   ;vmovntdq [rdi+96], ymm0
   add  rdi, 128
   dec  rcx
   jnz  .store_burst

 dec ebp
 jnz .loop


 mov  eax, 231       ; Linux _NR_exit_group
 xor  edi, edi
 syscall             ; _exit(0)

section .bss
 align 4096
 buf: resb 128 * 4096

I probably didn't need to use AVX2; a legacy SSE movntdq [rdi+64], xmm0 would have worked just as well, writing the first 16 instead of 32 bytes of a cache line.
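That is, the burst loop could have used 16-byte stores with the same partial-line effect (a sketch):

  .store_burst:
   movntdq [rdi+ 0], xmm0         ; first 16 bytes of one cache line
   movntdq [rdi+64], xmm0         ; first 16 bytes of the next line
   add  rdi, 128
   dec  rcx
   jnz  .store_burst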


Useful perf counter events (descriptions from perf list)

  • exe_activity.bound_on_stores - [Cycles where the Store Buffer was full and no outstanding load].
    If the CPU catches up on the load chain while the store buffer is full, we'll get counts for this, provided there's room for the front-end to issue more loads/imuls after getting back to that part of the loop.

  • resource_stalls.sb - [Cycles stalled due to no store buffers available. (not including draining form sync)]
    Counts, I think, when the front-end can't alloc/rename a store because there aren't any store buffer entries left. (Yes, those are allocated during issue/rename, not when the store executes. That, I think, implies that even a misaligned store only uses one store buffer entry, with the extra handling happening during the TLB check and when committing to cache.)

  • ld_blocks_partial.address_alias - [False dependencies in MOB due to partial compare on address]
    This is the 4k aliasing that I wanted to avoid as a confounding factor.

  • br_misp_retired.all_branches - [All mispredicted macro branch instructions retired]
    Counts how many branch instructions were mispredicted.

  • int_misc.recovery_cycles_any - [Core cycles the allocator was stalled due to recovery from earlier clear event for any thread running on the physical core (e.g. misprediction or memory nuke)]
    Counts the front-end penalty for branch misses (and any other stalls); as long as it's low, it's not the reason for anything running slow.
