How to calculate L3 cache bandwidth using performance counters on Linux?
I am trying to use Linux perf to profile the L3 cache bandwidth for a Python script. I see that there are no commands available to measure that directly, but I know how to get the LLC performance counters using the command below. Can anyone tell me how to calculate the L3 cache bandwidth from the perf counters, or point me to any tools that can measure L3 cache bandwidth? Thanks in advance for the help.
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches python hello.py
Update: perf has changed. Now you want perf stat with -M tma_info_memory_core_l3_cache_access_bw for L3 bandwidth, or -M tma_info_memory_core_l3_cache_fill_bw for DRAM bandwidth (L3 fill = misses, I think?).

Or better, -M tma_info_system_dram_bw_use should be more accurate, but it only works system-wide:
perf stat -a -M tma_info_system_dram_bw_use -e task-clock,page-faults,cycles,instructions

It seems they measure total read+write bandwidth, and I think "access" bandwidth might be counting reads+writes from the cores plus dirty write-back to DRAM.

I tested with the code from "There is a huge speed difference between reading and writing in DRAM, is this normal?" (with write before read to avoid CoW mapping to the same physical page of zeros) and with EPP = performance to avoid downclocking. Actually, I commented out read so the process would spend its whole time in the write test, allowing easy use of perf. I measured 22.84 tma_info_memory_core_l3_cache_fill_bw during the write test, while intel_gpu_top showed peaks of 14+ GB/s read + 14+ GB/s write, with a lower average once startup is included. And 37.36 tma_info_memory_core_l3_cache_access_bw during the same test (both metric groups active in the same perf run). 29.11 tma_info_system_dram_bw_use seems more like the sum of DRAM read+write bandwidths, so I'd trust that. (All the numbers in this paragraph came from the same run, and run-to-run is fairly consistent, within +-0.5 GB/s.)

There should be negligible L3 hits during that test, and the rest of my system was idle, like 200 MiB/s read and 25 MiB/s write according to intel_gpu_top, which measures at the DRAM controllers.

According to perf list on my Skylake, those metrics report average per-core data access or fill bandwidth in GB/s. (So not counting instruction fetch, and maybe only reads?) I'm not 100% sure exactly what these counters measure, but the metric groups described in my old answer below don't exist anymore. I have perf 6.5 at the moment.

perf stat has some named "metrics" that it knows how to calculate from other things. According to perf list on my system, those include L3_Cache_Access_BW and L3_Cache_Fill_BW.

This is from my system with a Skylake (i7-6700k). Other CPUs (especially from other vendors and architectures) might have different support for it, or, IDK, might not support these metrics at all.
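Since the question is about profiling a Python script, here is a minimal sketch of driving perf stat from Python with one of the metric names mentioned above. The helper names perf_stat_argv and run_perf_metric are made up for illustration, and whether a given metric name exists depends on your CPU and perf version, so check perf list first.

```python
import subprocess


def perf_stat_argv(metric, command, system_wide=False):
    """Build the argv list for: perf stat [-a] -M <metric> -- <command...>"""
    argv = ["perf", "stat"]
    if system_wide:
        argv.append("-a")  # needed for metrics that only work system-wide
    argv += ["-M", metric, "--", *command]
    return argv


def run_perf_metric(metric, command, system_wide=False):
    """Run the workload under perf stat and return perf's report as text.

    perf stat prints its report on stderr, so that stream is captured
    while the workload's own stdout passes through untouched.
    """
    result = subprocess.run(
        perf_stat_argv(metric, command, system_wide),
        stderr=subprocess.PIPE,
        text=True,
        check=False,
    )
    return result.stderr
```

For example, run_perf_metric("tma_info_memory_core_l3_cache_access_bw", ["python", "hello.py"]) would return the report text on a machine where perf knows that metric; on older perf versions the name would have to be swapped for L3_Cache_Access_BW.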
I tried it out for a simplistic sieve of Eratosthenes (using a bool array, not a bitmap), from a recent codereview question since I had a benchmarkable version of that (with a repeat loop) lying around. It measured 52 GB/s total bandwidth (read+write I think).
The n=4000000 problem size I used thus consumes 4 MB total, which is larger than the 256 KiB L2 size but smaller than the 8 MiB L3 size.
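The sizing argument can be checked with a few lines of arithmetic. This is only a sketch: it assumes one byte per bool element and uses just the L2 and L3 sizes quoted above; the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

# Cache sizes quoted above for the i7-6700k, smallest first.
CACHE_SIZES = [("L2", 256 * KIB), ("L3", 8 * MIB)]


def smallest_fitting_level(working_set_bytes, caches=CACHE_SIZES):
    """Return the name of the smallest cache level that holds the working set."""
    for name, size in caches:
        if working_set_bytes <= size:
            return name
    return "DRAM"  # larger than every listed cache


n = 4_000_000
working_set = n * 1  # assumption: the bool array uses one byte per element
print(smallest_fitting_level(working_set))  # prints "L3"
```

A working set that lands in L3 is what makes the measured number mostly an L3 bandwidth rather than a DRAM bandwidth.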
Or with just -M L3_Cache_Access_BW and no -e events, it just shows offcore_requests.all_requests # 54.52 L3_Cache_Access_BW and duration_time. So it overrides the default and doesn't count cycles,instructions and so on.

I think it's just counting all off-core requests by this core, assuming (correctly) that each one involves a 64-byte transfer. It's counted whether it hits or misses in L3 cache. Getting mostly L3 hits will obviously enable a higher bandwidth than if the uncore bottlenecks on the DRAM controllers instead.
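If the metric really is just off-core requests times 64 bytes over elapsed time (my reading of the output above, not something I have verified against perf's metric definitions), the calculation can be reproduced by hand from a raw offcore_requests.all_requests count. A minimal sketch, with hypothetical function names:

```python
CACHE_LINE_BYTES = 64  # assumed transfer size per off-core request


def l3_access_bw_gb_per_s(offcore_requests, duration_seconds,
                          line_bytes=CACHE_LINE_BYTES):
    """Bandwidth in GB/s, assuming each request moves one cache line."""
    return offcore_requests * line_bytes / 1e9 / duration_seconds


def requests_per_second(bw_gb_per_s, line_bytes=CACHE_LINE_BYTES):
    """Inverse: request rate implied by a reported bandwidth in GB/s."""
    return bw_gb_per_s * 1e9 / line_bytes


# Illustrative input: one billion requests in one second is 64 GB/s.
print(l3_access_bw_gb_per_s(1_000_000_000, 1.0))

# The 54.52 GB/s reported above would imply roughly 852 million requests/s.
print(round(requests_per_second(54.52)))
```

The same arithmetic works for counts collected with plain -e offcore_requests.all_requests on CPUs that expose that event, which is useful when the named metric is missing.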