Linux C++: how to profile time wasted due to cache misses?
I know that I can use gprof to benchmark my code.
However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a proxy object).
As a result, I have this extra layer that affects pretty much all functions and screws with caching.
Is there a way to measure the time my CPU wastes due to cache misses?
9 Answers
Linux supports this with perf from 2.6.31 on. This allows you to do the following: record with perf record -e LLC-loads,LLC-load-misses yourExecutable, then run perf report, select one of the LLC-load-misses lines, and annotate. You should see the lines (in assembly code, surrounded by the original source code) together with a number indicating, for each line where cache misses occurred, what fraction of the last-level cache misses it accounts for.
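To see this workflow in action, here is a small self-contained C++ test program (my own illustration, not from the answer above) that chases pointers through a random single-cycle permutation, so nearly every load is a dependent access to an unpredictable address and should show up prominently under LLC-load-misses:

    // cache_miss_demo.cpp
    // Build: g++ -O2 -g cache_miss_demo.cpp -o demo
    // Run:   perf record -e LLC-loads,LLC-load-misses ./demo && perf report
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t(1) << 24;   // 128 MiB of indices, far bigger than any LLC
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t(0));

        // Sattolo's algorithm: a permutation consisting of a single cycle,
        // so the chase below visits all n slots in random order.
        std::mt19937_64 rng(42);
        for (std::size_t k = n - 1; k > 0; --k) {
            std::uniform_int_distribution<std::size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }

        std::size_t i = 0, sum = 0;
        for (std::size_t step = 0; step < n; ++step) {
            i = next[i];          // dependent load at a random address: likely a miss
            sum += i;
        }
        std::printf("%zu\n", sum); // keep the loop from being optimized away
    }

Build with -g so that perf report can map the samples back to source lines when you annotate.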
You could try cachegrind and its front-end kcachegrind.
You could find a tool that accesses the CPU performance counters. There is probably a register in each core that counts L1, L2, etc. misses. Alternatively, Cachegrind performs a cycle-by-cycle simulation.
However, I don't think that would be insightful. Your proxy objects are presumably modified by their own methods. A conventional profiler will tell you how much time those methods are taking. No profile tool would tell you how performance would improve without that source of cache pollution. That's a matter of reducing the size and structure of the program's working set, which isn't easy to extrapolate.
A quick Google search turned up boost::intrusive_ptr, which might interest you. It doesn't appear to support something like weak_ptr, but converting your program might be trivial, and then you would know for sure the cost of the non-intrusive ref counts.
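If you go that route, the conversion is mostly mechanical. A minimal sketch (my example, not from this answer) of an intrusively counted class: the count lives inside the object, so there is no separate control block to miss on:

    #include <boost/intrusive_ptr.hpp>
    #include <atomic>

    class Obj {
        mutable std::atomic<int> refs_{0};
        // Found by argument-dependent lookup when intrusive_ptr copies/releases.
        friend void intrusive_ptr_add_ref(const Obj* p) {
            p->refs_.fetch_add(1, std::memory_order_relaxed);
        }
        friend void intrusive_ptr_release(const Obj* p) {
            if (p->refs_.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete p;
        }
    public:
        virtual ~Obj() = default;
    };

    // Usage: boost::intrusive_ptr<Obj> p(new Obj);
    // One indirection to reach the object, no separate count allocation.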
Continuing along the lines of @Mike_Dunlavey's answer:
First, obtain a time-based profile using your favorite tool: VTune or PTU or OProf.
Then, obtain a cache miss profile. L1 cache misses, or L2 cache misses, or ...
I.e. the first profile associates a "time spent" with each program counter.
The second associates a "number of cache misses" value with each program counter.
Note: I often "reduce" the data, summing it up by function, or (if I have the technology) by loop. Or by bins of, say, 64 bytes. Comparing individual program counters is often not useful, because the performance counters are fuzzy - the place where you see a cache miss get reported is often several instructions different from where it actually happened.
OK, so now graph these two profiles to compare them. Here are some graphs that I find useful:
"Iceberg" charts: X axis is PC, positive Y axis is time, negative Y access is cache misses. Look for places that go both up and down.
("Interleaved" charts are also useful: same idea, X axis is PC, plot both time and cache misseson Y axis, but with narrow vertical lines of different colors, typically red and blue. Places where is a lot of both time and cache misses spent will have finely interleaved red and blue lines, almost looking purple. This extends to L2 and L3 cache misses, all on the same graph. By the way, you probably want to "normalize" the numbers, either to %age of total time or cache misses, or, even better, %age of the maximum data point of time or cache misses. If you get the scale wrong, you won't see anything.)
XY charts: for each sampling bin (PC, or function, or loop, or...) plot a point whose X coordinate is the normalized time, and whose Y coordinate is the normalized cache misses. If you get a lot of data points in the upper right hand corner - large %age time AND large %age cache misses - that is interesting evidence. Or, forget number of points - if the sum of all percentages in the upper corner is big...
Note, unfortunately, that you often have to roll these analyses yourself. Last I checked VTune does not do it for you. I have used gnuplot and Excel. (Warning: Excel dies above 64 thousand data points.)
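For example, here is a rough sketch of such a reduction in C++ (the input format is made up for illustration; every profiler has its own export format). It reduces samples to 64-byte bins, as suggested above, and emits normalized (time, misses) points suitable for a gnuplot scatter plot:

    // profile_xy.cpp -- roll-your-own XY chart data, as described above.
    // Hypothetical input on stdin: one sample per line, "<hex-pc> <time> <misses>".
    // Output: "<%time> <%misses>" per 64-byte bin.
    #include <cstdio>
    #include <map>
    #include <utility>

    int main() {
        std::map<unsigned long long, std::pair<double, double>> bins; // bin -> (time, misses)
        unsigned long long pc;
        double t, m, tsum = 0, msum = 0;
        while (std::scanf("%llx %lf %lf", &pc, &t, &m) == 3) {
            auto& b = bins[pc / 64];   // reduce to 64-byte bins
            b.first  += t;  b.second += m;
            tsum     += t;  msum     += m;
        }
        if (tsum == 0 || msum == 0) return 1; // nothing to plot
        for (const auto& kv : bins)           // points in the upper right corner
            std::printf("%f %f\n",            // are the interesting evidence
                        100.0 * kv.second.first / tsum,
                        100.0 * kv.second.second / msum);
    }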
More advice:
If your smart pointer is inlined, you may get the counts all over the place. In an ideal world you would be able to trace back PCs to the original line of source code. In this case, you may want to defer the reduction a bit: look at all individual PCs; map them back to lines of source code; and then map those into the original function. Many compilers, e.g. GCC, have symbol table options that allow you to do this.
By the way, I suspect that your problem is NOT with the smart pointer causing cache thrashing. Unless you are doing smart_ptr<int> all over the place. If you are doing smart_ptr<Obj>, and sizeof(Obj) is greater than, say, 4*sizeof(Obj*) (and if the smart_ptr itself is not huge), then it is not that much.
More likely it is the extra level of indirection that the smart pointer does that is causing your problem.
Coincidentally, I was talking to a guy at lunch who had a reference counted smart pointer that was using a handle, i.e. a level of indirection, something like
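A sketch of that handle-based design (my reconstruction; the names are illustrative, not from any particular library):

    // Every dereference is this->handle->ptr: two dependent loads.
    template <typename Obj>
    struct Handle {
        Obj* ptr;
        int  refcount;
    };

    template <typename Obj>
    class HandlePtr {
        Handle<Obj>* handle;
    public:
        explicit HandlePtr(Obj* p) : handle(new Handle<Obj>{p, 1}) {}
        Obj* operator->() const { return handle->ptr; }  // double indirection
        // (copy/assign/destructor adjusting refcount omitted for brevity)
    };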
(I wouldn't code it this way, but it serves for exposition.)
The double indirection this->handle->ptr can be a big performance problem. Or even a triple indirection, this->handle->ptr->field. At the least, on a machine with 5-cycle L1 cache hits, each this->handle->ptr->field would take 10 cycles, and be much harder to overlap than a single pointer chase. But, worse, if each is an L1 cache miss, even if it were only 20 cycles to the L2... well, it is much harder to hide 2*20=40 cycles of cache miss latency than a single L1 miss.
In general, it is good advice to avoid levels of indirection in smart pointers. Instead of having every smart pointer point to a handle, which itself points to the object, you might make the smart pointer bigger by having it point to the object as well as the handle. (Which is then no longer what is commonly called a handle, but is more like an info object.)
E.g.
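A sketch of that fatter pointer (again my reconstruction, with illustrative names):

    // The smart pointer carries both pointers: dereference uses the direct
    // object pointer (one load); the shared info block is only touched on
    // copy/destroy, when the refcount changes.
    template <typename Obj>
    struct InfoObj {
        Obj* ptr;       // kept so the last owner knows what to delete
        int  refcount;
    };

    template <typename Obj>
    class FatPtr {
        Obj*          obj;   // direct pointer: fast dereference
        InfoObj<Obj>* info;  // shared refcount block
    public:
        explicit FatPtr(Obj* p) : obj(p), info(new InfoObj<Obj>{p, 1}) {}
        Obj* operator->() const { return obj; }  // single indirection
        // (copy/assign/destructor adjusting info->refcount omitted)
    };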
Anyway - a time profile is your best friend.
Oh, yeah - Intel EMON hardware can also tell you how many cycles you waited at a PC. That can distinguish a large number of L1 misses from a small number of L2 misses.
It depends on what OS and CPU you are using. E.g. for Mac OS X and x86 or ppc, Shark will do cache miss profiling. Ditto for Zoom on Linux.
If you're running an AMD processor, you can get CodeAnalyst, apparently free as in beer.
My advice would be to use PTU (Performance Tuning Utility) from Intel.
This utility is the direct descendant of VTune and provides the best sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of the available hardware events), without slowing down your application or perturbing the profile.
And of course you'll be able to gather all cache line misses events you are looking for.
Another tool for CPU performance counter-based profiling is oprofile. You can view its results using kcachegrind.
Here's kind of a general answer.
For example, if your program is spending, say, 50% of its time in cache misses, then 50% of the time when you pause it, the program counter will be at the exact locations where it is waiting for the memory fetches that are causing the cache misses.