Linux C++: how to profile time wasted due to cache misses?
I know that I can use gprof to benchmark my code.
However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a proxy object).
As a result, I have this extra layer that affects pretty much all functions and screws with caching.
Is there a way to measure the time my CPU wastes due to cache misses?
9 Answers
Linux supports this with perf from 2.6.31 on. This allows you to do the following: record with perf record -e LLC-loads,LLC-load-misses yourExecutable, then run perf report, select one of the LLC-load-misses lines, and annotate. You should see the lines (in assembly code, surrounded by the original source code) together with a number indicating, for each line where cache misses occurred, what fraction of the last-level cache misses it accounts for.
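To see this workflow in action, here is a small self-contained C++ test program (my own illustration, not from the answer above) that chases pointers through a random single-cycle permutation, so nearly every load is a dependent access to an unpredictable address and should show up prominently under LLC-load-misses:

    // cache_miss_demo.cpp
    // Build: g++ -O2 -g cache_miss_demo.cpp -o demo
    // Run:   perf record -e LLC-loads,LLC-load-misses ./demo && perf report
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t(1) << 24;   // 128 MiB of indices, far bigger than any LLC
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t(0));

        // Sattolo's algorithm: a permutation consisting of a single cycle,
        // so the chase below visits all n slots in random order.
        std::mt19937_64 rng(42);
        for (std::size_t k = n - 1; k > 0; --k) {
            std::uniform_int_distribution<std::size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }

        std::size_t i = 0, sum = 0;
        for (std::size_t step = 0; step < n; ++step) {
            i = next[i];          // dependent load at a random address: likely a miss
            sum += i;
        }
        std::printf("%zu\n", sum); // keep the loop from being optimized away
    }

Build with -g so that perf report can map the samples back to source lines when you annotate.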
You could try cachegrind and its front-end kcachegrind.
You could find a tool that accesses the CPU performance counters. There is probably a register in each core that counts L1, L2, etc. misses. Alternatively, Cachegrind performs a cycle-by-cycle simulation.
However, I don't think that would be insightful. Your proxy objects are presumably modified by their own methods. A conventional profiler will tell you how much time those methods are taking. No profile tool would tell you how performance would improve without that source of cache pollution. That's a matter of reducing the size and structure of the program's working set, which isn't easy to extrapolate.
A quick Google search turned up boost::intrusive_ptr, which might interest you. It doesn't appear to support something like weak_ptr, but converting your program might be trivial, and then you would know for sure the cost of the non-intrusive ref counts.
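If you go that route, the conversion is mostly mechanical. A minimal sketch (my example, not from this answer) of an intrusively counted class: the count lives inside the object, so there is no separate control block to miss on:

    #include <boost/intrusive_ptr.hpp>
    #include <atomic>

    class Obj {
        mutable std::atomic<int> refs_{0};
        // Found by argument-dependent lookup when intrusive_ptr copies/releases.
        friend void intrusive_ptr_add_ref(const Obj* p) {
            p->refs_.fetch_add(1, std::memory_order_relaxed);
        }
        friend void intrusive_ptr_release(const Obj* p) {
            if (p->refs_.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete p;
        }
    public:
        virtual ~Obj() = default;
    };

    // Usage: boost::intrusive_ptr<Obj> p(new Obj);
    // One indirection to reach the object, no separate count allocation.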
Continuing along the lines of @Mike_Dunlavey's answer:
First, obtain a time-based profile using your favorite tool: VTune or PTU or OProf.
Then, obtain a cache miss profile. L1 cache misses, or L2 cache misses, or ...
I.e. the first profile associates a "time spent" with each program counter.
The second associates a "number of cache misses" value with each program counter.
Note: I often "reduce" the data, summing it up by function, or (if I have the technology) by loop. Or by bins of, say, 64 bytes. Comparing individual program counters is often not useful, because the performance counters are fuzzy - the place where you see a cache miss get reported is often several instructions different from where it actually happened.
OK, so now graph these two profiles to compare them. Here are some graphs that I find useful:
"Iceberg" charts: X axis is PC, positive Y axis is time, negative Y access is cache misses. Look for places that go both up and down.
("Interleaved" charts are also useful: same idea, X axis is PC, plot both time and cache misseson Y axis, but with narrow vertical lines of different colors, typically red and blue. Places where is a lot of both time and cache misses spent will have finely interleaved red and blue lines, almost looking purple. This extends to L2 and L3 cache misses, all on the same graph. By the way, you probably want to "normalize" the numbers, either to %age of total time or cache misses, or, even better, %age of the maximum data point of time or cache misses. If you get the scale wrong, you won't see anything.)
XY charts: for each sampling bin (PC, or function, or loop, or...) plot a point whose X coordinate is the normalized time, and whose Y coordinate is the normalized cache misses. If you get a lot of data points in the upper right hand corner - large %age time AND large %age cache misses - that is interesting evidence. Or, forget number of points - if the sum of all percentages in the upper corner is big...
Note, unfortunately, that you often have to roll these analyses yourself. Last I checked VTune does not do it for you. I have used gnuplot and Excel. (Warning: Excel dies above 64 thousand data points.)
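For example, here is a rough sketch of such a reduction in C++ (the input format is made up for illustration; every profiler has its own export format). It reduces samples to 64-byte bins, as suggested above, and emits normalized (time, misses) points suitable for a gnuplot scatter plot:

    // profile_xy.cpp -- roll-your-own XY chart data, as described above.
    // Hypothetical input on stdin: one sample per line, "<hex-pc> <time> <misses>".
    // Output: "<%time> <%misses>" per 64-byte bin.
    #include <cstdio>
    #include <map>
    #include <utility>

    int main() {
        std::map<unsigned long long, std::pair<double, double>> bins; // bin -> (time, misses)
        unsigned long long pc;
        double t, m, tsum = 0, msum = 0;
        while (std::scanf("%llx %lf %lf", &pc, &t, &m) == 3) {
            auto& b = bins[pc / 64];   // reduce to 64-byte bins
            b.first  += t;  b.second += m;
            tsum     += t;  msum     += m;
        }
        if (tsum == 0 || msum == 0) return 1; // nothing to plot
        for (const auto& kv : bins)           // points in the upper right corner
            std::printf("%f %f\n",            // are the interesting evidence
                        100.0 * kv.second.first / tsum,
                        100.0 * kv.second.second / msum);
    }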
More advice:
If your smart pointer is inlined, you may get the counts all over the place. In an ideal world you would be able to trace back PCs to the original line of source code. In this case, you may want to defer the reduction a bit: look at all individual PCs; map them back to lines of source code; and then map those into the original function. Many compilers, e.g. GCC, have symbol table options that allow you to do this.
By the way, I suspect that your problem is NOT with the smart pointer causing cache thrashing. Unless you are doing smart_ptr<int> all over the place. If you are doing smart_ptr<Obj>, and sizeof(Obj) is greater than, say, 4*sizeof(Obj*) (and if the smart_ptr itself is not huge), then it is not that much.
More likely it is the extra level of indirection that the smart pointer does that is causing your problem.
Coincidentally, I was talking to a guy at lunch who had a reference counted smart pointer that was using a handle, i.e. a level of indirection, something like
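A sketch of that handle-based design (my reconstruction; the names are illustrative, not from any particular library):

    // Every dereference is this->handle->ptr: two dependent loads.
    template <typename Obj>
    struct Handle {
        Obj* ptr;
        int  refcount;
    };

    template <typename Obj>
    class HandlePtr {
        Handle<Obj>* handle;
    public:
        explicit HandlePtr(Obj* p) : handle(new Handle<Obj>{p, 1}) {}
        Obj* operator->() const { return handle->ptr; }  // double indirection
        // (copy/assign/destructor adjusting refcount omitted for brevity)
    };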
(I wouldn't code it this way, but it serves for exposition.)
The double indirection this->handle->ptr can be a big performance problem. Or even a triple indirection, this->handle->ptr->field. At the least, on a machine with 5-cycle L1 cache hits, each this->handle->ptr->field would take 10 cycles, and be much harder to overlap than a single pointer chase. But, worse, if each is an L1 cache miss, even if it were only 20 cycles to the L2... well, it is much harder to hide 2*20=40 cycles of cache miss latency than a single L1 miss.
In general, it is good advice to avoid levels of indirection in smart pointers. Instead of having every smart pointer point to a handle, which itself points to the object, you might make the smart pointer bigger by having it point to the object as well as the handle. (Which is then no longer what is commonly called a handle, but is more like an info object.)
E.g.
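A sketch of that fatter pointer (again my reconstruction, with illustrative names):

    // The smart pointer carries both pointers: dereference uses the direct
    // object pointer (one load); the shared info block is only touched on
    // copy/destroy, when the refcount changes.
    template <typename Obj>
    struct InfoObj {
        Obj* ptr;       // kept so the last owner knows what to delete
        int  refcount;
    };

    template <typename Obj>
    class FatPtr {
        Obj*          obj;   // direct pointer: fast dereference
        InfoObj<Obj>* info;  // shared refcount block
    public:
        explicit FatPtr(Obj* p) : obj(p), info(new InfoObj<Obj>{p, 1}) {}
        Obj* operator->() const { return obj; }  // single indirection
        // (copy/assign/destructor adjusting info->refcount omitted)
    };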
Anyway - a time profile is your best friend.
Oh, yeah - Intel EMON hardware can also tell you how many cycles you waited at a PC. That can distinguish a large number of L1 misses from a small number of L2 misses.
It depends on what OS and CPU you are using. E.g. for Mac OS X and x86 or ppc, Shark will do cache miss profiling. Ditto for Zoom on Linux.
If you're running an AMD processor, you can get CodeAnalyst, apparently free as in beer.
My advice would be to use PTU (Performance Tuning Utility) from Intel.
This utility is the direct descendant of VTune and provides the best sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of the available hardware events), without slowing down your application or perturbing the profile.
And of course you'll be able to gather all cache line misses events you are looking for.
Another tool for CPU performance counter-based profiling is oprofile. You can view its results using kcachegrind.
Here's kind of a general answer.
For example, if your program is spending, say, 50% of its time in cache misses, then 50% of the time when you pause it, the program counter will be at the exact locations where it is waiting for the memory fetches that are causing the cache misses.