Interpreting Cachegrind output

Posted 2024-09-30 09:56:23

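This is part of the Cachegrind output. This piece of code has been executed 1224 times. elmg1 is an array of unsigned long of size 16 x 20. My machine's L1 cache is 32 KB with 64-byte cache lines, 8-way set associative.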

                                              Ir    I1mr    ILmr      Dr    D1mr    DLmr      Dw    D1mw    DLmw
for (i = 0; i < 20; i++)                  78,336   2,448       2  50,184       0       0   1,224       0       0
{
telm01 = elmg1[i];                       146,880       0       0  73,440       0       0  24,480       0       0
telm31 = (telm01 << 3) ^ val1;            97,920       0       0  48,960       0       0  24,480       0       0
telm21 = (telm01 << 2) ^ (val1 >> 1);    146,880   1,224       1  48,960       0       0  24,480       0       0
telm11 = (telm01 << 1) ^ (val1 >> 2);    146,880       0       0  48,960       0       0  24,480       0       0
}

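A. The reason I put this here is that in the third line inside the for loop I see a number of I1 misses (and one L2 miss). This is a bit confusing, and I cannot guess the reason.

B. I am trying to optimize (for time) a piece of code. The above is just a small snippet. I think memory accesses cost me a lot in my program. As in the example above, elmg1 is an array of unsigned long of size 16 x 20. Whenever I use it in the code there are always some misses, and variables like this occur a lot in my program. Any suggestions?

C. I need to allocate and (sometimes) initialize these unsigned longs. Which should I prefer: calloc, or an array declaration followed by explicit initialization? By the way, will there be any difference in the way the cache handles them?

Thanks.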

另类 2024-10-07 09:56:23

Have you tried to unroll the loop?

I wouldn't worry about L1 misses right now. Also, one L2 miss in 1224 runs is fine; the CPU has to load the values into the cache at some point.
What percentage of the program's L2 misses does this code account for?
Use calloc(). If the array size is always the same and you use constants for the size, the compiler can optimize the zeroing of the array. Also, the only thing that would affect cache line usage is alignment, not how the array was initialized.

Edit: The numbers were hard to read that way, and I read them wrong the first time.

Let's make sure I am reading the numbers right for line 5:

Ir    146,880
I1mr  1,224
ILmr  1
Dr    48,960
D1mr  0
DLmr  0
Dw    24,480
D1mw  0
DLmw  0

The L1 cache is split into two 32 KB caches, one for code (I1) and one for data (D1). ILmr, DLmr and DLmw count misses in the last-level cache (L2 or L3), which is shared by data and instructions.

The large I1mr count is instruction misses, not data misses. This means that the loop code is being evicted from the I1 instruction cache.

I1 misses at lines 1 and 5 total 3,672, which is 3 times 1,224, so each time the loop is run you get 3 I1 cache misses. With 64-byte cache lines, that means your loop code spans 3 cache lines: roughly 128 to 192 bytes if it starts near a line boundary, possibly less if it straddles boundaries. So the I1 misses at line 5 occur because that is where the loop code crosses into the last cache line.

I would recommend using KCachegrind to view the results from Cachegrind.

Edit: More about cache lines.

That loop code doesn't look like it is being called 1224 times in isolation, so there must be other code that is pushing it out of the I1 cache.

Your 32 KB I1 cache is divided into 512 cache lines (64 bytes each). The "8-way set associative" part means that each memory address can be placed in only 8 of those 512 cache lines, so the cache consists of 64 sets of 8 lines. If the whole program you are profiling were one contiguous 32 KB block of memory, it would all fit into the I1 cache and nothing would be evicted. That is most likely not the case, and there will be more than 8 64-byte blocks of code contending for the same 8 cache lines. Let's say your whole program has 1 MB of code (including libraries): then each set of 8 cache lines has about 256 blocks of code contending for it, which is 32 times (1 MB / 32 KB) what the set can hold.

Read this lwn.net article for all the gory details about CPU caches

The compiler can't always detect which functions of the program will be hot spots (called very many times) and which will be cold spots (e.g. error handler code, which almost never runs). GCC has the function attributes hot and cold, which let you mark functions accordingly. This allows the compiler to group the hot functions together in one block of memory for better cache usage (i.e. cold code will not push hot code out of the caches).

Anyway, those I1 misses are really not worth the time to worry about.
