Does this code fill up the CPU cache?

Published 2024-09-10 13:36:44


I have two ways to program the same functionality.

Method 1:

void doTheWork(int action)
{
    for(int i = 0; i < 1000000000; ++i)
    {
        doAction(action);
    }
}

Method 2:

void doTheWork(int action)
{
    switch(action)
    {
    case 1:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1>();
        }
        break;
    case 2:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<2>();
        }
        break;
    //-----------------------------------------------
    //... (there are 1000000 cases here)
    //-----------------------------------------------
    case 1000000:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1000000>();
        }
        break;
    }
}

Let's assume that the function doAction(int action) and the function template<int Action> doAction() consist of about 10 lines of code that will get inlined at compile time. Calling doAction(#) is equivalent in functionality to doAction<#>(), but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since some nice optimizations can be done in the code when the argument value is known at compile time.
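As a rough illustration of that premise (the doAction bodies here are hypothetical, not the asker's actual code): when the value arrives as a template parameter, the compiler can constant-fold it; the runtime-argument version must compute with an unknown value.

```cpp
#include <cstdint>

// Runtime version: `action` is only known at run time, so the
// compiler must emit a generic multiply/add for any value.
inline std::uint64_t doActionRuntime(int action, std::uint64_t x)
{
    return x * static_cast<std::uint64_t>(action) + action;
}

// Templated version: `Action` is a compile-time constant, so the
// same expression can be constant-folded or strength-reduced
// (e.g. a multiply by a power of two becomes a shift).
template <int Action>
inline std::uint64_t doActionTemplate(std::uint64_t x)
{
    return x * Action + Action;
}
```

Both calls compute the same result; only the generated code differs.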

So my question is: in the case of the templated function, do all the millions of lines of code fill the CPU L1 cache (and more), thus degrading performance considerably, or do only the lines of doAction<#>() inside the loop currently being run get cached?


Comments (2)

东北女汉子 2024-09-17 13:36:44


It depends on the actual code size - 10 lines of code can be little or much - and of course on the actual machine.

However, Method 2 badly violates a decades-old rule of thumb: instructions are cheap, memory access is not.

Scalability limit

Your optimizations are usually linear - you might shave off 10, 20, maybe even 30% of execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.

As soon as your code size significantly exceeds the L2/L3 cache size, Method 2 will lose big time, as the following estimate for a high-end consumer system shows:

  • DDR3-1333 with 10667MB/s peak memory bandwidth,
  • Intel Core i7 Extreme with ~75000 MIPS

gives you 10667 MB / 75000 M = 0.14 bytes per instruction as the break-even point - anything larger, and main memory can't keep up with the CPU.
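The answer's back-of-the-envelope arithmetic can be reproduced directly (the constants are the answer's own figures, not measurements):

```cpp
// Break-even estimate: peak DRAM bandwidth divided by instruction
// throughput gives the largest average instruction size that main
// memory could stream fast enough to feed the core.
constexpr double dramBandwidthMB = 10667.0;   // DDR3-1333 peak, MB/s
constexpr double throughputMips  = 75000.0;   // ~75000 MIPS (i7 Extreme)
constexpr double bytesPerInstr   = dramBandwidthMB / throughputMips;
// bytesPerInstr is about 0.14; typical 2..3-byte x86 instructions are
// roughly 15-20x larger, so code streamed from DRAM cannot keep up.
```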

Typical x86 instructions are 2..3 bytes in size and execute in 1..2 cycles (granted, these aren't necessarily the same instructions, since x86 instructions are decoded into micro-ops. Still...)
Typical x64 instruction lengths are even larger.

How much does your cache help?
I found the following numbers (different sources, so they're hard to compare):
an i7 Nehalem L2 cache (256K, >200 GB/s bandwidth) could almost keep up with x86 instruction fetch, but probably not with x64.

In addition, your L2 cache will kick in completely only if

  • you have perfect prediction of the next instructions, or you have no first-run penalty and the code fits the cache completely,
  • there's no significant amount of data being processed,
  • there's no significant other code in your "inner loop",
  • there's no other thread executing on this core.

Given that, you can lose much earlier, especially on a CPU/board with smaller caches.

︶ ̄淡然 2024-09-17 13:36:44


The L1 instruction cache will only contain instructions which were fetched recently or in anticipation of near future execution. As such, the second method cannot fill the L1 cache simply because the code is there. Your execution path will cause it to load the template instantiated version that represents the current loop being run. As you move to the next loop, it will generally invalidate the least recently used (LRU) cache line and replace it with what you are executing next.

In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.
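A small-scale sketch of Method 2 (three hypothetical cases instead of a million) makes this concrete: the switch selects a branch once, and the hot loop then executes only that single instantiation's few lines, which is all the L1 instruction cache needs to hold.

```cpp
#include <cstdint>

// Hypothetical stand-in for the question's doAction<N>().
template <int Action>
std::uint64_t doAction(std::uint64_t acc) { return acc + Action; }

// Method 2 at toy scale: the dispatch happens once, outside the
// loop, so each hot loop is monomorphic - only one instantiation's
// code is fetched repeatedly while it runs.
std::uint64_t doTheWork(int action, int iterations)
{
    std::uint64_t acc = 0;
    switch (action)
    {
    case 1: for (int i = 0; i < iterations; ++i) acc = doAction<1>(acc); break;
    case 2: for (int i = 0; i < iterations; ++i) acc = doAction<2>(acc); break;
    case 3: for (int i = 0; i < iterations; ++i) acc = doAction<3>(acc); break;
    }
    return acc;
}
```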
