How to clear the cache when measuring a function's performance

Posted 2024-09-30 06:12:29 · 302 words · 2 views · 0 comments

The CPU cache always interferes when we measure the performance of a piece of code.

gettime();
func1();
gettime();

gettime();
func2();
gettime();
// func2 appears faster because of the cache (or because func1() already took the page faults).
// But we often misread the result.

When you measure the performance of your code, how do you remove the cache's influence?

I'm looking for functions or techniques that do this on Windows.
Please share your tips. Thanks.

高速公鹿 2024-10-07 06:12:29

One thing you can do is call a function that contains a lot of code and touches a lot of memory between calls to the function you are profiling. For example, in pseudocode (to stay mostly language neutral):

// loop some number of times
{
  //start timing
  profile_func();
  //stop timing
  //add to total time
  large_func(); // Uses lots of memory and has lots of code
}
// Compute the time of profile_func by dividing the total time by the number of iterations

The code in large_func() can be nonsense, such as some set of operations repeated over and over. The key is that neither the call nor its body gets optimized away at compile time, so that it actually evicts the CPU's instruction and data caches (and the L2 and L3 caches too, if present).
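As a rough illustration of the data side of such a function, here is a minimal C++ sketch. The name evict_data_cache, the 64 MiB default buffer size, and the 64-byte stride are all assumptions rather than anything from the question; the buffer only needs to be comfortably larger than the last-level cache of the machine under test. It does nothing about the instruction cache, which would need a large body of code as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: walks a buffer that is assumed to be larger than the
// last-level cache, so that previously cached data is likely to be evicted.
// Returns the sum of the touched bytes so the work has an observable result.
std::uint64_t evict_data_cache(std::size_t bytes = 64u * 1024u * 1024u) {
    assert(bytes > 0);
    static std::vector<std::uint8_t> buffer;
    if (buffer.size() != bytes) buffer.assign(bytes, 1);

    // Access through a volatile pointer so the compiler cannot drop the
    // loop as dead code.
    volatile std::uint8_t* p = buffer.data();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < bytes; i += 64) {  // 64 = assumed cache-line size
        p[i] = static_cast<std::uint8_t>(p[i] + 1);
        sum += p[i];
    }
    return sum;
}
```

Calling evict_data_cache() between timed runs is only a best-effort approach: the hardware prefetcher and the cache replacement policy decide what is actually evicted, so treat the result as "probably cold" rather than guaranteed.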

This is an important test in many cases. Small, fast functions that are profiled in isolation can run very quickly because they benefit from warm CPU caches, inlining, and enregistration (keeping values in registers). In large applications, however, these advantages are often absent because of the context in which the functions are called.

For example, profiling a function by running it a million times in a tight loop might show that it executes in, say, 50 nanoseconds. Run it with the framework shown above, and its running time can increase drastically, into the microsecond range, because it no longer has the entire processor (its registers and caches) to itself.
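The comparison described above can be sketched in portable C++ with std::chrono. Everything named here is illustrative: profile_func is a made-up function under test, large_func is a stand-in that only touches a 64 MiB buffer (an assumed size), and average_ns is a hypothetical helper. Unlike the pseudocode, the eviction runs before each timed call so that the first iteration is also cold.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for large_func(): touches one byte per assumed
// 64-byte cache line of a buffer assumed to exceed the last-level cache.
void large_func() {
    static std::vector<std::uint8_t> buffer(64u * 1024u * 1024u, 1);
    volatile std::uint8_t* p = buffer.data();
    for (std::size_t i = 0; i < buffer.size(); i += 64) {
        p[i] = static_cast<std::uint8_t>(p[i] + 1);
    }
}

// Hypothetical function under test: sums a small table that fits in cache.
std::uint64_t profile_func() {
    static std::vector<std::uint32_t> table(4096, 3);
    std::uint64_t sum = 0;
    for (std::uint32_t v : table) sum += v;
    return sum;
}

// Returns the average time of one profile_func() call in nanoseconds.
// Only the call itself is timed; the eviction step is excluded.
double average_ns(int iterations, bool evict_between_calls) {
    assert(iterations > 0);
    volatile std::uint64_t sink = 0;  // keeps the result observable
    Clock::duration total = Clock::duration::zero();
    for (int i = 0; i < iterations; ++i) {
        if (evict_between_calls) large_func();
        const auto start = Clock::now();
        sink = profile_func();
        const auto stop = Clock::now();
        total += stop - start;
    }
    (void)sink;
    // Total time divided by the number of iterations.
    return std::chrono::duration<double, std::nano>(total).count() / iterations;
}
```

Comparing average_ns(1000, false) with average_ns(1000, true) gives a hot-cache and a cold-cache figure for the same function. The size of the gap depends on the machine, the compiler settings, and the resolution of the clock, so no particular numbers are implied here.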
