Intel C++ compiler: understanding what optimizations were performed

Published 2024-10-15 23:34:34


I have a code segment which is as simple as:

for( int i = 0; i < n; ++i)
{
  if( data[i] > c && data[i] < r )
  {
    --data[i];
  }
}

It's part of a large function in a larger project. It is actually a rewrite of a different loop that proved time-consuming (long loops), but I was surprised by two things:

When data[i] was temporarily stored like this:

for( int i = 0; i < n; ++i)
{
  const int tmp = data[i];
  if( tmp > c && tmp < r )
  {
    --data[i];
  }
}

It became much slower. I don't claim this should be faster, but I cannot understand why it should be so much slower; the compiler should be able to figure out whether tmp should be used or not.

But more importantly, when I moved the code segment into a separate function it became around four times slower. I wanted to understand what was going on, so I looked in the opt-report; in both cases the loop is vectorized and the same optimizations seem to be applied.

So my question is: what can make such a difference for a function that is not called a million times, but is time-consuming in itself? What should I look for in the opt-report?
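For reference, a sketch of how such an opt-report can be generated with the classic Intel compiler (the file name hotloop.cpp is hypothetical; the -qopt-report flags are the documented icc interface):

```shell
# Verbosity level 5 gives per-loop remarks; restricting the phase to
# vectorization and loop optimizations keeps the report readable.
icc -O3 -qopt-report=5 -qopt-report-phase=vec,loop -c hotloop.cpp
# The report is written to hotloop.optrpt. Compare the two builds for
# remarks such as "LOOP WAS VECTORIZED", the estimated potential speedup,
# and any "peeled"/"remainder" loop entries that differ between versions.
```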

I could avoid it by just keeping it inlined, but the why is bugging me.

UPDATE:

I should underline that my main concern is to understand why it became slower when moved to a separate function. The code example with the tmp variable was just a strange case I encountered along the way.

Comments (2)

小ぇ时光︴ 2024-10-22 23:34:34


You're probably register starved, and the compiler is having to load and store. I'm pretty sure that native x86 assembly instructions can operate on memory addresses directly, i.e., the compiler can keep those registers free. But by making it local, you may be changing the behaviour with respect to aliasing, and the compiler may not be able to prove that the faster version has the same semantics, especially if there is some form of multithreading here that allows the code to change under it.

The function was likely slower when moved into a separate function because function calls can not only break the pipeline but also hurt instruction cache performance (there's extra code for parameter push/pop, etc.).

Lesson: Let the compiler do the optimizing, it's smarter than you. I don't mean that as an insult, it's smarter than me too. But really, especially the Intel compiler, those guys know what they're doing when targeting their own platform.

Edit: More importantly, you need to recognize that compilers are targeted at optimizing unoptimized code. They're not targeted at recognizing half-optimized code. Specifically, the compiler will have a set of triggers for each optimization, and if you happen to write your code in such a way that they're not hit, you can miss out on an optimization even if the code is semantically identical.

And you also need to consider implementation cost. Not every function that is ideal for inlining can be inlined, simply because inlining that logic may be too complex for the compiler to handle. I know that VC++ will rarely inline functions containing loops, even when inlining would yield a benefit. You may be seeing this in the Intel compiler: the compiler writers may simply have decided it wasn't worth the time to implement.

I encountered this when dealing with loops in VC++: the compiler would produce different assembly for two loops written in slightly different formats, even though they both achieved the same result. Of course, their Standard library used the ideal format. You may observe a speedup by using std::for_each and a function object.

眼中杀气 2024-10-22 23:34:34


You're right, the compiler should be able to identify that as unused code and remove it/not compile it. That doesn't mean it actually does identify it and remove it.

Your best bet is to look at the generated assembly and check exactly what is going on. Remember, just because a clever compiler could figure out how to do an optimization doesn't mean it actually does.
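One way to do that inspection (hotloop.cpp is a hypothetical file name; -S is common to icc, gcc, and clang):

```shell
# Emit assembly instead of an object file. With the classic Intel compiler,
# -fsource-asm interleaves the C++ source lines so the loop is easy to find;
# with gcc/clang, -fverbose-asm is a rough equivalent.
icc -O3 -S -fsource-asm hotloop.cpp
# Then inspect hotloop.s for the loop body: an extra load of data[i] (or a
# stack spill of tmp) in one version but not the other would explain the gap.
```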

If you do check and see that the code is not removed, you might want to report that to the Intel compiler team. It sounds like they might have a bug.
