Overhead of calling tiny functions from a tight inner loop? [C++]

Posted on 2024-08-27 09:41:39

Say you see a loop like this one:

for(int i=0;
    i<thing.getParent().getObjectModel().getElements(SOME_TYPE).count();
    ++i)
{
  thing.getData().insert(
    thing.getData().Count(),
    thing.getParent().getObjectModel().getElements(SOME_TYPE)[i].getName()
    );
}

If this were Java I'd probably not think twice. But in performance-critical sections of C++, it makes me want to tinker with it... however, I don't know whether the compiler is smart enough to make that tinkering futile.
This is a made-up example, but all it does is insert strings into a container. Please don't assume any of these are STL types; think in general terms about the following:

  • Is the messy condition in the for loop evaluated on every iteration, or only once?
  • If those get methods simply return references to member variables of the objects, will they be inlined away?
  • Would you expect custom [] operators to get optimized at all?

In other words, is it worth the time (in performance only, not readability) to convert it to something like:

ElementContainer &source = 
   thing.getParent().getObjectModel().getElements(SOME_TYPE);
int num = source.count();
Store &destination = thing.getData();
for(int i=0;i<num;++i)
{
  destination.insert(thing.getData().Count(), source[i].getName());
}

Remember, this is a tight loop, called millions of times a second. What I want to know is whether all this would shave off a couple of cycles per iteration, or something more substantial.


Yes, I know the quote about "premature optimisation", and I know that profiling is important. But this is a more general question about modern compilers, Visual Studio in particular.

Comments (6)

反差帅 2024-09-03 09:41:39

The general way to answer such questions is to look at the generated assembly. With gcc, this involves replacing the -c flag with -S.

My own rule is not to fight the compiler. If something is to be inlined, I make sure that the compiler has all the information it needs to perform the inlining (in practice, that the function body is visible at the call site), and (possibly) I nudge it with an explicit inline keyword, which is only a hint.

Also, inlining saves a few instructions per call but makes the code grow, which can be very bad for performance once the hot code no longer fits in the L1 instruction cache.
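
As a concrete way to try this, here is a minimal sketch of the getter chain from the question, small enough to compile to assembly and read. The class names (Element, ElementContainer, ObjectModel, Parent, Thing) and the function totalNameLength are hypothetical stand-ins, not the asker's real types; the SOME_TYPE argument is dropped, and std::vector and std::string are used only to keep the file self-contained.

```cpp
// getters.cpp: hypothetical stand-ins for the types in the question.
// Compile to assembly instead of an object file, for example:
//   g++ -O2 -S getters.cpp       (GCC writes getters.s)
//   cl /O2 /FA /c getters.cpp    (MSVC writes getters.asm)
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class Element {
public:
    explicit Element(std::string name) : name_(std::move(name)) {}
    const std::string& getName() const { return name_; }
private:
    std::string name_;
};

class ElementContainer {
public:
    void add(const std::string& name) { items_.emplace_back(name); }
    std::size_t count() const { return items_.size(); }
    const Element& operator[](std::size_t i) const { return items_[i]; }
private:
    std::vector<Element> items_;
};

class ObjectModel {
public:
    ElementContainer& getElements() { return elements_; }
private:
    ElementContainer elements_;
};

class Parent {
public:
    ObjectModel& getObjectModel() { return model_; }
private:
    ObjectModel model_;
};

class Thing {
public:
    Parent& getParent() { return parent_; }
private:
    Parent parent_;
};

// The function to look for in the assembly output: the getter chain is
// written out in both the loop condition and the loop body.
std::size_t totalNameLength(Thing& thing) {
    std::size_t total = 0;
    for (std::size_t i = 0;
         i < thing.getParent().getObjectModel().getElements().count();
         ++i) {
        total += thing.getParent().getObjectModel().getElements()[i].getName().size();
    }
    return total;
}
```

Because every getter body is visible in the same file, the optimiser is free to inline them. Whether a given compiler actually does so at a given optimisation level is what the assembly listing will show.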

挽梦忆笙歌 2024-09-03 09:41:39

All the questions you are asking are compiler-specific, so the only sensible answer is "it depends". If it is important to you, you should (as always) look at the code the compiler is emitting and do some timing experiments. Make sure your code is compiled with all optimisations turned on - this can make a big difference for things like operator[](), which is often implemented as an inline function, but which won't be inlined (in GCC at least) unless you turn on optimisation.
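
A timing experiment of the kind suggested here could look roughly like the sketch below. IntBox, Holder, sumLive, sumHoisted and elapsedMicros are all hypothetical names made up for this illustration; the only library facility relied on is std::chrono::steady_clock. Treat the numbers it produces as a rough guide only, since an optimiser may simplify a benchmark this small in ways it would not simplify real code.

```cpp
// timing.cpp: build with optimisation on, for example: g++ -O2 timing.cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical container with a hand-written operator[] and count().
class IntBox {
public:
    explicit IntBox(std::size_t n) : items_(n, 1) {}
    std::size_t count() const { return items_.size(); }
    int operator[](std::size_t i) const { return items_[i]; }
private:
    std::vector<int> items_;
};

class Holder {
public:
    explicit Holder(std::size_t n) : box_(n) {}
    const IntBox& getBox() const { return box_; }
private:
    IntBox box_;
};

// Variant 1: the getter chain is written out in the condition and the body.
long long sumLive(const Holder& h) {
    long long total = 0;
    for (std::size_t i = 0; i < h.getBox().count(); ++i)
        total += h.getBox()[i];
    return total;
}

// Variant 2: the container and its size are hoisted out of the loop.
long long sumHoisted(const Holder& h) {
    const IntBox& box = h.getBox();
    const std::size_t n = box.count();
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += box[i];
    return total;
}

// Calls f(h) `repeats` times and returns the elapsed time in microseconds.
// Results are accumulated into `sink` so the calls are not simply discarded.
template <typename F>
long long elapsedMicros(F f, const Holder& h, int repeats, long long& sink) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        sink += f(h);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

A driver would construct a Holder, call elapsedMicros once with sumLive and once with sumHoisted, and print both times along with sink. Comparing an unoptimised build against an optimised one is a quick way to see how much of the difference comes from inlining.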

但可醉心 2024-09-03 09:41:39

If the loop is that critical, I can only suggest that you look at the generated code. If the compiler is allowed to aggressively optimise the calls away, then perhaps it will not be an issue. Modern compilers can optimise incredibly well, so I really would suggest profiling to find the best solution in your particular case.

段念尘 2024-09-03 09:41:39

If the methods are small and can and will be inlined, then the compiler may do the same optimizations that you have done. So, look at the generated code and compare.

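Edit: It is also important to mark read-only methods as const, e.g. in your example count() and getName() should be const, to state that these methods do not alter the contents of the given object. The compiler enforces that promise, although for optimisation purposes it still mostly relies on being able to see the inlined method bodies.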
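
A minimal sketch of what const-qualified accessors look like, assuming a hypothetical NameList class (not a type from the question) built on std::vector for brevity:

```cpp
#include <cstddef>
#include <string>
#include <vector>

class NameList {
public:
    void add(const std::string& name) { names_.push_back(name); }

    // Read-only accessors are declared const, so they can be called
    // through a const reference and the compiler rejects accidental writes.
    std::size_t count() const { return names_.size(); }
    const std::string& getName(std::size_t i) const { return names_[i]; }

private:
    std::vector<std::string> names_;
};

// Takes the list by const reference: only const members are callable here.
std::size_t longestName(const NameList& list) {
    std::size_t longest = 0;
    for (std::size_t i = 0; i < list.count(); ++i) {
        if (list.getName(i).size() > longest)
            longest = list.getName(i).size();
    }
    return longest;
}
```

Defining the accessors inside the class body also makes them implicitly inline, so their bodies are visible wherever the header is included.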

听风吹 2024-09-03 09:41:39

As a rule, you should not have all that garbage in your "for condition" unless the result is going to be changing during your loop execution.

Use another variable, set outside the loop. This will eliminate the WTF when reading the code, it will not negatively impact performance, and it will sidestep the question of how well the functions get optimized. If those calls are not optimized, this will also result in a performance increase.

转角预定愛 2024-09-03 09:41:39

I think in this case you are asking the compiler to do more than it legitimately can, given the compile-time information it has access to. In particular cases the messy condition may be optimized away, but in general the compiler has no good way to know what side effects that long chain of function calls might have. I would assume that hoisting the test out of the loop is faster unless I had benchmarks (or disassembly) showing otherwise.
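
One concrete reason the compiler cannot always hoist the condition is that the loop body may change what the condition reads, in which case the two forms are not equivalent. The sketch below is an illustration only: growLive and growHoisted are hypothetical names, and std::vector is used purely for brevity even though the question asks not to assume STL types.

```cpp
#include <cstddef>
#include <vector>

// Live condition: in.size() is re-read on every iteration.
void growLive(const std::vector<int>& in, std::vector<int>& out) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        const int v = in[i];  // copy first: push_back may reallocate
        if (v < 3) out.push_back(v + 1);
    }
}

// Hoisted condition: the size is read once, before the loop starts.
void growHoisted(const std::vector<int>& in, std::vector<int>& out) {
    const std::size_t n = in.size();
    for (std::size_t i = 0; i < n; ++i) {
        const int v = in[i];
        if (v < 3) out.push_back(v + 1);
    }
}
```

When `in` and `out` refer to different vectors, the two functions do the same thing. When the same vector is passed for both, growLive also visits the elements it appended, so starting from {1, 2, 3} it ends with six elements while growHoisted ends with five. A compiler that cannot prove the two references never alias has to keep re-reading the size, which is the situation this answer describes.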

This is one of the cases where a JIT compiler has a big advantage over a C++ compiler. It can in principle optimize for the most common case seen at runtime and generate optimized machine code for that case (plus checks to make sure that execution really falls into it). This sort of thing is used all the time for polymorphic method calls that turn out not to be used polymorphically; whether it could catch something as complex as your example, though, I'm not certain.

For what it's worth, if speed really mattered, I'd split it up in Java too.
