Can virtual function lookups be cached in C++?
Say I have a virtual function call foo() on an abstract base class pointer, mypointer->foo(). When my app starts up, based on the contents of a file, it chooses to instantiate a particular concrete class and assigns mypointer to that instance. For the rest of the app's life, mypointer will always point to objects of that concrete type. I have no way to know what this concrete type is (it may be instantiated by a factory in a dynamically loaded library). I only know that the type will stay the same after the first time an instance of the concrete type is made. The pointer may not always point to the same object, but the object will always be of the same concrete type. Notice that the type is technically determined at 'runtime' because it's based on the contents of a file, but that after 'startup' (file is loaded) the type is fixed.
However, in C++ I pay the virtual function lookup cost every time foo is called for the entire duration of the app. The compiler can't optimize the lookup away because there's no way for it to know that the concrete type won't vary at runtime (even if it were the most amazing compiler ever, it can't speculate on the behavior of dynamically loaded libraries). In a JIT-compiled language like Java or .NET the JIT can detect that the same type is being used over and over and do inline caching. I'm basically looking for a way to manually do that for specific pointers in C++.
Is there any way in C++ to cache this lookup? I realize that solutions might be pretty hackish. I'm willing to accept ABI/compiler specific hacks if it's possible to write configure tests that discover the relevant aspects of the ABI/compiler so that it's "practically portable" even if not truly portable.
Update: To the naysayers: If this wasn't worth optimizing, then I doubt modern JITs would do it. Do you think Sun's and MS's engineers were wasting their time implementing inline caching, and didn't benchmark it to ensure there was an improvement?
So assuming that this is a fundamental issue you want to solve (to avoid premature optimization arguments), and ignoring platform and compiler specific hackery, you can do one of two things, at opposite ends of complexity:
All the answers are dealing with the simplest scenario, where calling a virtual method only requires getting the address of the actual method to call. In the general case, when multiple and virtual inheritance come into play, calling a virtual method requires shifting the `this` pointer. The method dispatch mechanism can be implemented in more than one way, but it is common to find that the entry in the virtual table is not the actual method to call, but rather some intermediate 'trampoline' code inserted by the compiler that relocates the `this` pointer prior to calling the actual method.

When the dispatch is the simplest, just an extra pointer redirection, then trying to optimize it does not make sense. When the problem is more complex, then any solution will be compiler-dependent and hackish. Moreover, you do not even know what scenario you are in: if the objects are loaded from DLLs then you don't really know whether the actual instance returned belongs to a simple linear inheritance hierarchy or to a more complex scenario.
So, what you basically want to do is convert runtime polymorphism into compile time polymorphism. Now you still need to build your app so that it can handle multiple "cases", but once it's decided which case is applicable to a run, that's it for the duration.
Here's a model of the runtime polymorphism case:
This takes ~14s to execute on my Core2, compiled with gcc 4.3.2 (32-bit Debian) with the `-O3` option. Now suppose we replace the "work" version with a templated version (templated on the concrete type it's going to be working on):
`main` doesn't actually need to be updated, but note that the two calls to `work` now trigger instantiations of, and calls to, two different type-specific functions (cf. the one polymorphic function previously). Hey presto, it runs in 0.001s. Not a bad speedup factor for a two-line change! However, note that the massive speedup is entirely due to the compiler: once the possibility of runtime polymorphism in the `work` function is eliminated, it just optimizes away the loop and compiles the result directly into the code. But that actually makes an important point: in my experience, the main gains from using this sort of trick come from the opportunities for improved inlining and optimisation it allows the compiler when a less-polymorphic, more specific function is generated, not from the mere removal of the vtable indirection (which really is very cheap).

But I really don't recommend doing stuff like this unless profiling absolutely indicates that runtime polymorphism is really hitting your performance. It'll also bite you as soon as someone subclasses `Foo` or `Bar` and tries to pass that into a function actually intended for its base. You might find this related question interesting too.
I have seen situations where avoiding a virtual function call is beneficial. This does not look to me to be one of those cases because you really are using the function polymorphically. You are just chasing one extra address indirection, not a huge hit, and one that might be partially optimized away in some situations. If it really does matter, you may want to restructure your code so that type-dependent choices such as virtual function calls are made fewer times, pulled outside of loops.
If you really think it's worth giving it a shot, you can set a separate function pointer to a non-virtual function specific to the class. I might (but probably wouldn't) consider doing it this way.
Other than making the algorithm itself a bit wiser, I suspect any attempt to manually optimize the virtual function call will cause more problems than it solves.
You can't use a method pointer because pointers to member functions aren't considered covariant return types. See the example below:
The other option would be to have the method declared as `pt2base method()`, but then the return would be invalid because `der::testmethod` is not of type `pt2base`. Also, even if you had a method that received a pointer or reference to the base type, you would have to dynamically cast it to the derived type in that method to do anything particularly polymorphic, which adds back in the cost we're trying to save.
I asked a very similar question recently, and got the answer that it's possible as a GCC extension, but not portably:
C++: Pointer to monomorphic version of virtual member function?
In particular, I also tried it with Clang and it doesn't support this extension (even though it supports many other GCC extensions).
Could you use a method pointer?
The objective here is that the compiler would load the pointer with the location of the resolved method or function. This would occur once. After the assignment, the code would access the method in a more direct fashion.
I know that a pointer to an object and accessing the method via the object pointer invokes run-time polymorphism. However, there should be a way to load a method pointer to a resolved method, avoiding the polymorphism and directly calling the function.
I've checked the community wiki to introduce more discussion.
There are two costs to a virtual function call: The vtable lookup and the function call.
The vtable lookup is already taken care of by the hardware. Modern CPUs (assuming you're not working on a very simple embedded CPU) will predict the address of the virtual function in their branch predictor and speculatively execute it in parallel with the array lookup. The fact that the vtable lookup happens in parallel with the speculative execution of the function means that, when executed in a loop in the situations you describe, virtual function calls have next to zero overhead compared to direct, non-inlined function calls.
I've actually tested this in the past, albeit in the D programming language, not C++. When inlining was disabled in the compiler settings and I called the same function in a loop several million times, the timings were within epsilon of each other whether the function was virtual or not.
The second and more important cost of virtual functions is that they prevent inlining of the function in most cases. This is even more important than it sounds because inlining is an optimization that can enable several other optimizations such as constant folding in some cases. There's no way to inline a function without recompiling the code. JITs get around this because they're constantly recompiling code during the execution of your application.
Why is a virtual call expensive? Because you simply don't know the branch target until the code executes at runtime. Even modern CPUs still do not handle virtual and indirect calls perfectly. One can't simply say it costs nothing just because we have faster CPUs now; it does not come for free.
1. How can we make it fast?
You already have a pretty deep understanding of the problem. The only thing I can say is that if the virtual function call is easy to predict, then you can perform software-level optimization. But if it's not (i.e., you really have no idea what the target of the virtual function will be), then I don't think there is a good solution for now. Even for the CPU, such an extreme case is hard to predict.

Actually, compilers such as Visual C++ with PGO (profile-guided optimization) have a virtual call speculation optimization (link). If the profiling result can enumerate hot virtual function targets, those calls are translated into direct calls, which can then be inlined. This is also called devirtualization. It can also be found in some Java dynamic optimizers.
2. To those who say it's not necessary

If you're using scripting languages or C# and care mainly about coding efficiency, then yes, this is worthless. However, for anyone eager to save every single cycle to obtain better performance, indirect branches are still an important problem. Even the latest CPUs do not handle virtual calls well. One good example is a virtual machine or interpreter, which usually has a very large switch-case; its performance depends heavily on correct prediction of indirect branches. So you can't simply say it's too low-level or not necessary. Hundreds of people are working to improve performance at the bottom of the stack; that's why you can afford to ignore such details :)
3. Some boring computer architectural facts related to virtual functions
dsimcha has written a good answer on how CPUs can handle virtual calls effectively. But it's not exactly correct. First, all modern CPUs have branch predictors, which literally predict the outcome of a branch to increase pipeline throughput (or, put differently, instruction-level parallelism, ILP; I would even say that single-thread CPU performance depends entirely on how much ILP you can extract from a single thread, and branch prediction is the most critical factor for obtaining higher ILP).

In branch prediction, there are two predictions: (1) direction (i.e., is the branch taken or not? a binary answer), and (2) branch target (i.e., where will I go? not a binary answer). Based on the prediction, the CPU speculatively executes the code. If the speculation turns out to be wrong, the CPU rolls back and restarts from the mispredicted branch. This is completely hidden from the programmer's view, so you don't really know what's going on inside the CPU unless you profile with a tool like VTune, which reports branch misprediction rates.

In general, branch direction prediction is highly accurate (95%+), but branch targets are still hard to predict, especially for virtual calls and switch-cases (i.e., jump tables). A virtual call is an indirect branch, which requires an extra memory load, and the CPU also needs branch target prediction. Modern CPUs like Intel's Nehalem and AMD's Phenom have a specialized indirect branch target table.

However, I don't think looking up the vtable incurs much overhead. Yes, it requires an extra memory load, which can cause a cache miss, but once the vtable is loaded into the cache, it's pretty much always a cache hit. If you're also concerned about that cost, you can insert prefetching code to load the vtable in advance. The real difficulty of virtual function calls is that the CPU can't do a great job of predicting the target of the virtual call, which can frequently drain the pipeline due to target mispredictions.