Why do implementations of std::any use a function pointer + function op-codes, instead of a pointer to a virtual table + virtual calls?
Both the GCC and LLVM implementations of std::any store a function pointer in the any object and call that function with an Op/Action argument to perform different operations. Here is an example of that function from LLVM:
static void* __handle(_Action __act, any const * __this,
                      any * __other, type_info const * __info,
                      void const* __fallback_info)
{
    switch (__act)
    {
    case _Action::_Destroy:
        __destroy(const_cast<any &>(*__this));
        return nullptr;
    case _Action::_Copy:
        __copy(*__this, *__other);
        return nullptr;
    case _Action::_Move:
        __move(const_cast<any &>(*__this), *__other);
        return nullptr;
    case _Action::_Get:
        return __get(const_cast<any &>(*__this), __info, __fallback_info);
    case _Action::_TypeInfo:
        return __type_info();
    }
    __libcpp_unreachable();
}
Note: This is just one __handle function, but there are two such functions in each any implementation: one for small objects (small-buffer optimization) allocated within the any itself and one for big objects allocated on the heap. Which one is used depends on the value of the function pointer stored in the any object.
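To make the technique concrete, here is a heavily simplified sketch of the idea (made-up names, not the actual libc++ code): the object stores one handler pointer, and every operation funnels through it with an op-code.

#include <cstring>
#include <new>
#include <utility>

// Hypothetical op-codes; libc++ uses _Action::_Destroy, _Copy, _Move, _Get, _TypeInfo.
enum class Op { Destroy, Move };

struct SketchAny {
    // One pointer dispatches every operation. Its value (small_handler<T> vs
    // large_handler<T>) also encodes which storage strategy is in use.
    void (*handle)(Op, SketchAny*, SketchAny*) = nullptr;
    alignas(void*) unsigned char buf[2 * sizeof(void*)];  // in-place buffer or a heap pointer
};

// Handler instantiated for types small enough for the in-place buffer.
template <class T>
void small_handler(Op op, SketchAny* self, SketchAny* other) {
    T* p = reinterpret_cast<T*>(self->buf);
    switch (op) {
    case Op::Destroy:
        p->~T();
        self->handle = nullptr;
        break;
    case Op::Move:
        ::new (static_cast<void*>(other->buf)) T(std::move(*p));
        other->handle = self->handle;
        p->~T();
        self->handle = nullptr;
        break;
    }
}

// Handler instantiated for types that live on the heap; buf holds only a T*.
template <class T>
void large_handler(Op op, SketchAny* self, SketchAny* other) {
    T* p;
    std::memcpy(&p, self->buf, sizeof(p));
    switch (op) {
    case Op::Destroy:
        delete p;
        self->handle = nullptr;
        break;
    case Op::Move:
        std::memcpy(other->buf, &p, sizeof(p));  // steal the heap pointer
        other->handle = self->handle;
        self->handle = nullptr;
        break;
    }
}

// Constructing picks the handler: the function pointer is a compile-time
// constant per T, so no table in memory is ever consulted.
template <class T>
void emplace(SketchAny& a, T value) {
    if constexpr (sizeof(T) <= sizeof(a.buf) && alignof(T) <= alignof(void*)) {
        ::new (static_cast<void*>(a.buf)) T(std::move(value));
        a.handle = &small_handler<T>;
    } else {
        T* p = new T(std::move(value));
        std::memcpy(a.buf, &p, sizeof(p));
        a.handle = &large_handler<T>;
    }
}

Destroying or moving such an object is then a single indirect call, e.g. a.handle(Op::Destroy, &a, nullptr); a real implementation adds copy, type_info queries and any_cast support on top, but the dispatch shape is the same.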
The ability to choose one of two implementations at run-time and call a specific method from a pre-defined list of methods is essentially a manual implementation of a virtual table. I'm wondering why it was implemented this way. Wouldn't it have been easier to simply store a pointer to a virtual type?
I couldn't find any information about the reasons for this implementation. Thinking about it, I guess using a virtual class is sub-optimal in two ways (a sketch of the table-based alternative I mean follows the list):
- It needs an object instance and singleton management, whereas in reality a vtable alone (without an instance) would be enough.
- Calling a function on an any would involve two indirections: first through the pointer stored in the any to get the vtable, then through the pointer stored in the vtable. I'm not sure whether the performance of this is any different from the switch-based approach above.
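To be explicit, the table-based alternative I have in mind would be something along these lines (a rough sketch with made-up names, not taken from any real implementation):

#include <new>
#include <typeinfo>
#include <utility>

// One statically allocated table of operations per stored type T; the any
// object would hold a pointer to this table instead of a pointer to a single
// "do everything" function.
struct AnyVTable {
    void (*destroy)(void* storage);
    void (*move)(void* from, void* to);
    const std::type_info& (*type)();
};

template <class T>
inline constexpr AnyVTable vtable_for = {
    [](void* s) { static_cast<T*>(s)->~T(); },
    [](void* from, void* to) { ::new (to) T(std::move(*static_cast<T*>(from))); },
    []() -> const std::type_info& { return typeid(T); },
};

struct TableAny {
    const AnyVTable* vt = nullptr;                        // the extra level of indirection
    alignas(void*) unsigned char buf[2 * sizeof(void*)];  // storage as before
};

Destroying the contained value would then be a.vt->destroy(a.buf); two dependent loads (first vt, then destroy) before the indirect call, which is the double indirection from the second bullet point above.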
Are these the reasons for using an implementation based on switch-ing op-codes? Is there any other major advantage of the current implementation? Do you know of a link to general information about this technique?
Comments (1)
Consider a typical use case of a std::any: you pass it around in your code, move it dozens of times, store it in a data structure and fetch it again later. In particular, you'll likely return it from functions a lot.

As it is now, the pointer to the single "do everything" function is stored right next to the data in the any. Given that it's a fairly small type (16 bytes on GCC x86-64), an any fits into a pair of registers. Now, if you return an any from a function, the pointer to the "do everything" function of the any is already in a register or on the stack! You can just jump directly to it without having to fetch anything from memory. Most likely, you didn't even have to touch memory at all: you know what type is in the any at the point you construct it, so the function pointer value is just a constant that's loaded into the appropriate register. Later, you use the value of that register as your jump target. This means there's no chance of mispredicting the jump because there is nothing to predict; the value is right there for the CPU to consume.

In other words: the reason you get the jump target for free with this implementation is that the CPU must have already touched the any in some way to obtain it in the first place, meaning it already knows the jump target and can jump to it with no additional delay. That means there really is no indirection to speak of with the current implementation if the any is already "hot", which it will be most of the time, especially when it's used as a return value.

On the other hand, if you use a table of function pointers somewhere in a read-only section (and let the any instance point to that instead), you'll have to go to memory (or cache) every single time you want to move or access it. The size of an any is still 16 bytes in this case, but fetching values from memory is much, much slower than accessing a value in a register, especially if it's not in a cache. In a lot of cases, moving an any is as simple as copying its 16 bytes from one location to another, followed by zeroing out the original instance. This is pretty much free on any modern CPU. However, if you go the pointer-table route, you'll have to fetch from memory every time, wait for the read to complete, and then do the indirect call. Now consider that you'll often have to do a sequence of calls on the any (i.e. move, then destruct), and this will quickly add up. The problem is that you don't get the address of the function you want to jump to for free every time you touch the any; the CPU has to fetch it explicitly. Indirect jumps to a value read from memory are quite expensive, since the CPU can only retire the jump operation once the entire memory operation has finished. That doesn't just include fetching a value (which is potentially quite fast because of caches) but also address generation, store-forwarding buffer lookup, TLB lookup, access validation, and potentially even page table walks. So even if the jump address is computed quickly, the jump won't retire for quite a long while. In general, "indirect jump to an address read from memory" operations are among the worst things that can happen to a CPU's pipeline.

TL;DR: As it is now, returning an any doesn't stall the CPU's pipeline (the jump target is already available in a register, so the jump can retire pretty much immediately). With a table-based solution, returning an any will stall the pipeline twice: once to fetch the address of the move function, then again to fetch the destructor. This delays retirement of the jump quite a bit, since it has to wait not only for the memory value but also for the TLB and access-permission checks. Code memory accesses, on the other hand, aren't affected by this, since the code is kept in decoded form anyway (in the µOp cache). Fetching and executing a few conditional branches in that switch statement is therefore quite fast (and even more so when the branch predictor gets things right, which it almost always does).
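To make the layout difference concrete, here is a rough sketch of the two schemes (simplified, made-up types, not the actual libc++/libstdc++ code):

// Scheme used today: the handler pointer is one of the object's two words,
// so it travels in registers together with the value it manages.
struct InlineAny {
    void* (*handle)(int op, InlineAny* self, InlineAny* other);  // word 1
    void* storage;                                               // word 2
};

void destroy_inline(InlineAny& a) {
    a.handle(/*Destroy*/ 0, &a, nullptr);  // one indirect call; the target came with the object
}

// Table-based alternative: the object points at a table in read-only memory.
struct VTableAny;

struct AnyOps {
    void (*destroy)(VTableAny*);
    void (*move)(VTableAny*, VTableAny*);
};

struct VTableAny {
    const AnyOps* ops;  // word 1: must be dereferenced before any operation
    void* storage;      // word 2
};

void destroy_vtable(VTableAny& a) {
    a.ops->destroy(&a);  // load a.ops, then load ops->destroy, then the indirect call
}

Both objects are two words wide, but only the second one needs a dependent load from memory before the indirect call can even start.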