Speed comparison - template specialization vs. virtual functions vs. if statements

Posted 2024-09-01 03:12:32

Just to get it out of the way...

Premature optimization is the root of all evil

Make use of OOP

etc.

I understand. Just looking for some advice regarding the speed of certain operations that I can store in my grey matter for future reference.

Say you have an Animation class. An animation can be looped (plays over and over) or not looped (plays once), it may have unique frame times or not, etc. Let's say there are 3 of these "either or" attributes. Note that any method of the Animation class will at most check for one of these (i.e. this isn't a case of a giant branch of if-elseif).

Here are some options.

1) Give it boolean members for the attributes given above, and use an if statement to check against them when playing the animation to perform the appropriate action.

  • Problem: Conditional checked every single time the animation is played.

2) Make a base animation class, and derive other animation classes such as LoopedAnimation and AnimationUniqueFrames, etc.

  • Problem: Vtable check upon every call to play the animation given that you have something like a vector<Animation>. Also, making a separate class for all of the possible combinations seems code bloaty.

3) Use template specialization, and specialize those functions that depend on those attributes. Like template<bool looped, bool uniqueFrameTimes> class Animation.

  • Problem: The problem with this is that you couldn't just have a vector<Animation> for something's animations. Could also be bloaty.
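To make the three options concrete, here is a minimal sketch with a single "looped" attribute. All class and function names (FlagAnimation, OneShotAnimation, StaticAnimation, runFlag and so on) are made up for illustration, and the frame logic is an assumption about what "playing" means. Note that option 2 needs a container of pointers, since a std::vector<Animation> holding base objects by value would slice the derived ones.

```cpp
#include <memory>
#include <vector>

// Option 1: one class, a bool member checked while advancing.
struct FlagAnimation {
    bool looped = false;
    int frame = 0;
    int frameCount = 4;

    void advance() {
        ++frame;
        if (frame >= frameCount) {
            // The only place the attribute is tested.
            frame = looped ? 0 : frameCount - 1;
        }
    }
};

// Option 2: base class plus derived classes, dispatched through the vtable.
struct Animation {
    int frame = 0;
    int frameCount = 4;
    virtual ~Animation() = default;
    virtual void advance() = 0;
};

struct LoopedAnimation : Animation {
    void advance() override { frame = (frame + 1) % frameCount; }
};

struct OneShotAnimation : Animation {
    void advance() override { if (frame < frameCount - 1) ++frame; }
};

// Option 3: the attribute is a template parameter, so the test on Looped
// is a compile-time constant that the optimizer can fold away.
template <bool Looped>
struct StaticAnimation {
    int frame = 0;
    int frameCount = 4;

    void advance() {
        ++frame;
        if (frame >= frameCount) frame = Looped ? 0 : frameCount - 1;
    }
};

// Hypothetical helpers: advance one animation `steps` times, return its frame.
int runFlag(bool looped, int steps) {
    FlagAnimation a;
    a.looped = looped;
    for (int i = 0; i < steps; ++i) a.advance();
    return a.frame;
}

int runVirtual(bool looped, int steps) {
    std::vector<std::unique_ptr<Animation>> anims;  // pointers, to avoid slicing
    if (looped) anims.push_back(std::make_unique<LoopedAnimation>());
    else        anims.push_back(std::make_unique<OneShotAnimation>());
    for (int i = 0; i < steps; ++i)
        for (auto& a : anims) a->advance();
    return anims.front()->frame;
}

template <bool Looped>
int runStatic(int steps) {
    StaticAnimation<Looped> a;
    for (int i = 0; i < steps; ++i) a.advance();
    return a.frame;
}
```

All three variants produce the same frames for the same input; the differences are in where the decision is made (a data branch, an indirect call, or at compile time) and in which of them can share one container.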

I'm wondering what kind of speed each of these options offers. I'm particularly interested in the 1st and 2nd options because the 3rd doesn't allow one to iterate through a general container of Animations.

In short, what is faster - a vtable fetch or a conditional?

Comments (4)

眼眸 2024-09-08 03:12:32

(1) Not that the size of the generated assembly matters anymore these days, but this is what it generates (approximately, assuming MSVC on x86):

movzx eax, byte ptr [ecx+12] ; 'this' pointer stored in ecx, eax is scratch; a bool is a single byte
test  eax, eax               ; test for 0
jz    .somewhereElse         ; jump if the bool isn't set

The optimizing compiler will intersperse other instructions there, making it more pipeline-friendly. The contents of your class will most likely be in your cache anyway, and if it's not, it will be needed a few cycles later anyway. So, in retrospect, that's maybe a few cycles, and for something that will be called at most a few times per frame, that's nothing.

(2) This is approximately the assembly that will be generated every time your play() method is called:

mov  ecx, [ebp-4]    ; your Animation* from somewhere on the stack; ecx becomes 'this'
mov  eax, [ecx]      ; load the vtable pointer stored at the start of the object
mov  eax, [eax+12]   ; load the address of play() from its vtable slot
call eax             ; call it

Then, you'll have some duplicate code or another function call inside your specialized play() function, since there'll definitely be some common stuff, so that incurs some overhead (in code size and/or execution speed). So, this is definitely slower.

Also, this makes it a lot harder to load generic animations. Your graphics department won't be happy.

(3) To use this effectively, you'll end up making a base class for your templated version anyway, with virtual functions (in that case, see (2)), OR you'll do it manually by checking types in places where you call your animation thing, in which case also see (2).

This also makes it MUCH harder to load generic animations. Your graphics department will be even less happy.
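A minimal sketch of what point (3) describes: the templated class from the question ends up behind a virtual base anyway so that it can live in one container. AnimationBase, frame(), makeMixed() and advanceAll() are hypothetical names, and the frame logic is an assumption.

```cpp
#include <memory>
#include <vector>

// Hypothetical common interface; this is where the vtable comes back in.
struct AnimationBase {
    virtual ~AnimationBase() = default;
    virtual void advance() = 0;
    virtual int frame() const = 0;
};

// The template from the question. UniqueFrameTimes is carried only to
// mirror that signature; this sketch does not use it.
template <bool Looped, bool UniqueFrameTimes>
class Animation final : public AnimationBase {
public:
    explicit Animation(int frameCount) : frameCount_(frameCount) {}

    void advance() override {
        ++frame_;
        // Looped is a compile-time constant inside each instantiation.
        if (frame_ >= frameCount_) frame_ = Looped ? 0 : frameCount_ - 1;
    }

    int frame() const override { return frame_; }

private:
    int frame_ = 0;
    int frameCount_;
};

// One container for every instantiation, at the cost of a virtual call.
std::vector<std::unique_ptr<AnimationBase>> makeMixed() {
    std::vector<std::unique_ptr<AnimationBase>> v;
    v.push_back(std::make_unique<Animation<true, false>>(4));
    v.push_back(std::make_unique<Animation<false, true>>(4));
    return v;
}

void advanceAll(std::vector<std::unique_ptr<AnimationBase>>& anims) {
    for (auto& a : anims) a->advance();
}
```

The attribute checks inside advance() are resolved at compile time, but every call from the container still goes through the vtable, which is why the answer points back to (2).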

(4) What you need to worry about is not some micro-optimization for tiny things done at most a few times a frame. From reading your post, I actually identified another problem that's commonly overlooked. You're mentioning std::vector<Animation>. Nothing against the STL, but that's bad voodoo. A single memory allocation will cost you more cycles than all the boolean checks in your play() or update() methods for probably the entire time your application is running. Putting Animations in and out of std::vectors (especially if you're putting in instances and not pointers (smart or dumb) to instances) will cost you way more.

You need to look at different places to optimize. This is such a ridiculous micro-optimization that it will bring you no benefit; all it does is make the code harder to generalize and your graphics department unhappy. What will matter, however, is worrying about memory allocation, and THEN, when you're done programming that part, starting a profiler and looking at where the hot spots are.

If keeping your animations is actually becoming a bottleneck, the std::vector (nice as it is) is where you might want to look. Have you looked at, say, an intrusive linked list? That will actually be of more benefit than worrying about this.
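For reference, a rough sketch of what an intrusive linked list looks like: the links live inside the Animation object itself, so adding or removing one never allocates or copies. ListHook, AnimationList and demoIntrusiveList are hypothetical names (a production version would more likely come from a library such as Boost.Intrusive), and the sketch assumes every object outlives its time on the list. Whether this actually beats std::vector depends on your access pattern, so profile it.

```cpp
#include <cstddef>

// Hypothetical hook: the prev/next pointers are part of the object.
struct ListHook {
    ListHook* prev = nullptr;
    ListHook* next = nullptr;
};

struct Animation : ListHook {
    explicit Animation(int id_) : id(id_) {}
    int id;
};

class AnimationList {
public:
    // Circular list with a sentinel node, so there are no null checks.
    AnimationList() { head_.prev = &head_; head_.next = &head_; }
    AnimationList(const AnimationList&) = delete;
    AnimationList& operator=(const AnimationList&) = delete;

    // Links an existing object in; no allocation, no copy.
    void push_back(Animation& a) {
        a.prev = head_.prev;
        a.next = &head_;
        head_.prev->next = &a;
        head_.prev = &a;
    }

    // O(1) removal given only the object itself.
    static void unlink(Animation& a) {
        a.prev->next = a.next;
        a.next->prev = a.prev;
        a.prev = a.next = nullptr;
    }

    std::size_t size() const {
        std::size_t n = 0;
        for (const ListHook* p = head_.next; p != &head_; p = p->next) ++n;
        return n;
    }

    int sumIds() const {
        int sum = 0;
        for (const ListHook* p = head_.next; p != &head_; p = p->next)
            sum += static_cast<const Animation*>(p)->id;  // every node is an Animation
        return sum;
    }

private:
    ListHook head_;
};

// Links three animations, removes the middle one, returns the remaining id sum.
int demoIntrusiveList() {
    Animation a(1), b(2), c(3);
    AnimationList list;
    list.push_back(a);
    list.push_back(b);
    list.push_back(c);
    AnimationList::unlink(b);
    return list.sumIds();
}
```

The trade-off is that the container no longer owns its elements, so lifetime management becomes your job.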

明媚如初 2024-09-08 03:12:32

(Edited for brevity.)

The compiler, CPU, and OS can all change the answer here:

  • CPU: instruction/data cache size, architecture, and behavior, especially any intelligent prefetch
  • CPU: branch prediction and speculative execution behavior
  • CPU: the penalty for a mispredicted branch
  • compiler and CPU: the availability and relative cost of conditionally-executed instructions (helps with branch cases that only cover a few instructions)
  • compiler or linker: optimizations that may transform your code and remove branches

In short, as Blindy said in the comments: test it. =)

If you're writing for a modern desktop OS or OSes, enlist the help of a profiling tool (valgrind, shark, codeanalyst, vtune, etc.) -- it may give you details you never even knew you could look for, such as cache misses, branch mispredicts, etc.

Even if you don't find a great answer, you'll learn something from applying the tool. I often find looking at the disassembly quite instructive, too (see some of the other answers in this thread).
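If you want a starting point for "test it", a rough timing harness might look like the sketch below. Everything in it (FlagAnim, VirtAnim, runFlags, runVirtual, elapsedMs, the frame count of 8) is a made-up example rather than anyone's real code, and a loop like this measures one compiler, one CPU and one data layout, so treat the numbers as a hint and not as an answer.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <vector>

// Variant 1: a bool member tested on each advance.
struct FlagAnim {
    explicit FlagAnim(bool l) : looped(l) {}
    bool looped;
    int frame = 0;
    void advance(int frameCount) {
        if (++frame >= frameCount) frame = looped ? 0 : frameCount - 1;
    }
};

// Variant 2: the same behavior behind a virtual call.
struct VirtAnim {
    int frame = 0;
    virtual ~VirtAnim() = default;
    virtual void advance(int frameCount) = 0;
};
struct VirtLooped : VirtAnim {
    void advance(int n) override { if (++frame >= n) frame = 0; }
};
struct VirtOneShot : VirtAnim {
    void advance(int n) override { if (++frame >= n) frame = n - 1; }
};

// Both runners alternate looped / one-shot animations and return the sum of
// the final frames, so the work has a visible result the optimizer must keep.
long long runFlags(std::size_t count, int steps) {
    std::vector<FlagAnim> anims;
    for (std::size_t i = 0; i < count; ++i) anims.emplace_back(i % 2 == 0);
    for (int s = 0; s < steps; ++s)
        for (auto& a : anims) a.advance(8);
    long long sum = 0;
    for (const auto& a : anims) sum += a.frame;
    return sum;
}

long long runVirtual(std::size_t count, int steps) {
    std::vector<std::unique_ptr<VirtAnim>> anims;
    for (std::size_t i = 0; i < count; ++i) {
        if (i % 2 == 0) anims.push_back(std::make_unique<VirtLooped>());
        else            anims.push_back(std::make_unique<VirtOneShot>());
    }
    for (int s = 0; s < steps; ++s)
        for (auto& a : anims) a->advance(8);
    long long sum = 0;
    for (const auto& a : anims) sum += a->frame;
    return sum;
}

// Wall-clock time of a callable, in milliseconds.
template <class F>
double elapsedMs(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A typical use would be to call elapsedMs with a lambda that runs runFlags or runVirtual over a few thousand animations and many steps, in an optimized build, several times each, and then confirm anything surprising with a profiler or by reading the disassembly.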

Some slightly more speculative notes:

  • vtable tends to result in a load (this+0), offset, second load, and then branch on the contents of the register. You can see this in some of the other answers. Most CPUs that I'm familiar with are miserable at predicting branches from registers.
  • the bool may be near other data you're using and as such may already be cached. The branch target is also likely to be fixed and therefore a lot more friendly for prediction and/or speculative execution.
  • on some processors (rarer these days), it costs more to load a bool than an int.
  • on an ARM processor I work with, we occasionally tuck the vtables in "tightly coupled memory" on the processor core. Decreases the indirect load time considerably -- it's as if the vtable is always in-cache or better.

As you mentioned, the usual rule applies: do what fits requirements and is flexible/maintainable/readable first, then optimize.

Further reading / other patterns to pursue:

Both the "Data Oriented Design" and the "Component-Based Entity" paradigms are useful to keep in your brain for games, multimedia engines, and other things where you have a greater-than-average demand for performance and still want to keep your code somewhat organized. YMMV, of course. =)

染墨丶若流云 2024-09-08 03:12:32

Vtable calls are very, very fast. So are simple conditionals. They translate to single digits of CPU instructions. Worrying about this kind of performance gets you into the murky waters of compiler optimisations, where you don't at all understand what the compiler is doing. Chances are, very subtle changes in your program can trump the minute differences between an if statement and a vtable.

I did a little test a while ago testing differences between RTTI multiple dispatch and vtable. In release mode, a dispatch between three objects (two vtable calls) done over two million iterations took 62 milliseconds. That is way, way not even worth worrying about.

鹊巢 2024-09-08 03:12:32

Who says #3 makes it impossible to have a generic container of animations? There are several approaches one can use. They do all boil down to eventually making a polymorphic call but the options are there. Consider this:

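#include <boost/any.hpp>
#include <vector>

// my_operation is a registry you would write yourself; it is not part of Boost.
void function(boost::any & a)
{
  // dispatch on the name of the type currently held in 'a'
  my_operation::execute(a.type().name(), a);
}

std::vector<boost::any> generic_container;
// ... fill the container, then, inside some function:
function(generic_container[0]);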

my_operation just needs to have a way of registering and filtering operations by type name. It searches for a functor that operates on whatever a represents, and uses it. The functor then any_casts to the appropriate type and does the type-specific operation.

Or use a visitor framework. The above is sort of a variation of that but at too generic a level to really qualify.
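One way to get a visitor-style dispatch without writing a framework, assuming C++17 is available, is std::variant with std::visit. This is a swapped-in technique, not the boost::any approach above, and the names (AnyAnimation, advanceAll, frameOf) and the frame logic are made up for the sketch.

```cpp
#include <variant>
#include <vector>

// Compile-time attribute, as in option 3 of the question.
template <bool Looped>
struct Animation {
    int frame = 0;
    int frameCount = 4;

    void advance() {
        ++frame;
        if (frame >= frameCount) frame = Looped ? 0 : frameCount - 1;
    }
};

// A closed set of instantiations that can share one container by value.
using AnyAnimation = std::variant<Animation<true>, Animation<false>>;

// std::visit picks the right instantiation from the stored alternative.
void advanceAll(std::vector<AnyAnimation>& anims) {
    for (auto& a : anims)
        std::visit([](auto& anim) { anim.advance(); }, a);
}

int frameOf(const AnyAnimation& a) {
    return std::visit([](const auto& anim) { return anim.frame; }, a);
}
```

This still decides at run time which alternative is stored, so it is a dispatch like any other; it just moves the decision out of the Animation class and keeps the objects contiguous in the vector.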

And there are more possible methods. Instead of storing animations you could store a type that hides the specifics and executes the correct view options when activated. One virtual is called but it is specific to switching out concrete types that do more complex operations on each other.

In other words, there is no general answer to your question. Depending on what you need, you could go to all kinds of levels of complexity to make almost your entire program compile-time polymorphic rather than run-time polymorphic.
