Cache performance of vectors, matrices and quaternions
On a number of occasions I've noticed C and C++ code that uses the following layout for these structures:
    class Vector3
    {
        float components[3];
        // etc.
    };
    class Matrix4x4
    {
        float components[16];
        // etc.
    };
    class Quaternion
    {
        float components[4];
        // etc.
    };
My question is: will this lead to any better cache performance than, say, this:
    class Quaternion
    {
        float x;
        float y;
        float z;
        float w;
        // etc.
    };
...given that I'd assume the class members and functions are in contiguous memory anyway? I currently use the latter form because I find it more convenient (though I can also see the practical sense in the array form, since it lets you treat the axes as arbitrary, depending on the operation being performed).
After taking some advice from the respondents, I tested the difference, and the array version is actually slower: I get about a 3% difference in framerate. I implemented operator[] to wrap the array access inside Vector3. I'm not sure whether that has anything to do with it, but I doubt it, since it should be inlined anyway. The only other factor I could see was that I could no longer use a constructor initializer list in Vector3(x, y, z). However, when I took the original version and changed it to no longer use a constructor initializer list, it ran only very marginally slower than before (less than 0.05%). No clue why, but at least now I know the original approach was faster.
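For reference, a minimal sketch of what an array-backed Vector3 with an operator[] wrapper might look like. This is not the asker's actual code: the class body, the `dot` helper and all names are illustrative assumptions. It also assumes a C++11 (or later) compiler, where an array member can be brace-initialized in the constructor initializer list, so the array form does not have to give that up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical array-backed Vector3 (illustrative only, not the asker's code).
class Vector3
{
public:
    // C++11: the array member is brace-initialized in the initializer list.
    Vector3(float x, float y, float z) : components{x, y, z} {}

    // Thin accessors; defined in-class, so they are implicitly inline.
    float& operator[](std::size_t i) { return components[i]; }
    const float& operator[](std::size_t i) const { return components[i]; }

private:
    float components[3];
};

// Example of treating the axes generically through operator[].
inline float dot(const Vector3& a, const Vector3& b)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < 3; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Whether the accessors are actually inlined, and whether the loop is unrolled, depends on the compiler and optimization level, so a measured difference like the 3% above can only be explained by profiling or by inspecting the generated code.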
Comments (3)
These declarations are not equivalent with respect to memory layout.
The array form guarantees that the elements are contiguous in memory, whereas with individual members, as in your last example, the compiler is allowed to insert padding between them (for instance, to align the members to certain address patterns).
Whether this results in better or worse performance depends mostly on your compiler, so you'd have to profile it.
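One way to see what a particular compiler actually does is to inspect the member offsets. The sketch below is only an illustration: `QuatArray`, `QuatNamed` and `namedMembersArePacked` are made-up names, and the result is a property of the compiler and target, not something the language guarantees for the named-member form.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two layouts discussed above.
struct QuatArray { float components[4]; };
struct QuatNamed { float x; float y; float z; float w; };

// Returns true if this compiler placed the named members back to back,
// i.e. with the same offsets and total size as the array form.
inline bool namedMembersArePacked()
{
    return offsetof(QuatNamed, x) == 0
        && offsetof(QuatNamed, y) == 1 * sizeof(float)
        && offsetof(QuatNamed, z) == 2 * sizeof(float)
        && offsetof(QuatNamed, w) == 3 * sizeof(float)
        && sizeof(QuatNamed) == sizeof(QuatArray);
}
```

On mainstream compilers, members of the same type normally end up with no padding between them, so this check is expected to return true there; in that case the two forms occupy the same bytes and any speed difference has to come from elsewhere.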
I imagine the performance difference from an optimization like this is minimal, and for most code I would call it premature optimization. However, if you plan to do vector processing over your structs, say with CUDA, the way the structs are composed makes an important difference. See page 23 of these slides if interested: http://www.eecis.udel.edu/~mpellegr/eleg662-09s/li.pdf
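The linked slides are not reproduced here, so the following is only a sketch under the assumption that "struct composition" refers to the usual array-of-structures (AoS) versus structure-of-arrays (SoA) choice. All type and function names are hypothetical, and the code is plain C++ on the host rather than CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS element: x, y and z of one vector are adjacent in memory.
struct Vec3 { float x, y, z; };

// SoA batch: all x values are adjacent, then all y values, then all z values.
struct Vec3Batch
{
    std::vector<float> x, y, z;

    void push(float px, float py, float pz)
    {
        x.push_back(px);
        y.push_back(py);
        z.push_back(pz);
    }
    std::size_t size() const { return x.size(); }
};

// AoS: successive x values are sizeof(Vec3) bytes apart.
inline void scaleAoS(std::vector<Vec3>& v, float s)
{
    for (Vec3& p : v) { p.x *= s; p.y *= s; p.z *= s; }
}

// SoA: each component is a unit-stride stream, which suits SIMD lanes
// and coalesced per-thread access on a GPU.
inline void scaleSoA(Vec3Batch& b, float s)
{
    for (std::size_t i = 0; i < b.size(); ++i)
    {
        b.x[i] *= s;
        b.y[i] *= s;
        b.z[i] *= s;
    }
}
```

Both functions compute the same result; the difference is only in how the data is laid out, which is what matters once many elements are processed in lockstep.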
I am not sure whether the compiler manages to optimize code better when an array is used in this context (think of unions, for example), but with APIs like OpenGL it can be an optimization to call a function that takes a pointer to the whole array instead of one that takes each component as a separate argument. In the latter case each parameter is passed by value, whereas in the former only a pointer to the whole array is passed and the function can decide what to copy and when, which reduces unnecessary copy operations.
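The two calls this answer originally showed did not survive on this page, so the sketch below uses hypothetical stand-in functions rather than real OpenGL entry points. Legacy OpenGL does offer pairs of this shape (for example glVertex3f and glVertex3fv), though those are not necessarily the calls the answer had in mind.

```cpp
#include <cassert>

// Hypothetical array-backed vector, as in the question.
struct Vector3 { float components[3]; };

// Hypothetical scalar variant: three floats passed by value.
inline float lengthSquared3f(float x, float y, float z)
{
    return x * x + y * y + z * z;
}

// Hypothetical pointer variant: one pointer passed, the callee reads
// whatever it needs directly from the caller's array.
inline float lengthSquared3fv(const float* v)
{
    return v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
}
```

With the array layout, the pointer variant can be called as `lengthSquared3fv(v.components)` with no repacking. For three floats the cost difference is likely small and may disappear entirely once the calls are inlined, so this is a convenience as much as an optimization.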