高效的操作员+
我必须计算 3d 向量的大量总和,并且使用具有重载运算符 + 和运算符 * 的向量类与单独组件的求和进行比较,显示性能差异约为三倍。我知道假设差异一定是由于重载运算符中对象的构造造成的。
如何避免构造并提高性能?
我特别困惑,因为下面是基本上的标准方法,我希望编译器能够优化它。在现实生活中,求和不会在循环内完成,而是在相当大的表达式中(预可执行文件总共有几十MB)对不同的向量求和,这就是下面使用operator+的原因。
class Vector
{
double x,y,z;
...
Vector&
Vector::operator+=(const Vector &v)
{
x += v.x;
y += v.y;
z += v.z;
return *this;
}
Vector
Vector::operator+(const Vector &v)
{
return Vector(*this) += v; // bad: construction and copy(?)
}
...
}
// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];
// assume xx, yy, zz and vec are properly initialized
Vector sum(0,0,0);
for(int i = 0; i < N; ++i)
{
sum = sum + vec[i];
}
// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
sumxx = sumxx + xx[i];
sumyy = sumyy + yy[i];
sumzz = sumzz + zz[i];
}
非常感谢任何帮助。
编辑: 感谢大家的宝贵意见,我现在的表现处于同一水平。 @Dima,尤其是@Xeo 的回答成功了。我希望我可以将多个答案标记为“已接受”。我也会测试其他一些建议。
I have to compute large sums of 3d vectors and a comparison of using a vector class with overloaded operator+ and operator* versus summing up of separate components shows a performance difference of about a factor of three. I know assume the difference must be due to construction of objects in the overloaded operators.
How can one avoid the construction and improve performance?
I'm espacially puzzled, because the following is afaik basically the standard way to do it and I would expect the compiler to optimize this. In real life, the sums are not going to be done within a loop but in quite large expressions (several tens of MBs in total pre executable) summing up different vectors, this is why operator+ is used below.
class Vector
{
double x,y,z;
...
Vector&
Vector::operator+=(const Vector &v)
{
x += v.x;
y += v.y;
z += v.z;
return *this;
}
Vector
Vector::operator+(const Vector &v)
{
return Vector(*this) += v; // bad: construction and copy(?)
}
...
}
// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];
// assume xx, yy, zz and vec are properly initialized
Vector sum(0,0,0);
for(int i = 0; i < N; ++i)
{
sum = sum + vec[i];
}
// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
sumxx = sumxx + xx[i];
sumyy = sumyy + yy[i];
sumzz = sumzz + zz[i];
}
Any help is greatly appreciated.
EDIT:
Thank you all for your great input, I have the performance now at the same level.
@Dima's and especially @Xeo's answer did the trick. I wish I could mark more than one answer "accepted". I'll test some of the other suggestions too.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
这篇文章对如何优化
+
、-
、*
、/
等运算符。将
operator+
实现为一个自由函数,就像按照operator+=
一样:注意
lhs
Vector 是一个副本而不是引用。这允许编译器进行优化,例如复制省略。文章传达的一般规则是:如果您需要副本,请在参数中进行,以便编译器可以优化。本文没有使用此示例,而是使用用于复制和交换习惯用法的
operator=
。This article has a really good argumentation on how to optimize operators such as
+
,-
,*
,/
.Implement the
operator+
as a free function like this in terms ofoperator+=
:Notice on how the
lhs
Vector is a copy and not a reference. This allowes the compiler to make optimizations such as copy elision.The general rule that article conveys: If you need a copy, do it in the parameters, so the compiler can optimize. The article doesn't use this example, but the
operator=
for the copy-and-swap idiom.为什么不替换
为
... 这应该消除每次迭代对复制构造函数的两次调用和对赋值运算符的一次调用。
但一如既往,分析并了解费用来自何处,而不是猜测。
Why not replace
with
... that should eliminate two calls to the copy constructor and one call to the assignment operator for each iteration.
But as always, profile and know where the expense is coming instead of guessing.
您可能对表达式模板感兴趣。
You might be interested in expression templates.
我实现了这里提出的大部分优化,并将其与函数调用的性能进行了比较,例如
以交替顺序对每个方法重复执行相同的循环并进行数十亿个向量求和,并没有带来预期的收益。
对于 bbtrb 发布的成员函数,此方法比
isSumOf()
函数调用多花费 50% 的时间。免费的非成员运算符 + (Xeo) 方法所需的时间是 is
SumOf()
函数的两倍(多 100%)。我知道这不是一个具有代表性的测试,但因为我根本无法通过使用运算符来重现任何性能提升。如果可能的话,我建议避免使用它们。
I implemented most of the optimizations being proposed here and compared it with the performance of a function call like
Repeatedly executing same loop with a few billion vector summations for every method in alternating order, did not result in the promised gains.
In case of the member function posted by bbtrb, this method took 50% more time than the
isSumOf()
function call.Free, non member operator+ (Xeo) method needed up to double the time (100% more) of the is
SumOf()
function.I aware of the fact, that this was not a representative testing, but since i could not reproduce any performance gains by using operators at all. I suggest to avoid them, if possible.
通常,运算符 + 看起来像:
具有适当定义的构造函数。这允许编译器进行返回值优化。
但如果您正在针对 IA32 进行编译,那么 SIMD 就值得考虑,同时还需要更改算法以利用 SIMD 的特性。其他处理器可能具有 SIMD 样式指令。
Usually, operator + looks like:
with a suitably defined constructor. This allows the compiler to do return value optimisation.
But if you're compiling for IA32, then SIMD would be worth considering, along with changes to the algorithms to take advantage of the SIMD nature. Other processors may have SIMD style instructions.
我认为性能的差异是由这里的编译器优化引起的。在循环中添加数组元素可以由编译器进行矢量化。现代 CPU 具有在单个时钟周期内添加多个数字的指令,例如 SSE、SSE2 等。这似乎是您所看到的 3 倍差异的可能解释。
换句话说,在循环中添加两个数组的对应元素通常可能比添加类的对应成员更快。如果您将向量表示为类中的数组,而不是 x、y 和 z,则重载运算符可能会获得相同的加速。
I think the difference in performance is caused by the compiler optimization here. Adding up elements of arrays in a loop can be vectorized by the compiler. Modern CPUs have instructions for adding multiple numbers in a single clock tick, such as SSE, SSE2, etc. This seems to be a likely explanation for the factor of 3 difference that you are seeing.
In other words, adding corresponding elements of two arrays in a loop may generally be faster than adding corresponding members of a class. If you represent the vector as an array inside your class, rather than x, y, and z, you may get the same speedup for your overloaded operators.
Vector 运算符函数的实现是直接在头文件中还是在单独的 cpp 文件中?在头文件中,它们通常会内联到优化构建中。但如果它们是在不同的翻译单元中编译的,那么它们通常不会(取决于您的构建设置)。如果函数未内联,则编译器将无法执行您正在寻找的优化类型。
遇到此类情况,请查看反汇编。即使您对汇编代码不太了解,通常也很容易找出像这样的简单情况下的不同之处。
Are the implementations to your Vector operator functions directly in the header file or are they in a separate cpp file? In the header file they would typically be inlined in an optimized build. But if they are compiled in a different translation unit, then they often won't be (depending on your build settings). If the functions aren't inlined, then the compiler won't be able to do the type of optimization you are looking for.
In cases like these, have a look at the disassembly. Even if you don't know much about assembly code it's usually pretty easy to figure out what's different in simple cases like these.
实际上,如果您查看任何真正的矩阵代码,运算符+和运算符+=都不会这样做。
由于涉及复制,它们在表达式中引入了伪对象,并且仅在执行赋值时才执行真正的工作。使用像这样的惰性求值还允许在表达式求值期间删除 NULL 操作:
当然,这比我在这个简化示例中描述的要复杂得多。但是,如果您使用专为矩阵运算设计的库,那么它已经为您完成了。
Actually if you look at any real matrix code the operator+ and the operator+= don't do that.
Because of the copying involved they introduce a pseudo object into the expression and only do the real work when the assignment is executed. Using lazy evaluation like this also allows NULL operations to be removed during expression evaluation:
Of course this is a lot more complex than I have portrayed here in this simplified example. But if you use a library that has been designed for matrix operations then it will have already been done for you.
您的 Vector 实现:
像这样实现
operator+()
:并在类定义中添加
inline
运算符(这可以避免返回地址和方法参数的堆栈推送和弹出)对于每个方法调用,如果编译器发现它有用)。然后添加这个构造函数:
它可以让您非常有效地构造一个新向量(就像您在我的
operator+()
建议中所做的那样)!在使用您的向量的代码中:
您做了:
展开这种循环!在非常大循环中仅执行一个操作(因为它将被优化为使用SSE2/3扩展或类似的东西),效率非常低。你应该这样做:(
请注意,此代码未经测试,可能包含“逐一”错误左右......)
Your Vector implementation:
Implement the
operator+()
like this:and add the
inline
operator in your class definition (this avoids the stack pushs and pops of the return address and method arguments for each method call, if the compiler finds it useful).Then add this constructor:
which lets you construct a new vector very efficiently (like you would do in my
operator+()
suggestion)!In the code using your Vector:
You did:
Unroll this kind of loops! Doing only one operation (as it would be optimized to using the SSE2/3 extensions or something similar) in a very large loop is very inefficient. You should rather do something like this:
(Note that this code is untested and may contain a "off-by-one"-error or so...)
请注意,您问的是不同的事情,因为数据在内存中的处理方式不同。使用向量数组时,坐标交错为“x1,y1,z1,x2,y2,z2,...”,而使用双数组时,坐标为“x1,x2,...,y1,y2,...z1” ,z2...”。我认为这可能会对编译器优化或缓存处理方式产生影响。
Note that you are asking different things because the data is not disposed in the same way in memory. When using Vector array the coordinates are interleaved "x1,y1,z1,x2,y2,z2,...", while with the double arrays you have "x1,x2,...,y1,y2,...z1,z2...". I suppose this could have an impact on compiler optimizations or how the caching handles it.