高效的操作员+

发布于 2024-10-18 06:08:00 字数 1139 浏览 1 评论 0原文

我必须计算 3d 向量的大量总和，并且使用具有重载运算符 + 和运算符 * 的向量类与单独组件的求和进行比较，显示性能差异约为三倍。我知道假设差异一定是由于重载运算符中对象的构造造成的。

如何避免构造并提高性能？

我特别困惑，因为下面是基本上的标准方法，我希望编译器能够优化它。在现实生活中，求和不会在循环内完成，而是在相当大的表达式中（预可执行文件总共有几十MB）对不同的向量求和，这就是下面使用operator+的原因。

class Vector 
{
   double x,y,z;
   ...
   Vector&  
   Vector::operator+=(const Vector &v)
   {
       x += v.x;
       y += v.y;
       z += v.z;
       return *this;
   }

   Vector  
   Vector::operator+(const Vector &v)
   {
       return Vector(*this) += v; // bad: construction and copy(?)
   }

   ...
}

// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];

// assume xx, yy, zz and vec are properly initialized
Vector sum(0,0,0);
for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}

// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
    sumxx = sumxx + xx[i];
    sumyy = sumyy + yy[i];
    sumzz = sumzz + zz[i];
}

非常感谢任何帮助。

编辑： 感谢大家的宝贵意见，我现在的表现处于同一水平。 @Dima，尤其是@Xeo 的回答成功了。我希望我可以将多个答案标记为“已接受”。我也会测试其他一些建议。

原文

I have to compute large sums of 3d vectors and a comparison of using a vector class with overloaded operator+ and operator* versus summing up of separate components shows a performance difference of about a factor of three. I ~~know~~ assume the difference must be due to construction of objects in the overloaded operators.

How can one avoid the construction and improve performance?

I'm espacially puzzled, because the following is afaik basically the standard way to do it and I would expect the compiler to optimize this. In real life, the sums are not going to be done within a loop but in quite large expressions (several tens of MBs in total pre executable) summing up different vectors, this is why operator+ is used below.

class Vector 
{
   double x,y,z;
   ...
   Vector&  
   Vector::operator+=(const Vector &v)
   {
       x += v.x;
       y += v.y;
       z += v.z;
       return *this;
   }

   Vector  
   Vector::operator+(const Vector &v)
   {
       return Vector(*this) += v; // bad: construction and copy(?)
   }

   ...
}

// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];

// assume xx, yy, zz and vec are properly initialized
Vector sum(0,0,0);
for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}

// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
    sumxx = sumxx + xx[i];
    sumyy = sumyy + yy[i];
    sumzz = sumzz + zz[i];
}

Any help is greatly appreciated.

EDIT:
Thank you all for your great input, I have the performance now at the same level.
@Dima's and especially @Xeo's answer did the trick. I wish I could mark more than one answer "accepted". I'll test some of the other suggestions too.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

巡山小妖精 2024-10-25 06:08:00

这篇文章对如何优化+、-、*、/等运算符。
将 operator+ 实现为一个自由函数，就像按照 operator+= 一样：

Vector operator+(Vector lhs, Vector const& rhs){
    return lhs += rhs;
}

注意 lhs Vector 是一个副本而不是引用。这允许编译器进行优化，例如复制省略。
文章传达的一般规则是：如果您需要副本，请在参数中进行，以便编译器可以优化。本文没有使用此示例，而是使用用于复制和交换习惯用法的 operator=。

This article has a really good argumentation on how to optimize operators such as +, -, *, /.
Implement the operator+ as a free function like this in terms of operator+=:

Vector operator+(Vector lhs, Vector const& rhs){
    return lhs += rhs;
}

Notice on how the lhs Vector is a copy and not a reference. This allowes the compiler to make optimizations such as copy elision.
The general rule that article conveys: If you need a copy, do it in the parameters, so the compiler can optimize. The article doesn't use this example, but the operator= for the copy-and-swap idiom.

回复收藏 0 原文

昔梦 2024-10-25 06:08:00

为什么不替换

sum = sum + vec[i];

为

sum += vec[i];

... 这应该消除每次迭代对复制构造函数的两次调用和对赋值运算符的一次调用。

但一如既往，分析并了解费用来自何处，而不是猜测。

Why not replace

sum = sum + vec[i];

with

sum += vec[i];

... that should eliminate two calls to the copy constructor and one call to the assignment operator for each iteration.

But as always, profile and know where the expense is coming instead of guessing.

回复收藏 0 原文

可爱咩 2024-10-25 06:08:00

您可能对表达式模板感兴趣。

回复收藏 0 原文

救星 2024-10-25 06:08:00

我实现了这里提出的大部分优化，并将其与函数调用的性能进行了比较，例如

Vector::isSumOf( Vector v1, Vector v2)
{
    x = v1.x + v2.x;
    ...
}

以交替顺序对每个方法重复执行相同的循环并进行数十亿个向量求和，并没有带来预期的收益。

对于 bbtrb 发布的成员函数，此方法比 isSumOf() 函数调用多花费 50% 的时间。

免费的非成员运算符 + (Xeo) 方法所需的时间是 is SumOf() 函数的两倍（多 100%）。

(gcc 4.6.3 -O3)

我知道这不是一个具有代表性的测试，但因为我根本无法通过使用运算符来重现任何性能提升。如果可能的话，我建议避免使用它们。

I implemented most of the optimizations being proposed here and compared it with the performance of a function call like

Vector::isSumOf( Vector v1, Vector v2)
{
    x = v1.x + v2.x;
    ...
}

Repeatedly executing same loop with a few billion vector summations for every method in alternating order, did not result in the promised gains.

In case of the member function posted by bbtrb, this method took 50% more time than the isSumOf() function call.

Free, non member operator+ (Xeo) method needed up to double the time (100% more) of the is SumOf() function.

(gcc 4.6.3 -O3)

I aware of the fact, that this was not a representative testing, but since i could not reproduce any performance gains by using operators at all. I suggest to avoid them, if possible.

回复收藏 0 原文

帥小哥 2024-10-25 06:08:00

通常，运算符 + 看起来像：

return Vector (x + v.x, y + v.y, z + v.z);

具有适当定义的构造函数。这允许编译器进行返回值优化。

但如果您正在针对 IA32 进行编译，那么 SIMD 就值得考虑，同时还需要更改算法以利用 SIMD 的特性。其他处理器可能具有 SIMD 样式指令。

Usually, operator + looks like:

return Vector (x + v.x, y + v.y, z + v.z);

with a suitably defined constructor. This allows the compiler to do return value optimisation.

But if you're compiling for IA32, then SIMD would be worth considering, along with changes to the algorithms to take advantage of the SIMD nature. Other processors may have SIMD style instructions.

回复收藏 0 原文

忘年祭陌 2024-10-25 06:08:00

我认为性能的差异是由这里的编译器优化引起的。在循环中添加数组元素可以由编译器进行矢量化。现代 CPU 具有在单个时钟周期内添加多个数字的指令，例如 SSE、SSE2 等。这似乎是您所看到的 3 倍差异的可能解释。

换句话说，在循环中添加两个数组的对应元素通常可能比添加类的对应成员更快。如果您将向量表示为类中的数组，而不是 x、y 和 z，则重载运算符可能会获得相同的加速。

回复收藏 0 原文

笑脸一如从前 2024-10-25 06:08:00

Vector 运算符函数的实现是直接在头文件中还是在单独的 cpp 文件中？在头文件中，它们通常会内联到优化构建中。但如果它们是在不同的翻译单元中编译的，那么它们通常不会（取决于您的构建设置）。如果函数未内联，则编译器将无法执行您正在寻找的优化类型。

遇到此类情况，请查看反汇编。即使您对汇编代码不太了解，通常也很容易找出像这样的简单情况下的不同之处。

回复收藏 0 原文

始于初秋 2024-10-25 06:08:00

实际上，如果您查看任何真正的矩阵代码，运算符+和运算符+=都不会这样做。

由于涉及复制，它们在表达式中引入了伪对象，并且仅在执行赋值时才执行真正的工作。使用像这样的惰性求值还允许在表达式求值期间删除 NULL 操作：

class Matrix;
class MatrixOp
{
    public: virtual void DoOperation(Matrix& resultInHere) = 0;
};

class Matrix
{
    public:
        void operator=(MatrixOp* op)
        {
            // No copying has been done.
            // You have built an operation tree.
            // Now you are goign to evaluate the expression and put the
            // result into *this
            op->DoOperation(*this);
        }
        MatrixOp* operator+(Matrix& rhs)  { return new MatrixOpPlus(*this,rhs);}
        MatrixOp* operator+(MatrixOp* rhs){ return new MatrixOpPlus(*this,rhs);}
        // etc
};

当然，这比我在这个简化示例中描述的要复杂得多。但是，如果您使用专为矩阵运算设计的库，那么它已经为您完成了。

Actually if you look at any real matrix code the operator+ and the operator+= don't do that.

Because of the copying involved they introduce a pseudo object into the expression and only do the real work when the assignment is executed. Using lazy evaluation like this also allows NULL operations to be removed during expression evaluation:

class Matrix;
class MatrixOp
{
    public: virtual void DoOperation(Matrix& resultInHere) = 0;
};

class Matrix
{
    public:
        void operator=(MatrixOp* op)
        {
            // No copying has been done.
            // You have built an operation tree.
            // Now you are goign to evaluate the expression and put the
            // result into *this
            op->DoOperation(*this);
        }
        MatrixOp* operator+(Matrix& rhs)  { return new MatrixOpPlus(*this,rhs);}
        MatrixOp* operator+(MatrixOp* rhs){ return new MatrixOpPlus(*this,rhs);}
        // etc
};

Of course this is a lot more complex than I have portrayed here in this simplified example. But if you use a library that has been designed for matrix operations then it will have already been done for you.

回复收藏 0 原文

失与倦＂ 2024-10-25 06:08:00

您的 Vector 实现：

像这样实现 operator+()：

Vector  
Vector::operator+(const Vector &v)
{
    return Vector(x + v.x, y + v.y, z + v.z);
}

并在类定义中添加 inline 运算符（这可以避免返回地址和方法参数的堆栈推送和弹出）对于每个方法调用，如果编译器发现它有用）。

然后添加这个构造函数：

Vector::Vector(const double &x, const double &y, const double &z)
    : x(x), y(y), z(z)
{
}

它可以让您非常有效地构造一个新向量（就像您在我的 operator+() 建议中所做的那样）！

在使用您的向量的代码中：

您做了：

for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}

展开这种循环！在非常大循环中仅执行一个操作（因为它将被优化为使用SSE2/3扩展或类似的东西），效率非常低。你应该这样做：（

//Unrolled loop:
for(int i = 0; i <= N - 10; i += 10)
{
    sum = sum + vec[i];
              + vec[i+1];
              + vec[i+2];
              + vec[i+3];
              + vec[i+4];
              + vec[i+5];
              + vec[i+6];
              + vec[i+7];
              + vec[i+8];
              + vec[i+9];
}

//Doing the "rest":
for(int i = (N / 10) * 10; i < N; ++i)
{
    sum = sum + vec[i];
}

请注意，此代码未经测试，可能包含“逐一”错误左右......）

Your Vector implementation:

Implement the operator+() like this:

Vector  
Vector::operator+(const Vector &v)
{
    return Vector(x + v.x, y + v.y, z + v.z);
}

and add the inline operator in your class definition (this avoids the stack pushs and pops of the return address and method arguments for each method call, if the compiler finds it useful).

Then add this constructor:

Vector::Vector(const double &x, const double &y, const double &z)
    : x(x), y(y), z(z)
{
}

which lets you construct a new vector very efficiently (like you would do in my operator+() suggestion)!

In the code using your Vector:

You did:

for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}

Unroll this kind of loops! Doing only one operation (as it would be optimized to using the SSE2/3 extensions or something similar) in a very large loop is very inefficient. You should rather do something like this:

//Unrolled loop:
for(int i = 0; i <= N - 10; i += 10)
{
    sum = sum + vec[i];
              + vec[i+1];
              + vec[i+2];
              + vec[i+3];
              + vec[i+4];
              + vec[i+5];
              + vec[i+6];
              + vec[i+7];
              + vec[i+8];
              + vec[i+9];
}

//Doing the "rest":
for(int i = (N / 10) * 10; i < N; ++i)
{
    sum = sum + vec[i];
}

(Note that this code is untested and may contain a "off-by-one"-error or so...)

回复收藏 0 原文