Worse performance using Eigen than using my own class
A couple of weeks ago I asked a question about the performance of matrix multiplication.
I was told that in order to enhance the performance of my program I should use some specialised matrix classes rather than my own class.
StackOverflow users recommended:
- uBLAS
- EIGEN
- BLAS
At first I wanted to use uBLAS; however, after reading the documentation it turned out that this library doesn't support matrix-matrix multiplication.
In the end I decided to use the Eigen library. So I replaced my matrix class with Eigen::MatrixXd, but it turned out that my application now runs even slower than before.
The run time before using Eigen was 68 seconds; after switching my matrix class to Eigen matrices, the program runs for 87 seconds.
The parts of the program that take the most time look like this:
TemplateClusterBase* TemplateClusterBase::TransformTemplateOne( vector<Eigen::MatrixXd*>& pointVector, Eigen::MatrixXd& rotation, Eigen::MatrixXd& scale, Eigen::MatrixXd& translation )
{
    for (int i = 0; i < pointVector.size(); i++)
    {
        //Eigen::MatrixXd outcome =
        Eigen::MatrixXd outcome = (rotation*scale) * (*pointVector[i]) + translation;
        //delete prototypePointVector[i]; // ((rotation*scale)* (*prototypePointVector[i]) + translation).ConvertToPoint();
        MatrixHelper::SetX(*prototypePointVector[i], MatrixHelper::GetX(outcome));
        MatrixHelper::SetY(*prototypePointVector[i], MatrixHelper::GetY(outcome));
        //assosiatedPointIndexVector[i] = prototypePointVector[i]->associatedTemplateIndex = i;
    }
    return this;
}
and
Eigen::MatrixXd AlgorithmPointBased::UpdateTranslationMatrix( int clusterIndex )
{
    double membershipSum = 0, outcome = 0;
    double currentPower = 0;
    Eigen::MatrixXd outcomePoint = Eigen::MatrixXd(2,1);
    outcomePoint << 0,0;
    Eigen::MatrixXd templatePoint;
    for (int i = 0; i < imageDataVector.size(); i++)
    {
        currentPower = 0;
        membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i], m);
        outcomePoint.noalias() += (*imageDataVector[i] - (prototypeVector[clusterIndex]->rotationMatrix*prototypeVector[clusterIndex]->scalingMatrix* ( *templateCluster->templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]) ))*currentPower;
    }
    outcomePoint.noalias() = outcomePoint/=membershipSum;
    return outcomePoint; //.ConvertToMatrix();
}
As you can see, these functions perform a lot of matrix operations. That is why I thought using Eigen would speed up my application. Unfortunately (as I mentioned above), the program runs slower.
Is there any way to speed up these functions?
Maybe I would get better performance if I used DirectX matrix operations? (However, I have a laptop with an integrated graphics card.)
Make sure to have compiler optimization switched on (e.g. at least -O2 on gcc). Eigen is heavily templated and will not perform very well if you don't turn on optimization.
If you're using Eigen's MatrixXd types, those are dynamically sized. You should get much better results from using the fixed-size types, e.g. Matrix4d, Vector4d. Also, make sure you're compiling such that the code can get vectorized; see the relevant Eigen documentation.
Re your thought on using the Direct3D extensions library stuff (D3DXMATRIX etc.): it's OK (if a bit old-fashioned) for graphics geometry (4x4 transforms etc.), but it's certainly not GPU accelerated (just good old SSE, I think). Also, note that it's single (float) precision only, and you seem to be set on using doubles. Personally I'd much prefer to use Eigen unless I was actually coding a Direct3D app.
Which version of Eigen are you using? They recently released 3.0.1, which is supposed to be faster than 2.x. Also, make sure you play a bit with the compiler options. For example, make sure SSE is being used in Visual Studio:
You should profile first, then optimize the algorithm, and only then the implementation. In particular, the posted code is quite inefficient: the loop in TransformTemplateOne recomputes rotation*scale for every point, even though that product never changes.
I don't know the library, so I won't even try to guess the number of unnecessary temporaries that you are creating, but a simple refactor, computing rotation*scale once before the loop and reusing the result, can save you a good number of expensive multiplications (and, again, probably new temporary matrices that get discarded right away).
A couple of points.
- Why are you multiplying rotation*scale inside of the loop when that product will have the same value each iteration? That is a lot of wasted effort.
- You are using dynamically sized matrices rather than fixed sized matrices. Someone else mentioned this already, and you said you shaved off 2 sec.
- You are passing arguments as a vector of pointers to matrices. This adds an extra pointer indirection and destroys any guarantee of data locality, which will give poor cache performance.
- I hope this isn't insulting, but are you compiling in Release or Debug? Eigen is very slow in debug builds, because it uses lots of trivial templated functions that are optimized out of release but remain in debug.
Looking at your code, I am hesitant to blame Eigen for performance problems. However, most linear algebra libraries (including Eigen) are not really designed for your use case of lots of tiny matrices. In general, Eigen will be better optimized for 100x100 or larger matrices. You very well may be better off using your own matrix class or the DirectX math helper classes. The DirectX math classes are completely independent from your video card.
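To make the "own matrix class" option concrete, here is a minimal sketch in plain C++ that combines the points above: fixed-size 2D types, points stored contiguously by value, and the rotation*scale product computed once. The names `Vec2`, `Mat2`, `multiply`, `apply` and `transformAll` are all hypothetical; nothing here comes from the asker's code or from DirectX.

```cpp
#include <vector>

// Hypothetical minimal 2D types standing in for a hand-written matrix class.
struct Vec2 { double x, y; };
struct Mat2 { double a, b, c, d; };  // row-major layout: [a b; c d]

// 2x2 matrix product l * r.
inline Mat2 multiply(const Mat2& l, const Mat2& r)
{
    return { l.a * r.a + l.b * r.c, l.a * r.b + l.b * r.d,
             l.c * r.a + l.d * r.c, l.c * r.b + l.d * r.d };
}

// Matrix-vector product m * v.
inline Vec2 apply(const Mat2& m, const Vec2& v)
{
    return { m.a * v.x + m.b * v.y, m.c * v.x + m.d * v.y };
}

// Transforms every point in place. The points sit next to each other in one
// std::vector<Vec2>, so the loop walks memory linearly with no pointer chasing.
void transformAll(std::vector<Vec2>& points,
                  const Mat2& rotation, const Mat2& scale, const Vec2& translation)
{
    const Mat2 transform = multiply(rotation, scale);  // hoisted out of the loop
    for (Vec2& p : points) {
        const Vec2 q = apply(transform, p);
        p.x = q.x + translation.x;
        p.y = q.y + translation.y;
    }
}
```

Whether this actually beats Eigen's fixed-size types for a given workload is something only profiling an optimized build can tell.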
Looking back at your previous post and the code in there, my suggestion would be to use your old code, but improve its efficiency by moving things around. I'm posting on that previous question to keep the answers separate.