Why isn't my SSE code faster than the C/C++ code?
I just started to use SSE to optimize my code for a computer vision project, aiming at detecting skin color in an image. Below is my function. The function takes a color image, looks at each pixel, and returns a probability map. The commented-out code was my original C++ implementation and the rest is the SSE version. I timed both of them and it is weird to find out SSE isn't any faster than my original C++ code. Any suggestions about what's going on, or how to optimize the function further?
void EvalSkinProb(const Mat& cvmColorImg, Mat& cvmProb)
{
    std::clock_t ts = std::clock();
    Mat cvmHSV = Mat::zeros(cvmColorImg.rows, cvmColorImg.cols, CV_8UC3);
    cvtColor(cvmColorImg, cvmHSV, CV_BGR2HSV);
    std::clock_t te1 = std::clock();

    float fFG, fBG;
    double dp;
    __declspec(align(16)) int frgb[4]  = {0};
    __declspec(align(16)) int fBase[4] = {g_iLowHue, g_iLowSat, g_iLowVal, 0};
    __declspec(align(16)) int fIndx[4] = {0};

    __m128i* pSrc1 = (__m128i*) frgb;
    __m128i* pSrc2 = (__m128i*) fBase;
    __m128i* pDest = (__m128i*) fIndx;
    __m128i m1;

    for (int y = 0; y < cvmColorImg.rows; y++)
    {
        for (int x = 0; x < cvmColorImg.cols; x++)
        {
            cv::Vec3b hsv = cvmHSV.at<cv::Vec3b>(y, x);
            frgb[0] = hsv[0]; frgb[1] = hsv[1]; frgb[2] = hsv[2];

            m1 = _mm_sub_epi32(*pSrc1, *pSrc2);
            *pDest = _mm_srli_epi32(m1, g_iSValPerbinBit);

            // c++ code
            //fIndx[0] = ((hsv[0]-g_iLowHue)>>g_iSValPerbinBit);
            //fIndx[1] = ((hsv[1]-g_iLowSat)>>g_iSValPerbinBit);
            //fIndx[2] = ((hsv[2]-g_iLowVal)>>g_iSValPerbinBit);

            fFG = m_cvmSkinHist.at<float>(fIndx[0], fIndx[1], fIndx[2]);
            fBG = m_cvmBGHist.at<float>(fIndx[0], fIndx[1], fIndx[2]);
            dp = (double)fFG / (fBG + fFG);
            cvmProb.at<double>(y, x) = dp;
        }
    }

    std::clock_t te2 = std::clock();
    double dSecs1 = (double)(te1 - ts) / CLOCKS_PER_SEC;
    double dSecs2 = (double)(te2 - te1) / CLOCKS_PER_SEC;
}
3 Answers
The first problem here is that you're doing very little SSE work for a tremendous amount of data movement. You'll spend most of the time just packing/unpacking data into and out of the SSE registers, all for only two actual SSE instructions...
Secondly, there is a very subtle performance penalty lurking in this code.
You are using aligned buffers to transfer data between ordinary variables and SSE registers. This is a BIG NO-NO.
The reason lies in the CPU's load/store unit. When you write data to a memory location and then immediately try to read it back at a different word size, store-to-load forwarding usually fails, and the data is flushed all the way out to cache and re-read. This can incur a penalty of 20+ cycles.
This is because CPU load/store units are not optimized for this kind of unusual access pattern.
I'm not too familiar with OpenCV, but I suspect you are only going to get decent throughput if you ensure the data you're accessing is already aligned outside the loop, rather than loading it unaligned inside the loop.
Not tested, but it should give you some ideas: