OpenCV CUDA runs slower than OpenCV CPU
I've been struggling to get OpenCV CUDA to improve performance for operations like erode/dilate and frame differencing when reading video from an AVI file. Typically I get half the FPS on the GPU (GTX 580) compared to the CPU (AMD 955BE). Before you ask whether I'm measuring FPS correctly: you can clearly see the lag on the GPU with the naked eye, especially when using a high erode/dilate level.
It seems that I'm not reading the frames in parallel? Here is the code:
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/video/tracking.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <stdlib.h>
#include <stdio.h>

using namespace cv;
using namespace cv::gpu;

Mat cpuSrc;
GpuMat src, dst;

int element_shape = MORPH_RECT;
// the address of variable which receives trackbar position update
int max_iters = 10;
int open_close_pos = 0;
int erode_dilate_pos = 0;

// callback function for open/close trackbar
void OpenClose(int)
{
    IplImage disp;
    Mat temp;
    int n = open_close_pos - max_iters;
    int an = n > 0 ? n : -n;
    Mat element = getStructuringElement(element_shape, Size(an*2+1, an*2+1), Point(an, an));
    if (n < 0)
        cv::gpu::morphologyEx(src, dst, CV_MOP_OPEN, element);
    else
        cv::gpu::morphologyEx(src, dst, CV_MOP_CLOSE, element);
    dst.download(temp);          // copy the result back to host memory
    disp = temp;
    // cvShowImage("Open/Close", &disp);
}

// callback function for erode/dilate trackbar
void ErodeDilate(int)
{
    IplImage disp;
    Mat temp;
    int n = erode_dilate_pos - max_iters;
    int an = n > 0 ? n : -n;
    Mat element = getStructuringElement(element_shape, Size(an*2+1, an*2+1), Point(an, an));
    if (n < 0)
        cv::gpu::erode(src, dst, element);
    else
        cv::gpu::dilate(src, dst, element);
    dst.download(temp);          // copy the result back to host memory
    disp = temp;
    cvShowImage("Erode/Dilate", &disp);
}

int main(int argc, char** argv)
{
    VideoCapture capture("TwoManLoiter.avi");

    // create windows for output images
    namedWindow("Open/Close", 1);
    namedWindow("Erode/Dilate", 1);

    open_close_pos = 3;
    erode_dilate_pos = 0;
    createTrackbar("iterations", "Open/Close", &open_close_pos, max_iters*2+1, NULL);
    createTrackbar("iterations", "Erode/Dilate", &erode_dilate_pos, max_iters*2+1, NULL);

    for (;;)
    {
        capture >> cpuSrc;
        if (cpuSrc.empty())      // stop when the video ends
            break;
        src.upload(cpuSrc);      // copy the frame to device memory
        GpuMat grey;
        cv::gpu::cvtColor(src, grey, CV_BGR2GRAY);
        src = grey;

        int c;
        ErodeDilate(erode_dilate_pos);
        c = cvWaitKey(25);
        if ((char)c == 27)
            break;
    }
    return 0;
}
The CPU implementation is the same, minus the cv::gpu namespace and using Mat instead of GpuMat, of course.
Thanks
1 Answer
My guess would be that the performance gain from the GPU erode/dilate is outweighed by the memory operations of transferring the image to and from the GPU every frame. Keep in mind that memory bandwidth is a crucial factor for GPGPU algorithms, and even more so the bandwidth between CPU and GPU.
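As a quick sanity check (my sketch, not from the original post), you could time the upload, the kernel, and the download separately for one frame; for a cheap operation like a single erode the transfers usually dominate. The helper name elapsed_ms below is purely illustrative, and the input is assumed to already be the single-channel grayscale frame, since the 2.x gpu::erode only handles 1- or 4-channel images:

// Rough timing sketch, assuming the same OpenCV 2.x gpu module as the question.
// Splits one iteration into upload, kernel, and download so you can see which
// part of the pipeline eats the time.
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cstdio>

static double elapsed_ms(int64 start)            // illustrative helper, not from the original code
{
    return (cv::getTickCount() - start) * 1000.0 / cv::getTickFrequency();
}

void timeOneFrame(const cv::Mat& greyFrame, const cv::Mat& element)
{
    cv::gpu::GpuMat d_src, d_dst;

    int64 t = cv::getTickCount();
    d_src.upload(greyFrame);                      // host -> device copy
    printf("upload:   %.2f ms\n", elapsed_ms(t));

    t = cv::getTickCount();
    cv::gpu::erode(d_src, d_dst, element);        // the actual GPU work
    printf("erode:    %.2f ms\n", elapsed_ms(t));

    t = cv::getTickCount();
    cv::Mat result;
    d_dst.download(result);                       // device -> host copy
    printf("download: %.2f ms\n", elapsed_ms(t));
}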
EDIT: To optimize it, you might write your own image display routine (instead of cvShowImage) that uses OpenGL and just displays the image as an OpenGL texture. In that case you don't need to read the processed image back from the GPU to the CPU, and you can directly use an OpenGL texture/buffer as a CUDA image/buffer, so you don't even need to copy the image inside the GPU. But then you may have to manage the CUDA resources yourself. With this method you could also use PBOs to upload the video into the texture and benefit a bit from the asynchronicity.
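If you want to stay within OpenCV, a lighter-weight variant of this idea (a sketch on my part, assuming a 2.4-era build compiled with OpenGL support, WITH_OPENGL=ON) is to create the window with the OpenGL flag and pass the GpuMat to imshow directly, so the processed frame is never downloaded just for display. The window and file names are placeholders taken from the question:

// Sketch: display a GpuMat without downloading it, assuming OpenCV was built
// with OpenGL support so imshow accepts gpu::GpuMat for OpenGL windows.
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    cv::VideoCapture capture("TwoManLoiter.avi");
    cv::namedWindow("Erode/Dilate", CV_WINDOW_OPENGL);    // OpenGL-backed window

    cv::Mat frame;
    cv::gpu::GpuMat d_frame, d_grey, d_dst;
    cv::Mat element = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(7, 7));

    for (;;)
    {
        capture >> frame;
        if (frame.empty())
            break;

        d_frame.upload(frame);                            // one upload per frame
        cv::gpu::cvtColor(d_frame, d_grey, CV_BGR2GRAY);
        cv::gpu::erode(d_grey, d_dst, element);

        cv::imshow("Erode/Dilate", d_dst);                // no download: the GpuMat is
                                                          // shown straight from device memory
        if ((char)cv::waitKey(25) == 27)
            break;
    }
    return 0;
}

This still pays for the per-frame upload, but it removes the download and the extra host-side copy that cvShowImage needs; whether that closes the gap to the CPU version depends on the frame size and on how expensive the morphology kernel actually is.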