OpenCV CUDA running slower than OpenCV CPU

I've been struggling to get OpenCV CUDA to improve performance for things like erode/dilate, frame differencing, etc. when I read in a video from an AVI file. Typically I get half the FPS on the GPU (GTX 580) compared to the CPU (AMD 955BE). Before you ask whether I'm measuring FPS correctly: you can clearly see the lag on the GPU with the naked eye, especially when using a high erode/dilate level.

It seems that I'm not reading in the frames in parallel? Here is the code:

#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/video/tracking.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <stdlib.h>
#include <stdio.h>

using namespace cv;
using namespace cv::gpu;

Mat cpuSrc;
GpuMat src, dst;

int element_shape = MORPH_RECT;

// variables that receive the trackbar position updates
int max_iters = 10;
int open_close_pos = 0;
int erode_dilate_pos = 0;

// callback function for the open/close trackbar
void OpenClose(int)
{
    Mat temp;
    int n = open_close_pos - max_iters;
    int an = n > 0 ? n : -n;
    Mat element = getStructuringElement(element_shape, Size(an*2+1, an*2+1), Point(an, an));
    if (n < 0)
        cv::gpu::morphologyEx(src, dst, CV_MOP_OPEN, element);
    else
        cv::gpu::morphologyEx(src, dst, CV_MOP_CLOSE, element);

    dst.download(temp);              // GPU -> CPU copy for display
    // imshow("Open/Close", temp);
}

// callback function for the erode/dilate trackbar
void ErodeDilate(int)
{
    Mat temp;
    int n = erode_dilate_pos - max_iters;
    int an = n > 0 ? n : -n;
    Mat element = getStructuringElement(element_shape, Size(an*2+1, an*2+1), Point(an, an));
    if (n < 0)
        cv::gpu::erode(src, dst, element);
    else
        cv::gpu::dilate(src, dst, element);
    dst.download(temp);              // GPU -> CPU copy for display
    imshow("Erode/Dilate", temp);
}


int main(int argc, char** argv)
{
    VideoCapture capture("TwoManLoiter.avi");
    if (!capture.isOpened())
        return -1;

    // create windows for the output images
    namedWindow("Open/Close", 1);
    namedWindow("Erode/Dilate", 1);

    open_close_pos = 3;
    erode_dilate_pos = 0;
    createTrackbar("iterations", "Open/Close", &open_close_pos, max_iters*2+1, NULL);
    createTrackbar("iterations", "Erode/Dilate", &erode_dilate_pos, max_iters*2+1, NULL);

    for (;;)
    {
        capture >> cpuSrc;
        if (cpuSrc.empty())          // end of the video
            break;
        src.upload(cpuSrc);          // CPU -> GPU copy, every frame
        GpuMat grey;
        cv::gpu::cvtColor(src, grey, CV_BGR2GRAY);
        src = grey;

        ErodeDilate(erode_dilate_pos);
        int c = waitKey(25);
        if ((char)c == 27)           // ESC quits
            break;
    }

    return 0;
}

The CPU implementation is identical, except that it drops the cv::gpu namespace and uses Mat instead of GpuMat, of course.
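For reference, a sketch of what the equivalent CPU loop body reduces to (same globals as above; note there is no upload/download step anywhere):

Mat greyCpu, dstCpu;
cvtColor(cpuSrc, greyCpu, CV_BGR2GRAY);    // colour conversion on the CPU
int n = erode_dilate_pos - max_iters;
int an = n > 0 ? n : -n;
Mat element = getStructuringElement(element_shape, Size(an*2+1, an*2+1), Point(an, an));
if (n < 0)
    cv::erode(greyCpu, dstCpu, element);   // plain CPU erode
else
    cv::dilate(greyCpu, dstCpu, element);  // plain CPU dilate
imshow("Erode/Dilate", dstCpu);            // no PCIe transfer needed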

Thanks

Comments (1)

夜无邪 2024-12-13 03:49:09

My guess would be that the performance gain from the GPU erode/dilate is outweighed by the memory operations of transferring the image to and from the GPU every frame. Keep in mind that memory bandwidth is a crucial factor in GPGPU algorithms, and even more so the bandwidth between CPU and GPU.
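To confirm where the time goes, you could time the three stages of each frame separately. A rough sketch using the variables from the question's code (the non-Stream cv::gpu calls block until the kernel finishes, so wall-clock timing is meaningful here):

double f = cv::getTickFrequency();

int64 t0 = cv::getTickCount();
src.upload(cpuSrc);                    // CPU -> GPU over PCIe
int64 t1 = cv::getTickCount();
cv::gpu::erode(src, dst, element);     // the actual GPU kernel
int64 t2 = cv::getTickCount();
dst.download(temp);                    // GPU -> CPU over PCIe
int64 t3 = cv::getTickCount();

printf("upload %.2f ms, kernel %.2f ms, download %.2f ms\n",
       (t1 - t0) * 1000.0 / f,
       (t2 - t1) * 1000.0 / f,
       (t3 - t2) * 1000.0 / f);

If the two transfer times dominate the kernel time, the PCIe copies, not the erode itself, are the bottleneck.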

EDIT: To optimize it, you might write your own image display routine (instead of imshow) that uses OpenGL and just displays the image as an OpenGL texture. In that case you don't need to read the processed image back from the GPU to the CPU, and you can directly use an OpenGL texture/buffer as a CUDA image/buffer, so you don't even need to copy the image inside the GPU. But then you may have to manage the CUDA resources yourself. With this method you could also use PBOs to upload the video into the texture and profit a bit from the asynchrony.
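A minimal sketch of that display path, assuming OpenCV was built with OpenGL support (WITH_OPENGL); in the 2.4.x API a window created with CV_WINDOW_OPENGL can display a GpuMat directly:

namedWindow("Erode/Dilate", CV_WINDOW_OPENGL);  // create once, before the loop

// per frame, instead of download + imshow on a Mat:
cv::gpu::erode(src, dst, element);
imshow("Erode/Dilate", dst);  // rendered as an OpenGL texture, no GPU -> CPU copy

The trackbar-driven processing stays the same; only the dst.download(temp) step disappears.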
