Qt and CUDA Visual Profiler memory transfer size error

Posted on 2024-11-03 14:11:11


I've prepared a .pro file to use Qt and CUDA on a Linux machine (64-bit). When I run the application in the CUDA profiler, the app executes 12 times, but before presenting the results I get the following error:

Error in profiler data file '/home/myusername/development/qtspace/bin/temp_compute_profiler_0_0.csv' at line number 6 for column 'memory transfer size'.

The main.cpp file is as simple as:

#include <QtCore/QCoreApplication> 
extern "C"
void runCudaPart();

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    runCudaPart();
    return 0;
}

The fact is that if I remove the "QCoreApplication a(argc, argv);" line, the CUDA Visual Profiler works as expected and shows all the results.

I've checked that the cuda_profile.log is generated from the command line if I export the CUDA_PROFILE=1 environment variable. The comma-separated file is also generated if I export the COMPUTE_PROFILE_CSV=1 variable, but the CUDA Visual Profiler crashes when I try to import that file.
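
For reference, a command-line profiling run of this kind looks roughly as follows (the binary name is illustrative; the environment variables are the ones mentioned above):

# enable the command-line CUDA profiler and CSV output
export CUDA_PROFILE=1
export COMPUTE_PROFILE_CSV=1
./myapp    # the profiler writes its log (CSV-formatted when COMPUTE_PROFILE_CSV=1) into the working directory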

Any hints about this issue? It seems to be related to the CUDA Visual Profiler application rather than to the code.

If you are wondering why I made such a simple main.cpp with Qt but without actually using Qt :P, it is because I would like to improve the framework in the future to add a GUI.

// details of CUDA, GPU, OS, QT, and compiler versions

  Device"GeForce GTX 480"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    2.0
  OS: Ubuntu 10.04 LTS
  QT_VERSION: 263682
  QT_VERSION_STR: 4.6.2
  gcc version 4.4.3
  nvcc compilation tool, release 3.2, V0.2.122

I've noticed that the problem is with the QCoreApplication constructor. It does something with the arguments. If I modify the line to:

QCoreApplication a();

the Visual Profiler works as expected. It is hard to know what is happening and whether this change will be a problem in the future. Any hints?
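
As a side note, in C++ the line QCoreApplication a(); is parsed as the declaration of a function named a returning a QCoreApplication (the "most vexing parse"), not as the definition of an object, so with that change no QCoreApplication instance is actually constructed, which would explain why it behaves the same as removing the line. A minimal sketch that constructs the object without handing it the real command-line arguments (the dummy names below are illustrative, not taken from the original code) could look like this:

// Sketch: construct QCoreApplication with dummy arguments so the real argv
// is never touched; the argc/argv passed in must outlive the application object.
int dummyArgc = 1;
char appName[] = "qtcuda";           // illustrative application name
char *dummyArgv[] = { appName, 0 };
QCoreApplication app(dummyArgc, dummyArgv);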

Regarding the QCoreApplication constructor, the example also works if I call the CUDA part before constructing the QCoreApplication.

// this way the example works.
runCudaPart();
QCoreApplication a(argc, argv);

Thanks in advance.


Comments (2)

七度光 2024-11-10 14:11:11


I can't reproduce this with CUDA 3.2 and Qt 4 on a 64-bit Ubuntu 10.04 LTS system. I took this main:

#include <QtCore/QCoreApplication>

extern float cudamain();

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    float gflops = cudamain();

    return 0;
}

and a cudamain() containing this:

#include <assert.h>

#define blocksize 16
#define HM (4096) 
#define WM (4096) 
#define WN (4096)
#define HN WM 
#define WP WN   
#define HP HM  
#define PTH WM
#define PTW HM

__global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;
    int rowM=ty+by*blocksize;
    int colN=tx+bx*blocksize;
    float Pvalue=0;

    for(int m=0; m<uWM; m+=blocksize){
        MS[ty][tx]=M[rowM*uWM+(m+tx)] ;
        NS[ty][tx]=M[colN + uWN*(m+ty)];
        __syncthreads();
        for(int k=0;k<blocksize;k++)
            Pvalue+=MS[ty][k]*NS[k][tx];
        __syncthreads();
    }
    P[rowM*WP+colN]=Pvalue;
}

inline void gpuerrorchk(cudaError_t state)
{
    assert(state == cudaSuccess);
}

float cudamain(){

    cudaEvent_t evstart, evstop;
    cudaEventCreate(&evstart);
    cudaEventCreate(&evstop);

    float*M=(float*)malloc(sizeof(float)*HM*WM);
    float*N=(float*)malloc(sizeof(float)*HN*WN);

    for(int i=0;i<WM*HM;i++)
        M[i]=(float)i;
    for(int i=0;i<WN*HN;i++)
        N[i]=(float)i;

    float*P=(float*)malloc(sizeof(float)*HP*WP);

    float *Md,*Nd,*Pd;
    gpuerrorchk( cudaMalloc((void**)&Md,HM*WM*sizeof(float)) );
    gpuerrorchk( cudaMalloc((void**)&Nd,HN*WN*sizeof(float)) );
    gpuerrorchk( cudaMalloc((void**)&Pd,HP*WP*sizeof(float)) );

    gpuerrorchk( cudaMemcpy(Md,M,HM*WM*sizeof(float),cudaMemcpyHostToDevice) );
    gpuerrorchk( cudaMemcpy(Nd,N,HN*WN*sizeof(float),cudaMemcpyHostToDevice) );

    dim3 dimBlock(blocksize,blocksize);//(tile_width , tile_width);
    dim3 dimGrid(WN/dimBlock.x,HM/dimBlock.y);//(width/tile_width , width/tile_witdh);

    gpuerrorchk( cudaEventRecord(evstart,0) );

    nonsquare<<<dimGrid,dimBlock>>>(Md,Nd,Pd,WM, WN);
    gpuerrorchk( cudaPeekAtLastError() );

    gpuerrorchk( cudaEventRecord(evstop,0) );
    gpuerrorchk( cudaEventSynchronize(evstop) );
    float time;
    cudaEventElapsedTime(&time,evstart,evstop);

    gpuerrorchk( cudaMemcpy(P,Pd,WP*HP*sizeof(float),cudaMemcpyDeviceToHost) );

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);

    float gflops=(2.e-6*WM*WM*WM)/(time);

    cudaThreadExit();

    return gflops;

}

(pay no attention to the actual code beyond the fact that it performs memory transactions and runs a kernel; it is nonsense otherwise).

Compiling the code like this:

cuda:~$ nvcc -arch=sm_20 -c -o cudamain.o cudamain.cu 
cuda:~$ g++ -o qtprob -I/usr/include/qt4 qtprob.cc cudamain.o -L $CUDA_INSTALL_PATH/lib64 -lQtCore -lcuda -lcudart
cuda:~$ ldd qtprob
        linux-vdso.so.1 =>  (0x00007fff242c8000)
        libQtCore.so.4 => /opt/cuda-3.2/computeprof/bin/libQtCore.so.4 (0x00007fbe62344000)
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007fbe61a3d000)
        libcudart.so.3 => /opt/cuda-3.2/lib64/libcudart.so.3 (0x00007fbe617ef000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fbe614db000)
        libm.so.6 => /lib/libm.so.6 (0x00007fbe61258000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fbe61040000)
        libc.so.6 => /lib/libc.so.6 (0x00007fbe60cbd000)
        libz.so.1 => /lib/libz.so.1 (0x00007fbe60aa6000)
        libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0x00007fbe608a0000)
        libglib-2.0.so.0 => /lib/libglib-2.0.so.0 (0x00007fbe605c2000)
        librt.so.1 => /lib/librt.so.1 (0x00007fbe603ba000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00007fbe6019c000)
        libdl.so.2 => /lib/libdl.so.2 (0x00007fbe5ff98000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbe626c0000)
        libpcre.so.3 => /lib/libpcre.so.3 (0x00007fbe5fd69000)

produces an executable which profiles without error as many times as I care to run it with the CUDA 3.2 release profiler.

All I can suggest is to try my repro case and see whether it works or not. If it fails, then perhaps you have a broken CUDA or Qt installation. If it doesn't fail (and I suspect it won't), then the problem lies either in the way you are building the Qt project or in the actual CUDA code you are running.

夜未央樱花落 2024-11-10 14:11:11


@pQB
Hello, I am Ramesh from NVIDIA. We could not reproduce this issue locally here. This kind of error occurs when the value for that column is either empty or invalid. In your case (Error in profiler data file '/home/myusername/development/qtspace/bin/temp_compute_profiler_0_0.csv' at line number 6 for column 'memory transfer size'), the value of the 'memory transfer size' column is either empty or invalid at line 6 of the CSV file.

Could you send 'temp_compute_profiler_0_0.csv', if it is present in your working directory, along with the CSV generated by the command-line profiler? If that is not possible, check what value you are getting for that column (memory transfer size) at line 6.
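
For example, a quick way to inspect that line from the shell (using the path from the error message) might be:

# print line 6 of the profiler CSV to see what the 'memory transfer size' field contains
sed -n '6p' /home/myusername/development/qtspace/bin/temp_compute_profiler_0_0.csv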

Are you running your app with the default settings in the Visual Profiler? Can you try running your app with the 'memory transfer size' option disabled? To disable this option, click the menu "Session -> Session Settings…", then in the Session Settings dialog click the "Other Options" tab and uncheck "memory transfer size".
