当前位置：文江博客话题详情

使用 CUDA 显示 GPU 优于 CPU 的最简单示例

发布于 2024-12-08 08:20:38 字数 201 浏览 2 评论 0原文

我正在寻找最简洁的代码量，可以为 CPU（使用 g++）和 GPU（使用 nvcc）编写代码，并且 GPU 的性能始终优于 CPU。任何类型的算法都是可以接受的。

澄清一下：我实际上是在寻找两个短代码块，一个用于 CPU（在 g++ 中使用 C++），另一个用于 GPU（在 nvcc 中使用 C++），GPU 的性能优于 GPU。最好以秒或毫秒为单位。尽可能短的代码对。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

毁梦 2024-12-15 08:20:38

首先，我将重申我的评论：GPU 具有高带宽、高延迟的特点。试图让 GPU 在纳秒作业（甚至毫秒或秒作业）上击败 CPU 完全没有实现 GPU 的目的。下面是一些简单的代码，但要真正体会 GPU 的性能优势，您需要一个大的问题规模来分摊启动成本……否则，它毫无意义。我可以在两英尺比赛中击败法拉利，仅仅是因为转动钥匙、启动发动机和踩踏板需要一些时间。这并不意味着我在任何意义上都比法拉利更快。

在 C++ 中使用类似的内容：

  #define N (1024*1024)
  #define M (1000000)
  int main()
  {
     float data[N]; int count = 0;
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
           data[i] = data[i] * data[i] - 0.25f;
        }
     }
     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }

在 CUDA/C 中使用类似的内容：

  #define N (1024*1024)
  #define M (1000000)

  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  int main()
  {
     float data[N]; int count = 0;
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));
     cudakernel<<<N/256, 256>>>(d_data);
     cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_data); 

     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }

如果这不起作用，请尝试使 N 和 M 更大，或将 256 更改为 128 或 512。

First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it take some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.

Use something like this in C++:

  #define N (1024*1024)
  #define M (1000000)
  int main()
  {
     float data[N]; int count = 0;
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
           data[i] = data[i] * data[i] - 0.25f;
        }
     }
     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }

Use something like this in CUDA/C:

  #define N (1024*1024)
  #define M (1000000)

  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  int main()
  {
     float data[N]; int count = 0;
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));
     cudakernel<<<N/256, 256>>>(d_data);
     cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_data); 

     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }

If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.

回复收藏 0 原文

深海夜未眠 2024-12-15 08:20:38

作为参考，我做了一个类似的时间测量示例。使用 GTX 660，GPU 加速为 24 倍，其操作除了实际计算之外还包括数据传输。

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <time.h>

#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024

void serial_add(double *a, double *b, double *c, int n, int m)
{
    for(int index=0;index<n;index++)
    {
        for(int j=0;j<m;j++)
        {
            c[index] = a[index]*a[index] + b[index]*b[index];
        }
    }
}

__global__ void vector_add(double *a, double *b, double *c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
        for(int j=0;j<M;j++)
        {
            c[index] = a[index]*a[index] + b[index]*b[index];
        }
}

int main()
{
    clock_t start,end;

    double *a, *b, *c;
    int size = N * sizeof( double );

    a = (double *)malloc( size );
    b = (double *)malloc( size );
    c = (double *)malloc( size );

    for( int i = 0; i < N; i++ )
    {
        a[i] = b[i] = i;
        c[i] = 0;
    }

    start = clock();
    serial_add(a, b, c, N, M);

    printf( "c[0] = %d\n",0,c[0] );
    printf( "c[%d] = %d\n",N-1, c[N-1] );

    end = clock();

    float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("Serial: %f seconds\n",time1);

    start = clock();
    double *d_a, *d_b, *d_c;


    cudaMalloc( (void **) &d_a, size );
    cudaMalloc( (void **) &d_b, size );
    cudaMalloc( (void **) &d_c, size );


    cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );

    vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );

    cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );


    printf( "c[0] = %d\n",0,c[0] );
    printf( "c[%d] = %d\n",N-1, c[N-1] );


    free(a);
    free(b);
    free(c);
    cudaFree( d_a );
    cudaFree( d_b );
    cudaFree( d_c );

    end = clock();
    float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);

    return 0;
}

For reference, I made a similar example with time measurements. With GTX 660, the GPU speedup was 24X where its operation includes data transfers in addition to actual computation.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <time.h>

#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024

void serial_add(double *a, double *b, double *c, int n, int m)
{
    for(int index=0;index<n;index++)
    {
        for(int j=0;j<m;j++)
        {
            c[index] = a[index]*a[index] + b[index]*b[index];
        }
    }
}

__global__ void vector_add(double *a, double *b, double *c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
        for(int j=0;j<M;j++)
        {
            c[index] = a[index]*a[index] + b[index]*b[index];
        }
}

int main()
{
    clock_t start,end;

    double *a, *b, *c;
    int size = N * sizeof( double );

    a = (double *)malloc( size );
    b = (double *)malloc( size );
    c = (double *)malloc( size );

    for( int i = 0; i < N; i++ )
    {
        a[i] = b[i] = i;
        c[i] = 0;
    }

    start = clock();
    serial_add(a, b, c, N, M);

    printf( "c[0] = %d\n",0,c[0] );
    printf( "c[%d] = %d\n",N-1, c[N-1] );

    end = clock();

    float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("Serial: %f seconds\n",time1);

    start = clock();
    double *d_a, *d_b, *d_c;


    cudaMalloc( (void **) &d_a, size );
    cudaMalloc( (void **) &d_b, size );
    cudaMalloc( (void **) &d_c, size );


    cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );

    vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );

    cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );


    printf( "c[0] = %d\n",0,c[0] );
    printf( "c[%d] = %d\n",N-1, c[N-1] );


    free(a);
    free(b);
    free(c);
    cudaFree( d_a );
    cudaFree( d_b );
    cudaFree( d_c );

    end = clock();
    float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);

    return 0;
}

回复收藏 0 原文

不语却知心 2024-12-15 08:20:38

一种非常非常简单的方法是计算前 100,000 个整数的平方，或者计算大型矩阵运算。它很容易实现，并且通过避免分支、不需要堆栈等来发挥 GPU 的优势。我不久前使用 OpenCL 与 C++ 进行了此操作，并得到了一些非常惊人的结果。（2GB GTX460 的性能大约是双核 CPU 的 40 倍。）

您是在寻找示例代码，还是只是想法？

编辑

40x 是与双核 CPU 相比，而不是四核。

一些提示：

确保在运行基准测试时没有运行 Crysis 等软件。
关闭所有可能占用 CPU 时间的不必要的应用程序和服务。
确保您的孩子在基准测试运行时不会开始在您的电脑上观看电影。硬件 MPEG 解码往往会影响结果。（自动播放让我两岁的孩子通过插入磁盘来启动卑鄙的我。耶。）

正如我在对 @Paul R 的评论回复中所说，考虑使用 OpenCL，因为它可以轻松地让您在 GPU 和 CPU 上运行相同的代码，而无需必须重新实现它。

（回想起来，这些可能是非常明显的。）

回复收藏 0 原文

毁梦 2024-12-15 08:20:38

我同意 David 的评论，认为 OpenCL 是测试这一点的好方法，因为在 CPU 和 GPU 上运行代码之间切换是多么容易。如果您能够在 Mac 上工作，Apple 有一些很好的示例代码，可以执行使用 OpenCL 进行 N 体模拟，内核在 CPU、GPU 或两者上运行。您可以在它们之间实时切换，并且 FPS 计数显示在屏幕上。

对于更简单的情况，他们有一个 "hello world" OpenCL命令行应用程序，以类似于 David 描述的方式计算平方。这或许可以毫不费力地移植到非 Mac 平台。要在 GPU 和 CPU 使用率之间切换，我相信您只需将

int gpu = 1;

hello.c 源文件中的行更改为 0 表示 CPU，1 表示 GPU。

Apple 在其中提供了更多 OpenCL 示例代码主要 Mac 源代码列表。

David Gohara 博士在这个介绍性视频会议的最后提供了一个 OpenCL GPU 在执行分子动力学计算时加速的示例主题（大约第 34 分钟左右）。在他的计算中，他发现从在 8 个 CPU 内核上运行的并行实现改为在单个 GPU 上运行，速度大约提高了 27 倍。同样，这不是最简单的示例，但它展示了真实世界的应用程序以及在 GPU 上运行某些计算的优势。

我还做了一些修补移动空间使用 OpenGL ES 着色器执行基本计算。我发现，在 GPU 上作为着色器运行时，在图像上运行的简单颜色阈值着色器比在该特定设备的 CPU 上执行的相同计算快大约 14-28 倍。

I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.

For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the

int gpu = 1;

line in the hello.c source file to 0 for CPU, 1 for GPU.

Apple has some more OpenCL example code in their main Mac source code listing.

Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (about around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.

I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.

回复收藏 0 原文

~没有更多了~