Low CUDA performance on a GT540M

Posted 2025-01-05 08:07:44

Executing the following code sample takes ~750 ms on a GeForce GT540M, whereas the same code executes in ~250 ms on a GT330M.

Copying dev_a and dev_b to CUDA device memory takes ~350 ms on the GT540M and ~250 ms on the GT330M. Executing "addCuda" and copying the result back to the host takes another ~400 ms on the GT540M and ~0 ms on the GT330M.

This is not what I expected, so I checked the devices' properties and discovered that the GT540M surpasses or equals the GT330M in every respect except the number of multiprocessors: the GT540M has 2 and the GT330M has 6. Can this really be true? And if so, can it really have such a great impact on the execution time?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define T 512
#define N 60000*T

__global__ void addCuda(double *a, double *b, double *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < N) {
        c[tid] = sqrt(fabs(a[tid] * b[tid] / 12.34567)) * cos(a[tid]);
    }
}

int main() {
    double *dev_a, *dev_b, *dev_c;

    double* a = (double*)malloc(N*sizeof(double));
    double* b = (double*)malloc(N*sizeof(double));
    double* c = (double*)malloc(N*sizeof(double));

    printf("Filling arrays (CPU)...\n\n");
    int i;
    for(i = 0; i < N; i++) {
        a[i] = (double)-i;
        b[i] = (double)i;
    }

    clock_t timer = clock();   /* clock() returns clock_t, not int */
    cudaMalloc((void**) &dev_a, N*sizeof(double));
    cudaMalloc((void**) &dev_b, N*sizeof(double));
    cudaMalloc((void**) &dev_c, N*sizeof(double));
    cudaMemcpy(dev_a, a, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(double), cudaMemcpyHostToDevice);

    printf("Memcpy time: %ld\n", (long)(clock() - timer));
    addCuda<<<(N+T-1)/T,T>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N*sizeof(double), cudaMemcpyDeviceToHost);

    printf("Time elapsed: %ld\n", (long)(clock() - timer));

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    free(c);

    return 0;
}
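One caveat about the timing above: clock() measures elapsed host time, so the first print also includes allocation overhead, and the second lumps the kernel together with the device-to-host copy. The usual way to time just the device work is with CUDA events. A minimal sketch of what that could look like for the addCuda kernel above (it needs a CUDA-capable build to run, so treat it as a sketch only):

```cuda
// Sketch: time only the kernel with CUDA events instead of clock().
// Assumes dev_a, dev_b, dev_c have been allocated and filled as above.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
addCuda<<<(N+T-1)/T,T>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed device time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Separating the kernel time from the copy time this way would also show whether the ~400 ms difference comes from compute or from the transfer.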

The device properties for the devices:

GT540M:

Major revision number:         2
Minor revision number:         1
Name:                          GeForce GT 540M
Total global memory:           1073741824
Total shared memory per block: 49152
Total registers per block:     32768
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     1024
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   65535
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    1344000
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     2
Kernel execution timeout:      Yes

GT330M:

Major revision number:         1
Minor revision number:         2
Name:                          GeForce GT 330M
Total global memory:           268435456
Total shared memory per block: 16384
Total registers per block:     16384
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     512
Maximum dimension 0 of block:  512
Maximum dimension 1 of block:  512
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   65535
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   1
Clock rate:                    1100000
Total constant memory:         65536
Texture alignment:             256
Concurrent copy and execution: Yes
Number of multiprocessors:     6
Kernel execution timeout:      Yes



2 Comments

谎言月老 2025-01-12 08:07:44

I don't think it is possible for a copy from device to host to take ~0 ms. I would suggest checking whether there is something wrong with that copy.
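Checking that copy is straightforward, since every CUDA runtime call returns a cudaError_t that the code in the question discards. A sketch of what the check could look like (it needs a CUDA build to compile, so this is only a sketch):

```cuda
// Sketch: inspect the return codes the original code ignores.
cudaError_t err;

// Did the kernel launch and run without error?
err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
}

// Did the device-to-host copy actually succeed?
err = cudaMemcpy(c, dev_c, N*sizeof(double), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
    fprintf(stderr, "D2H copy failed: %s\n", cudaGetErrorString(err));
}
```

A failed cudaMalloc or copy returns immediately, which would make the measured "~0 ms" unsurprising.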

寻找一个思念的角度 2025-01-12 08:07:44

Look at the number of multiprocessors.
