How do I allocate a shared memory array in CUDA Fortran?

Posted 2024-10-06 13:29:08


I'm having trouble trying to declare a shared memory array within the kernel. Here's the code containing my kernel:

module my_kernels

  use cudafor
  implicit none

contains

  attributes(global) subroutine mykernel(N)

    ! Declare variables
    integer :: index
    integer, intent(in), value :: N
    real,shared,dimension(N) :: shared_array  

    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x

    ! Set array element equal to index
    shared_array(index) = index

  end subroutine mykernel

end module my_kernels

And here's how I call my kernel:

program cuda

  use my_kernels
  implicit none  

  ! Set number of threads
  integer :: N = 9

  ! Invoke kernel with 3 blocks of 3 threads
  call mykernel<<<N/3,3>>>(N)

end program cuda

All of this is in one file, test.cuf. When I try to compile test.cuf with pgf90, I get this error:

PGF90-S-0000-Internal compiler error. unexpected runtime function call       0 (test.cuf: 34)
PGF90-S-0000-Internal compiler error. unsupported procedure     349 (test.cuf: 34)
  0 inform,   0 warnings,   2 severes, 0 fatal for mykernel
/tmp/pgcudaforw5MgcaFALD9p.gpu(19): error: a value of type "int" cannot be assigned to an entity of type "float *"

/tmp/pgcudaforw5MgcaFALD9p.gpu(22): error: expected an expression

2 errors detected in the compilation of "/tmp/pgnvdl7MgHLY1VOV5.nv0".
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (test.cuf: 34)
PGF90/x86-64 Linux 10.8-0: compilation aborted

In this case, line 34 refers to end subroutine mykernel. The compiler error is not very helpful; it took me a while to work out that the problem was related to the shared array (I'm using this code as a simple example).

When I replace 'N' with '9' in the declaration of the shared array, so that real,shared,dimension(N) :: shared_array becomes real,shared,dimension(9) :: shared_array, the error goes away.

My question is: why does this error occur, and how do I set the dimension of a shared array with a variable (if that is indeed possible)?


狼亦尘 2024-10-13 13:29:08


Change "dimension(N)" to "dimension(*)" and then pass in the size of the shared array (in bytes) as the third argument of your kernel launch.

Hope this helps,

Mat

% cat test.cuf 
module my_kernels

  use cudafor
  implicit none

  real, dimension(:), allocatable, device :: Ad
  real, dimension(:), allocatable :: Ah

contains

  attributes(global) subroutine mykernel(N)

    ! Declare variables
    integer :: index
    integer, intent(in), value :: N
    real, shared, dimension(*) :: shared_array

    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x

    ! Set array element equal to index
    shared_array(index) = index

    Ad(index) = index

  end subroutine mykernel

end module my_kernels


program cuda

  use my_kernels
  implicit none

  ! Set number of threads
  integer :: N = 9

  allocate(Ad(N), Ah(N))

  ! Invoke kernel with 3 blocks of 3 threads and N*4 bytes of shared memory
  call mykernel<<<N/3,3,N*4>>>(N)

  Ah = Ad
  print *, Ah

end program cuda

% pgf90 test.cuf -V10.9 ; a.out
    1.000000        2.000000        3.000000        4.000000     
    5.000000        6.000000        7.000000        8.000000     
    9.000000 
数理化全能战士 2024-10-13 13:29:08


You can have more than one shared memory array, but their sizes must be known at compile time.
In general, shared memory arrays should be of fixed size; the case where you can pass the size in bytes at runtime is the exception.
I suspect this is due to the limited amount of shared memory available per SM (Streaming Multiprocessor).
In my experience developing in both CUDA C and CUDA Fortran, it is better to keep all these parameters "fixed" and then have the kernel repeat the work as many times as needed to cover all the input data; that way it is easier to control all the parameters that affect occupancy (how well you use the physical resources of the GPU).
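The fixed-size-plus-repetition approach described above can be sketched in CUDA Fortran as follows. This is an illustrative sketch, not code from the thread: the module name, the TILE parameter, the dummy array A, and the grid-stride loop are my own assumptions, chosen to show a shared array whose size is a compile-time constant while the kernel still covers an arbitrary amount of input.

```fortran
module fixed_kernels

  use cudafor
  implicit none

  ! Block size fixed at compile time, so the shared array size is static
  integer, parameter :: TILE = 256

contains

  attributes(global) subroutine mykernel_fixed(A, N)

    real, device :: A(*)                 ! output array in device memory
    integer, intent(in), value :: N
    real, shared :: tile_array(TILE)     ! size known at compile time
    integer :: i, idx

    ! Grid-stride loop: the fixed-size grid sweeps over all N elements,
    ! repeating the work instead of growing the shared array
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    do idx = i, N, blockDim%x * gridDim%x
      tile_array(threadIdx%x) = real(idx)
      call syncthreads()
      A(idx) = tile_array(threadIdx%x)
      call syncthreads()
    end do

  end subroutine mykernel_fixed

end module fixed_kernels
```

Because TILE is a parameter, the compiler can allocate the shared array statically, and occupancy is easy to reason about: the shared-memory cost per block never changes, regardless of N.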
