How do I allocate a shared memory array in CUDA Fortran?

Posted 2024-10-06 13:29:08


I'm having trouble trying to declare a shared memory array within the kernel. Here's the code containing my kernel:

module my_kernels

  use cudafor
  implicit none

contains

  attributes(global) subroutine mykernel(N)

    ! Declare variables
    integer :: index
    integer, intent(in), value :: N
    real,shared,dimension(N) :: shared_array  

    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x

    ! Set array element equal to index
    shared_array(index) = index

  end subroutine mykernel

end module my_kernels

And here's how I call my kernel:

program cuda

  use my_kernels
  implicit none  

  ! Set number of threads
  integer :: N = 9

  ! Invoke kernel with 3 blocks of 3 threads
  call mykernel<<<N/3,3>>>(N)

end program cuda

All of this is in one file, test.cuf. When I try to compile test.cuf with pgf90, I get this error:

PGF90-S-0000-Internal compiler error. unexpected runtime function call       0 (test.cuf: 34)
PGF90-S-0000-Internal compiler error. unsupported procedure     349 (test.cuf: 34)
  0 inform,   0 warnings,   2 severes, 0 fatal for mykernel
/tmp/pgcudaforw5MgcaFALD9p.gpu(19): error: a value of type "int" cannot be assigned to an entity of type "float *"

/tmp/pgcudaforw5MgcaFALD9p.gpu(22): error: expected an expression

2 errors detected in the compilation of "/tmp/pgnvdl7MgHLY1VOV5.nv0".
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (test.cuf: 34)
PGF90/x86-64 Linux 10.8-0: compilation aborted

In this case, line 34 refers to end subroutine mykernel. The compiler error is not very helpful; it took me a while to work out that the problem was related to the shared array (I'm using this code as a simple example).

When I replace 'N' with '9' in the declaration of the shared array, so that real,shared,dimension(N) :: shared_array becomes real,shared,dimension(9) :: shared_array, the error goes away.

My question is: why does this error occur, and how do I set the dimension of a shared array with a variable (if that is indeed possible)?


狼亦尘 2024-10-13 13:29:08


Change "dimension(N)" to "dimension(*)" and then pass in the size of the shared array (in bytes) as the third argument of your kernel launch.

Hope this helps,

Mat

% cat test.cuf 
module my_kernels

  use cudafor
  implicit none

  real, dimension(:), allocatable, device :: Ad
  real, dimension(:), allocatable :: Ah

contains

  attributes(global) subroutine mykernel(N)

    ! Declare variables
    integer :: index
    integer, intent(in), value :: N
    real, shared, dimension(*) :: shared_array

    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x

    ! Set array element equal to index
    shared_array(index) = index

    Ad(index) = index

  end subroutine mykernel

end module my_kernels


program cuda

  use my_kernels
  implicit none

  ! Set number of threads
  integer :: N = 9

  allocate(Ad(N), Ah(N))

  ! Invoke kernel with 3 blocks of 3 threads and N*4 bytes of shared memory
  call mykernel<<<N/3,3,N*4>>>(N)

  Ah = Ad
  print *, Ah

end program cuda

% pgf90 test.cuf -V10.9 ; a.out
    1.000000        2.000000        3.000000        4.000000     
    5.000000        6.000000        7.000000        8.000000     
    9.000000 
数理化全能战士 2024-10-13 13:29:08


You can have more than one shared memory array, but their sizes must be known at compile time.
In general, shared memory arrays should be of fixed size; the case where you can pass the size in bytes at runtime is the exception.
I suspect this is due to the limited amount of shared memory available per SM (Streaming Multiprocessor).
In my experience developing in both CUDA C and CUDA Fortran, it is better to keep all these parameters "fixed" and then have the kernel repeat the work as many times as needed to cover all the input data; that way it is easier to control all the parameters that affect occupancy (how well you use the physical resources of the GPU).
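The fixed-size-plus-repetition approach described above can be sketched in CUDA Fortran as follows. This is an illustrative sketch, not code from the thread: the module name, the TILE parameter, the dummy array A, and the grid-stride loop are my own assumptions, chosen to show a shared array whose size is a compile-time constant while the kernel still covers an arbitrary amount of input.

```fortran
module fixed_kernels

  use cudafor
  implicit none

  ! Block size fixed at compile time, so the shared array size is static
  integer, parameter :: TILE = 256

contains

  attributes(global) subroutine mykernel_fixed(A, N)

    real, device :: A(*)                 ! output array in device memory
    integer, intent(in), value :: N
    real, shared :: tile_array(TILE)     ! size known at compile time
    integer :: i, idx

    ! Grid-stride loop: the fixed-size grid sweeps over all N elements,
    ! repeating the work instead of growing the shared array
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    do idx = i, N, blockDim%x * gridDim%x
      tile_array(threadIdx%x) = real(idx)
      call syncthreads()
      A(idx) = tile_array(threadIdx%x)
      call syncthreads()
    end do

  end subroutine mykernel_fixed

end module fixed_kernels
```

Because TILE is a parameter, the compiler can allocate the shared array statically, and occupancy is easy to reason about: the shared-memory cost per block never changes, regardless of N.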
