How do I allocate a shared memory array in CUDA Fortran?
I'm having trouble declaring a shared memory array within a kernel. Here's the code containing my kernel:
module my_kernels
  use cudafor
  implicit none
contains
  attributes(global) subroutine mykernel(N)
    ! Declare variables
    integer :: index
    integer, intent(in), value :: N
    real,shared,dimension(N) :: shared_array
    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x
    ! Set array element equal to index
    shared_array(index) = index
  end subroutine mykernel
end module my_kernels
And here's how I call my kernel:
program cuda
  use my_kernels
  implicit none
  ! Set number of threads
  integer :: N = 9
  ! Invoke kernel with 3 blocks of 3 threads
  call mykernel<<<N/3,3>>>(N)
end program cuda
All of this is in one file, test.cuf. When I try to compile test.cuf with pgf90, I get these errors:
PGF90-S-0000-Internal compiler error. unexpected runtime function call 0 (test.cuf: 34)
PGF90-S-0000-Internal compiler error. unsupported procedure 349 (test.cuf: 34)
0 inform, 0 warnings, 2 severes, 0 fatal for mykernel
/tmp/pgcudaforw5MgcaFALD9p.gpu(19): error: a value of type "int" cannot be assigned to an entity of type "float *"
/tmp/pgcudaforw5MgcaFALD9p.gpu(22): error: expected an expression
2 errors detected in the compilation of "/tmp/pgnvdl7MgHLY1VOV5.nv0".
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (test.cuf: 34)
PGF90/x86-64 Linux 10.8-0: compilation aborted
In this case line 34 refers to end subroutine mykernel. The compiler errors are not very helpful; it took me a while to work out that the problem was with the shared array (I'm using this code as a simple example).
When I replace 'N' with '9' in the declaration of the shared array, so that real,shared,dimension(N) :: shared_array becomes real,shared,dimension(9) :: shared_array, the error goes away.
My question is: why is this error occurring, and how do I set the dimension of a shared array with a variable (if indeed it's possible)?
2 Answers
Change "dimension(N)" to "dimension(*)" and then pass in the size of shared array (in bytes) as the third argument of your kernel launch.
Hope this helps,
Mat
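Applied to the example in the question, that suggestion looks something like the sketch below. It assumes the default real is 4 bytes, so N*4 bytes of dynamic shared memory are requested per block at launch:

module my_kernels
  use cudafor
  implicit none
contains
  attributes(global) subroutine mykernel(N)
    integer, intent(in), value :: N
    integer :: index
    ! Assumed-size shared array: its actual size comes from the
    ! third kernel-launch argument (in bytes), not from N
    real, shared, dimension(*) :: shared_array
    ! Map threadID to index
    index = blockDim%x * (blockIdx%x-1) + threadIdx%x
    ! Set array element equal to index
    shared_array(index) = index
  end subroutine mykernel
end module my_kernels

program cuda
  use my_kernels
  implicit none
  integer :: N = 9
  ! Third launch argument: dynamic shared memory per block, in bytes
  ! (N elements of the default 4-byte real)
  call mykernel<<<N/3,3,N*4>>>(N)
end program cuda

Each block then receives its own N*4-byte dynamic shared allocation, matching what the fixed dimension(9) version declared.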
You can have more than one shared memory array, but their sizes must be known at compile time.
In general, shared memory arrays should be of fixed size; the case where you pass the size in bytes at runtime is the exception.
I guess this is all due to the limited amount of shared memory per SM (Streaming Multiprocessor).
In my experience developing in both CUDA C and CUDA Fortran, it is better to have all these parameters "fixed" and then make the kernel repeat the work as many times as needed to cover all the input data. That way it is easier to control all the parameters that affect occupancy (how well you use the physical resources of the GPU).
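As an illustration of that pattern, here is a minimal sketch of a kernel with a fixed-size shared array and a grid-stride loop that repeats the work until all input elements are covered. The module name, the TILE parameter, and the doubling operation are made up for this example; the point is only the fixed-size shared declaration and the repeat-until-done structure:

module fixed_shared_example
  use cudafor
  implicit none
  ! Block size fixed at compile time, so the shared array size is too
  integer, parameter :: TILE = 256
contains
  attributes(global) subroutine double_kernel(a, n)
    integer, intent(in), value :: n
    real, device :: a(n)
    ! Fixed-size shared array: one element per thread in the block.
    ! (Staging through shared memory is unnecessary for this trivial
    ! operation; it is only here to show the fixed-size declaration.)
    real, shared :: tile_s(TILE)
    integer :: i
    ! Grid-stride loop: a fixed grid of blocks repeats the work
    ! until all n elements have been processed
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    do while (i <= n)
      tile_s(threadIdx%x) = a(i)
      a(i) = 2.0 * tile_s(threadIdx%x)
      i = i + gridDim%x * blockDim%x
    end do
  end subroutine double_kernel
end module fixed_shared_example

program use_fixed
  use cudafor
  use fixed_shared_example
  implicit none
  integer, parameter :: n = 10000
  real, device :: a_d(n)
  real :: a_h(n)
  integer :: istat
  a_h = 1.0
  a_d = a_h
  ! Fixed launch configuration: 64 blocks of TILE threads, independent of n
  call double_kernel<<<64,TILE>>>(a_d, n)
  istat = cudaDeviceSynchronize()
  a_h = a_d
  print *, a_h(1), a_h(n)   ! expect 2.0 and 2.0
end program use_fixed

With this structure the grid and block sizes (and therefore the shared memory per block and the occupancy) can be tuned independently of the problem size n.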