CUDAFunctionLoad in Mathematica - Indexing problem
I am trying to debug an indexing problem I am having on my CUDA machine.
CUDA Machine Info:
{1->{Name->Tesla C2050,Clock Rate->1147000,Compute Capabilities->2.,GPU Overlap->1,Maximum Block Dimensions->{1024,1024,64},Maximum Grid Dimensions->{65535,65535,65535},Maximum Threads Per Block->1024,Maximum Shared Memory Per Block->49152,Total Constant Memory->65536,Warp Size->32,Maximum Pitch->2147483647,Maximum Registers Per Block->32768,Texture Alignment->512,Multiprocessor Count->14,Core Count->448,Execution Timeout->0,Integrated->False,Can Map Host Memory->True,Compute Mode->Default,Texture1D Width->65536,Texture2D Width->65536,Texture2D Height->65535,Texture3D Width->2048,Texture3D Height->2048,Texture3D Depth->2048,Texture2D Array Width->16384,Texture2D Array Height->16384,Texture2D Array Slices->2048,Surface Alignment->512,Concurrent Kernels->True,ECC Enabled->True,Total Memory->2817982462}}
All this code does is set the values of a 3D array equal to the index that CUDA is using:
__global__ void cudaMatExp(
    float *matrix1, float *matrixStore, int lengthx, int lengthy, int lengthz) {
    // Linear index of this block within the (up to 2D) grid
    long UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
    // Global linear index of this thread across the whole launch
    long index = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x +
                 threadIdx.z * blockDim.y * blockDim.x +
                 threadIdx.y * blockDim.x +
                 threadIdx.x;
    if (index < lengthx * lengthy * lengthz) {
        matrixStore[index] = index;  // record the index this thread handled
    }
}
For some reason, once the dimensions of my 3D array become too large, the indexing stops short of covering the array.
I have tried different block dimensions (blockDim.x by blockDim.y by blockDim.z):
8x8x8 only gives correct indexing up to array dimension 12x12x12
9x9x9 only gives correct indexing up to array dimension 14x14x14
10x10x10 only gives correct indexing up to array dimension 15x15x15
For dimensions larger than these, the maximum index for each of the different block sizes eventually starts to increase again, but it never reaches dim^3 - 1 (the maximum index that the CUDA threads should reach).
Here are some plots that illustrate this behavior. For example, this one plots the dimension of the 3D array on the x axis (for a dim x dim x dim array) against the maximum index number processed during the CUDA execution on the y axis. This particular plot is for block dimensions of 10x10x10.
Here is the (Mathematica) code to generate that plot, but when I ran this one, I used block dimensions of 1024x1x1:
CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
   {{"Float", _, "Input"}, {"Float", _, "Output"},
    _Integer, _Integer, _Integer},
   {1024, 1, 1}]; (*These last three numbers are the block dimensions;
                    codeexp is a string holding the kernel source above*)
max = 100; (* the maximum dimension of the 3D array *)
hold = Table[1, {i, 1, max}];
compare = Table[i^3, {i, 1, max}]; (* i^3 for comparison; the expected maximum index is i^3 - 1 *)
Do[
  dim = ii;
  AA = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
    "TargetPrecision" -> "Single"];
  BB = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
    "TargetPrecision" -> "Single"];
  (* CUDAExp returns the list of output arguments; [[1]] is the output memory *)
  hold[[ii]] = Max[Flatten[
     CUDAMemoryGet[CUDAExp[AA, BB, dim, dim, dim][[1]]]]];
  CUDAMemoryUnload[AA]; CUDAMemoryUnload[BB]; (* free device memory each pass *)
  , {ii, 1, max}]
ListLinePlot[{compare, Flatten[hold]}, PlotRange -> All]
This is the same plot, but now also plotting x^3 to compare against where the maximum index should be. Notice that it diverges once the dimension of the array exceeds 32.
I tested various dimensions of the 3D array, looked at how far the indexing goes, and compared it with dim^3 - 1. E.g., for dim = 32 the CUDA max index is 32767 (which is 32^3 - 1), but for dim = 33 the CUDA output is 33791 when it should be 35936 (33^3 - 1). Notice that 33791 - 32767 = 1024 = blockDim.x.
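To make that arithmetic concrete (this check is mine, not part of the original post): the observed ceiling corresponds to exactly 33 full blocks of 1024 threads, while 36 blocks would be needed to cover the whole array:

Ceiling[33^3/1024] (* 36 blocks of 1024 threads are needed for 33^3 = 35937 elements *)
33*1024 - 1        (* 33791, the maximum index actually reached, i.e. only 33 blocks ran *)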
Question:
Is there a way to correctly index an array with dimensions larger than the block dimensions in Mathematica?
Now, I know that some people use __mul24(threadIdx.y,blockDim.x) in their index equation to prevent errors in bit multiplication, but it doesn't seem to help in my case.
Also, I have seen someone mention that you should compile your code with -arch=sm_11 because by default it is compiled for compute capability 1.0. I don't know whether this is the case in Mathematica, though; I would assume that CUDAFunctionLoad[] knows to compile with 2.0 capability. Does anyone know?
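If you did need to force an architecture, CUDAFunctionLoad takes options that are passed through to the compiler. A minimal sketch, assuming the "CompileOptions" and "ShellOutputFunction" options behave as described in the CUDALink documentation (worth double-checking against your version):

CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
   {{"Float", _, "Input"}, {"Float", _, "Output"},
    _Integer, _Integer, _Integer},
   {1024, 1, 1},
   "CompileOptions" -> "-arch=sm_20", (* Tesla C2050 is compute capability 2.0 *)
   "ShellOutputFunction" -> Print] (* print the nvcc output to verify the flag *)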
Any suggestions would be extremely helpful!
Answer:
So, Mathematica kind of has a hidden way of dealing with grid dimensions. To fix your grid dimension to something that will work, you have to add another number to the end of the function you are calling.
The argument denotes the number of threads to launch (or grid dimension times block dimension).
For example, in my code above, (8,8,8) denotes the dimensions of the block. When you call CUDAExp[] in Mathematica, you can add an argument that denotes the number of threads to launch. In this example, I finally got it to work with the following:
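(The snippet itself did not survive in this copy of the post; judging from the description below, the working call presumably looked something like this, with the thread count as a sixth argument:)

CUDAExp[AA, BB, dim, dim, dim, 4096]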
Note that when you compile with CUDAFunctionLoad[], it only expects 5 inputs: the first is the array you pass it (of dimensions dim x dim x dim), and the second is where its memory is stored. The third, fourth, and fifth are the dimensions. When you pass it a 6th, Mathematica translates that as gridDim.x * blockDim.x; so, since I know I need gridDim.x = 512 in order for every element in the array to be dealt with, I set this number equal to 512 * 8 = 4096. I hope this is clear and useful to someone in the future who comes across this issue.
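As a general recipe (a sketch of mine that follows the answer's description of the sixth argument, not verified CUDALink behavior): pick gridDim.x large enough that the launch covers every element, and pass gridDim.x * blockDim.x as the trailing value:

blockDims = {8, 8, 8};
threadsPerBlock = Times @@ blockDims; (* 512 threads per block *)
gridDimX = Ceiling[dim^3/threadsPerBlock]; (* enough blocks to cover dim^3 elements *)
launchArg = gridDimX*First[blockDims]; (* gridDim.x * blockDim.x, per the answer *)
CUDAExp[AA, BB, dim, dim, dim, launchArg]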