3D image indexing

Posted on 2024-12-02 21:53:07

I have an image of size 512 x 512 x 512.
I need to process all the voxels individually.
How can I get the thread id to do this?
If I use a 1D thread ID, the number of blocks will exceed the limit of 65535.

    int id = blockIdx.x*blockDim.x + threadIdx.x;

Note: my card doesn't support 3D grids.

Comments (5)

∞琼窗梦回ˉ 2024-12-09 21:53:07

You can use 3D indices with CUDA 4.0 and compute capability 2.0+. Example code:

    // Round up so that partially filled blocks at the edges are still launched.
    int blocksInX = (nx+8-1)/8;
    int blocksInY = (ny+8-1)/8;
    int blocksInZ = (nz+8-1)/8;

    dim3 Dg(blocksInX, blocksInY, blocksInZ);
    dim3 Db(8, 8, 8);
    foo_kernel<<<Dg, Db>>>(R, nx, ny, nz);

    ...

    __global__ void foo_kernel( float* R, const int nx, const int ny, const int nz )
    {
      unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
      unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
      unsigned int zIndex = blockDim.z * blockIdx.z + threadIdx.z;

      // Threads in the rounded-up edge blocks may fall outside the volume.
      if ( (xIndex < nx) && (yIndex < ny) && (zIndex < nz) )
      {
        unsigned int index_out = xIndex + nx*yIndex + nx*ny*zIndex;
        ...
        R[index_out] = ...;
      }
    }

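If your device doesn't support compute capability 2.0, there is a trick: fold the y and z block indices into the second dimension of a 2D grid, then unfold them inside the kernel.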

    int threadsInX = 16;
    int threadsInY = 4;
    int threadsInZ = 4;

    int blocksInX = (nx+threadsInX-1)/threadsInX;
    int blocksInY = (ny+threadsInY-1)/threadsInY;
    int blocksInZ = (nz+threadsInZ-1)/threadsInZ;

    // The y and z block counts share the second grid dimension.
    dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ);
    dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);

    foo_kernel<<<Dg, Db>>>(R, nx, ny, nz, blocksInY, 1.0f/(float)blocksInY);

    __global__ void foo_kernel(float *R, const int nx, const int ny, const int nz,
                               unsigned int blocksInY, float invBlocksInY)
    {
        // Unfold blockIdx.y into separate y and z block indices.
        unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);
        unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz, blocksInY);
        unsigned int xIndex = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        unsigned int yIndex = __umul24(blockIdxy, blockDim.y) + threadIdx.y;
        unsigned int zIndex = __umul24(blockIdxz, blockDim.z) + threadIdx.z;

        // yIndex is checked against ny (the original compared it with xIndex).
        if ( (xIndex < nx) && (yIndex < ny) && (zIndex < nz) )
        {
            unsigned int index = xIndex + nx*yIndex + nx*ny*zIndex;
            ...
            R[index] = ...;
        }
    }
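The unfolding step above relies on a floating-point reciprocal. As a minimal sketch of an alternative, the same (y, z) block coordinates can be recovered with integer division, which does not depend on floating-point rounding but costs an integer divide. The names `unfoldBlockIndex` and `foo_kernel_int` are hypothetical, and the store of `0.0f` is only a placeholder for the per-voxel work that the answer elides.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: split a folded block index (y and z packed into one
// grid dimension) back into its y and z parts using integer arithmetic only.
__host__ __device__ inline void unfoldBlockIndex(unsigned int folded,
                                                 unsigned int blocksInY,
                                                 unsigned int *by,
                                                 unsigned int *bz)
{
    *bz = folded / blocksInY;        // which z slab of blocks
    *by = folded - *bz * blocksInY;  // position within that slab
}

// Hypothetical variant of the kernel above; launched with the same Dg and Db.
__global__ void foo_kernel_int(float *R, int nx, int ny, int nz,
                               unsigned int blocksInY)
{
    unsigned int by, bz;
    unfoldBlockIndex(blockIdx.y, blocksInY, &by, &bz);

    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = by * blockDim.y + threadIdx.y;
    unsigned int z = bz * blockDim.z + threadIdx.z;

    if (x < (unsigned int)nx && y < (unsigned int)ny && z < (unsigned int)nz)
    {
        unsigned int index = x + nx * (y + ny * z);
        R[index] = 0.0f;  // placeholder for the real per-voxel operation
    }
}
```

For the 512 x 512 x 512 volume with 16 x 4 x 4 threads per block, `blocksInY` is 128, so a folded index of 130 unfolds to y block 2 and z block 1.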

等待我真够勒 2024-12-09 21:53:07

You could use a multi-dimensional grid, which gives you many more block indices.
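As a rough sketch of that idea, a 2D grid of 1D blocks can be collapsed back into a single linear thread ID, so each grid dimension stays within 65535 while the total block count goes well beyond it. The names `linearThreadId`, `processVoxels`, and `gridFor` are hypothetical, and doubling each voxel is only a placeholder operation.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Linear thread ID for a 2D grid of 1D blocks.
__device__ inline size_t linearThreadId()
{
    size_t blockId = (size_t)blockIdx.y * gridDim.x + blockIdx.x;
    return blockId * blockDim.x + threadIdx.x;
}

// Hypothetical kernel: one thread per voxel, guarded against overshoot.
__global__ void processVoxels(float *data, size_t n)
{
    size_t id = linearThreadId();
    if (id < n)
        data[id] *= 2.0f;  // placeholder per-voxel operation
}

// Host helper: pick a 2D grid whose dimensions both stay within 65535
// and whose blocks cover at least n threads.
inline dim3 gridFor(size_t n, unsigned int threadsPerBlock)
{
    size_t blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks == 0)
        blocks = 1;
    unsigned int gx = blocks < 65535 ? (unsigned int)blocks : 65535u;
    unsigned int gy = (unsigned int)((blocks + gx - 1) / gx);
    return dim3(gx, gy, 1);
}
```

For 512 x 512 x 512 voxels and 256 threads per block, 524288 blocks are needed, so this helper returns a 65535 x 9 grid; the surplus threads are rejected by the `id < n` check.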

夜还是长夜 2024-12-09 21:53:07

Note that your PC's memory is not 3D. The three dimensions are only a way of visualizing the data, so you can store your 3D image behind a single pointer:

    Array[i][j][z] is the same as Array2[i*cols + j + rows*cols*z];

Now pass Array2 to CUDA and work in a single dimension.
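The mapping above, and its inverse, can be sketched as two small helpers usable on both host and device. The names `flattenIndex` and `unflattenIndex` are hypothetical; `rows` and `cols` are the extents of the first two dimensions, as in the formula.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// (i, j, z) -> flat offset, following Array2[i*cols + j + rows*cols*z].
__host__ __device__ inline size_t flattenIndex(int i, int j, int z,
                                               int rows, int cols)
{
    return (size_t)i * cols + j + (size_t)rows * cols * z;
}

// Flat offset -> (i, j, z), the inverse of flattenIndex.
__host__ __device__ inline void unflattenIndex(size_t idx, int rows, int cols,
                                               int *i, int *j, int *z)
{
    size_t slice = (size_t)rows * cols;  // elements in one z slice
    *z = (int)(idx / slice);
    size_t rem = idx % slice;            // offset within the slice
    *i = (int)(rem / cols);
    *j = (int)(rem % cols);
}
```

A kernel that works on a 1D thread ID can call `unflattenIndex` when it needs the voxel coordinates, for example to look up neighbours.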

混浊又暗下来 2024-12-09 21:53:07

If you need a larger grid, CUDA supports 2D grids on all hardware, and the most recent versions of the CUDA toolkit also support 3D grids on current Fermi hardware.

However, it isn't strictly necessary to have such large grids. If each voxel operation is independent, then why not just use a 1D grid, but have each thread process more than one voxel? Not only would such a scheme not need larger 2D or 3D grids, it might well be more efficient because the fixed costs associated with scheduling and initialization of a block can be amortized over multiple voxel calculations.
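One common way to let each thread process more than one voxel is a grid-stride loop. The following is a minimal sketch under that assumption, not necessarily the scheme this answer had in mind; `processVoxelsStrided`, `cappedBlockCount`, and `d_voxels` are hypothetical names, and adding 1.0f stands in for the real per-voxel operation.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Each thread starts at its own global ID and then jumps ahead by the total
// number of threads in the grid until the whole volume has been covered.
__global__ void processVoxelsStrided(float *voxels, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t id = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         id < n;
         id += stride)
    {
        voxels[id] += 1.0f;  // placeholder per-voxel operation
    }
}

// Host helper: number of blocks for a 1D launch, capped at maxBlocks.
inline unsigned int cappedBlockCount(size_t n, unsigned int threadsPerBlock,
                                     unsigned int maxBlocks)
{
    size_t needed = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (needed == 0)
        needed = 1;
    return needed < maxBlocks ? (unsigned int)needed : maxBlocks;
}

// Example launch (d_voxels is a hypothetical device pointer):
//   const size_t n = (size_t)512 * 512 * 512;
//   processVoxelsStrided<<<cappedBlockCount(n, 256, 65535), 256>>>(d_voxels, n);
```

With 256 threads per block and the block count capped at 65535, each thread handles about eight voxels of the 512 x 512 x 512 volume, and the kernel stays correct for any smaller grid size.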

纸伞微斜 2024-12-09 21:53:07

I used something like this:

In the code, define your grid:

    dim3 altgrid, altthreads;
    altgrid.x = lx;
    altgrid.y = ly;
    altgrid.z = 1;
    altthreads.x = lz;
    altthreads.y = 1;
    altthreads.z = 1;

and in the kernel:

    int idx = threadIdx.x;
    int idy = blockIdx.x;
    int idz = blockIdx.y;

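Since the array on the device is only 1D, you retrieve the [idx][idy][idz] element of a matrix A as A[ind], where ind=idz+lz*(idy+ly*idx). The launch configuration above is consistent with this formula when lx, ly, and lz are equal, as in the 512 x 512 x 512 case; for unequal extents, the ranges of idx, idy, and idz must be matched to the formula.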

I hope it helps
