CUDA:访问两个维度的任意长矩阵

发布于 2024-11-08 16:19:19 字数 1362 浏览 6 评论 0原文

嘿, 目前我正在使用仅在一维索引的线程来访问矩阵的所有元素,如下所示:

// Thread-ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Offset:
int offset = gridDim.x * blockDim.x;

while(idx < MATRIX_ROWS * MATRIX_COLS)
{   
    row = idx % MATRIX_ROWS;
    col = idx / MATRIX_ROWS;    

    matrix[ row ][ col ] = ...;
    idx += offset;
}

现在我想知道如何使用二维索引访问任意长矩阵。我希望一个块始终访问一行的单个元素。类似的东西(x-index 指的是矩阵的列,y-index 指的是矩阵的行):

// Thread-IDs
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offset = gridDim.x * blockDim.x;

while(idx < MATRIX_COLS)
{   
    matrix[ idy ][ idx ] = ...;
    idx += offset;
}

现在让我们假设矩阵的行数比调用内核时开始块的行数多: 当启动 N 个块时,前 N 个块矩阵的行处理正确,但其他行呢?你会怎么做? 谢谢!

编辑:我想出了一个想法,但我不知道这是否是某种“丑陋”的编码!?

// Thread-IDs
int idx0 = blockIdx.x * blockDim.x + threadIdx.x;
int idx = idx0;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offsetx = gridDim.x * blockDim.x;
int offsety = gridDim.y * blockDim.y;

while(idx < MATRIX_COLS && idy < MATRIX_ROWS)
{   
    matrix[ idy ][ idx ] = ...;

    idx += offsetx;
    if(idx > MATRIX_COLS)
    {
        // Jump to nex row and start from 'beginning' concerning columns
        idy += offsety;
        idx = idx0;
    }
}

Hey there,
currently I'm using threads indexed only in one dimension to access all elements of a matrix like that:

// Thread-ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Offset:
int offset = gridDim.x * blockDim.x;

while(idx < MATRIX_ROWS * MATRIX_COLS)
{   
    row = idx % MATRIX_ROWS;
    col = idx / MATRIX_ROWS;    

    matrix[ row ][ col ] = ...;
    idx += offset;
}

Now I wondered how to access arbitrary long matrices with two dimensional indexing. I would like that one block always access the single elements of one row. Something like that (x-index is refering to the cols and the y-index to the rows of the matrix):

// Thread-IDs
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offset = gridDim.x * blockDim.x;

while(idx < MATRIX_COLS)
{   
    matrix[ idy ][ idx ] = ...;
    idx += offset;
}

Now let's assume the matrix has more rows than I started blocks when calling the kernel: When starting N blocks, the first N rows of the matrix are handled right, but what about the other rows? How would you do that?
Thanks!

EDIT: I came up with an idea but I don't know if that's somehow 'ugly' coding!?

// Thread-IDs
int idx0 = blockIdx.x * blockDim.x + threadIdx.x;
int idx = idx0;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offsetx = gridDim.x * blockDim.x;
int offsety = gridDim.y * blockDim.y;

while(idx < MATRIX_COLS && idy < MATRIX_ROWS)
{   
    matrix[ idy ][ idx ] = ...;

    idx += offsetx;
    if(idx > MATRIX_COLS)
    {
        // Jump to nex row and start from 'beginning' concerning columns
        idy += offsety;
        idx = idx0;
    }
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

尸血腥色 2024-11-15 16:19:19

也许这样的东西就是您正在寻找的?

// Thread-IDs
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offsetx = gridDim.x * blockDim.x;
int offsety = gridDim.y * blockDim.y;

for(row = idy; row < MATRIX_ROWS; i+=offsety) {
    float * pos = matrix + row;

#pragma unroll
    for(col = idx; col < MATRIX_COLS; col+=offsetx) {
        pos[col] = .....;
    }
}

如果 MATRIX_COLS 是预处理器定义或常量,编译器可能能够展开内部循环并提供更多性能。

编辑:代码的第一个版本是用列优先顺序编写的,这在我的脑海中留下了深刻的印象,因此请忽略此处的评论。应该以这种方式合并访问。

Perhaps something like this is what you are looking for?

// Thread-IDs
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offsetx = gridDim.x * blockDim.x;
int offsety = gridDim.y * blockDim.y;

for(row = idy; row < MATRIX_ROWS; i+=offsety) {
    float * pos = matrix + row;

#pragma unroll
    for(col = idx; col < MATRIX_COLS; col+=offsetx) {
        pos[col] = .....;
    }
}

If MATRIX_COLS is a preprocessor define or constant, the compiler might be able to unroll the inner loop and give a bit more performance.

EDIT: The first version of the code was written with column-major ordering stuck in the back of my head, so ignore the comment that was here. Access should be coalesced doing it this way.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文