PyCUDA: C/C++ includes?

Posted on 2024-10-31 17:41:00


Something that isn't really mentioned anywhere (at least that I can see) is what library functions are exposed to inline CUDA kernels.

Specifically, I'm doing small/stupid matrix multiplications that don't deserve to be individually offloaded to the GPU, but I am offloading a larger section of the algorithm which includes this multiplication. No one ever liked using their own linalg functions, since someone has always done it better.

TL;DR: What libraries can I play with inside inline kernels under PyCUDA?
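For concreteness, this is the kind of setup I mean; the kernel body and names below are just placeholders to illustrate where such library calls would have to live:

import numpy as np
import pycuda.autoinit                      # creates a context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Whatever gets called inside this string has to be visible to nvcc at compile time.
mod = SourceModule("""
__global__ void scale(float *x, float a)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    x[i] *= a;   // ideally this would be a call into some device-side linalg library
}
""")

scale = mod.get_function("scale")
x = np.arange(32, dtype=np.float32)
scale(drv.InOut(x), np.float32(2.0), block=(32, 1, 1), grid=(1, 1))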


Comments (1)

牵你的手,一向走下去 2024-11-07 17:41:00


I don't know of any, and I always thought it would be useful to have such a thing.

For the size of problems that I usually work with (small matrices and tensors that arise in the finite element method), I just wrote C++ templates to do the operations. Templating the functions allows the compiler to know the trip counts at compile time, so it can unroll the loops and keep results or intermediate results in registers, which tends to be very efficient for kernel throughput. So the matrix-matrix product gets declared as:

// idx2c(i, j, ld) is assumed to be a column-major indexing helper,
// e.g. something like #define idx2c(i, j, ld) (((j) * (ld)) + (i))
template <typename Real, unsigned int l, unsigned int m, unsigned int n>
__device__ __host__
void matmul(const Real *a,
            const Real *b,
                  Real *c)
{
    // c (l x n) = a (l x m) * b (m x n); the trip counts are template
    // parameters, so the compiler can fully unroll these loops.
    for (int i = 0; i < l; i++) {
        for (int j = 0; j < n; j++) {
            Real dotprod = Real(0);
            for (int k = 0; k < m; k++) {
                dotprod += a[idx2c(i, k, l)] * b[idx2c(k, j, m)];
            }
            c[idx2c(i, j, l)] = dotprod;
        }
    }
}

For the sort of sizes that crop up in my kernels (2x2, 3x3, 4x4, 8x8, 9x9), doing the above and letting the compiler work things out seems to be as good as any other approach I have tried. Because at the thread level CUDA is effectively scalar, there aren't any vector primitives or the like that can be used to accelerate these sorts of small operations.
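By way of illustration only (the idx2c definition, kernel, and sizes here are my assumptions, not part of any library API), a sketch of how such a template might be instantiated from an inline kernel under PyCUDA: SourceModule is given no_extern_c=True so nvcc accepts the C++ template, and the launched kernel itself is wrapped in extern "C" so its name is not mangled.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

src = r"""
#define idx2c(i, j, ld) (((j) * (ld)) + (i))   // assumed column-major indexing helper

template <typename Real, unsigned int l, unsigned int m, unsigned int n>
__device__ void matmul(const Real *a, const Real *b, Real *c)
{
    for (int i = 0; i < l; i++)
        for (int j = 0; j < n; j++) {
            Real dotprod = Real(0);
            for (int k = 0; k < m; k++)
                dotprod += a[idx2c(i, k, l)] * b[idx2c(k, j, m)];
            c[idx2c(i, j, l)] = dotprod;
        }
}

// One 3x3 product per thread; extern "C" keeps the kernel name unmangled.
extern "C" __global__ void many_matmul3(const float *A, const float *B,
                                        float *C, int nmat)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < nmat)
        matmul<float, 3, 3, 3>(A + 9 * t, B + 9 * t, C + 9 * t);
}
"""

mod = SourceModule(src, no_extern_c=True)   # no_extern_c so the template compiles as C++
kern = mod.get_function("many_matmul3")

nmat = 1024
A = np.random.rand(nmat, 9).astype(np.float32)
B = np.random.rand(nmat, 9).astype(np.float32)
C = np.empty_like(A)
kern(drv.In(A), drv.In(B), drv.Out(C), np.int32(nmat),
     block=(128, 1, 1), grid=((nmat + 127) // 128, 1))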
