PyCUDA GPUArray slice-based operations
The PyCUDA documentation is a bit light on examples for those of us in the 'Non-Guru' class, but I'm wondering which array operations are available on gpuarrays, i.e. if I wanted to gpuarray this loop:
import numpy as np

K, N = 4, 8  # example sizes
m = np.random.random((K, N, N))
a = np.zeros_like(m)
b = np.random.random(N)  # example
for k in range(K):
    for x in range(N):
        for y in range(N):
            a[k, x, y] = m[k, x, y] * b[y]
The regular first-stop Python reduction for this would be something like:
for k in range(K):
    for x in range(N):
        a[k, x, :] = m[k, x, :] * b
But I can't see any simple way to do this with GPUArray other than writing a custom elementwise kernel. Even then, with this problem there would have to be looping constructs in the kernel, and at that point of complexity I'm probably better off just writing my own full-blown SourceModule kernel.
Can anyone clue me in?
Comments (2)
That is probably best done with your own kernel. While PyCUDA's gpuarray class is a really convenient abstraction of GPU memory into something which can be used interchangeably with numpy arrays, there is no getting around the need to code for the GPU for anything outside of the canned linear algebra and parallel reduction operations.
That said, it is a pretty trivial little kernel to write. So trivial that it would be memory bandwidth bound - you might want to see if you can "fuse" a few like operations together to improve the ratio of FLOPS to memory transactions a bit.
If you need some help with the kernel, drop in a comment, and I can expand the answer to include a rough prototype.
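A rough prototype along those lines might look like the sketch below. It is not the answerer's code. It assumes C-contiguous float64 arrays and relies on the fact that, in row-major order, the flat index `i` of `a[k, x, y]` satisfies `y == i % N`, so the kernel body needs no loop. The function names `scale_last_axis_flat` and `scale_last_axis_gpu` are made up for illustration. The PyCUDA imports sit inside the GPU function so that the NumPy emulation of the indexing can be checked on a machine without CUDA.

```python
import numpy as np


def scale_last_axis_flat(m, b):
    """NumPy emulation of the kernel body: a[i] = m[i] * b[i % n]."""
    n = b.shape[0]
    flat = np.ascontiguousarray(m).ravel()  # row-major flat view of m
    i = np.arange(flat.size)                # one "thread index" per element
    return (flat * b[i % n]).reshape(m.shape)


def scale_last_axis_gpu(m, b):
    """Compute a[k, x, y] = m[k, x, y] * b[y] with an ElementwiseKernel.

    Assumes a CUDA-capable device and an installed PyCUDA.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    m = np.ascontiguousarray(m, dtype=np.float64)
    b = np.ascontiguousarray(b, dtype=np.float64)

    # The first vector argument (a) determines how many elements are processed,
    # so b may be shorter than a and m.
    kernel = ElementwiseKernel(
        "double *a, double *m, double *b, int n",
        "a[i] = m[i] * b[i % n]",
        "scale_last_axis",
    )

    m_gpu = gpuarray.to_gpu(m)
    b_gpu = gpuarray.to_gpu(b)
    a_gpu = gpuarray.empty_like(m_gpu)
    kernel(a_gpu, m_gpu, b_gpu, np.int32(b.shape[0]))
    return a_gpu.get()
```

On a machine with a GPU, `scale_last_axis_gpu(m, b)` should agree with the NumPy broadcast `m * b`. As the answer notes, a kernel this small is likely limited by memory bandwidth rather than arithmetic.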
You can also use the memcpy_dtod() method and the slicing functionality of gpuarrays. It's strange that normal assignment does not work. set() does not work because it assumes a host-to-device transfer (using memcpy_htod()).
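One way to read this suggestion is sketched below, assuming C-contiguous float64 arrays. Each row product is computed as a temporary GPUArray and then copied into the matching slice of the output with `pycuda.driver.memcpy_dtod(dest, src, size)`. The function name `scale_rows_with_dtod` is hypothetical, and the PyCUDA imports are kept inside the function so the module still loads without CUDA.

```python
import numpy as np


def scale_rows_with_dtod(m, b):
    """Fill a[k, x, :] = m[k, x, :] * b using GPUArray slices and memcpy_dtod.

    Assumes a CUDA-capable device and an installed PyCUDA.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    m_gpu = gpuarray.to_gpu(np.ascontiguousarray(m, dtype=np.float64))
    b_gpu = gpuarray.to_gpu(np.ascontiguousarray(b, dtype=np.float64))
    a_gpu = gpuarray.empty_like(m_gpu)

    K, N, _ = m_gpu.shape
    for k in range(K):
        for x in range(N):
            # Slicing yields a contiguous 1-D view; the product is a new array.
            row = m_gpu[k, x] * b_gpu
            # Device-to-device copy of the result into the output slice.
            dst = a_gpu[k, x]
            drv.memcpy_dtod(dst.gpudata, row.gpudata, row.nbytes)
    return a_gpu.get()
```

This launches one small multiply and one copy per row, so for large K and N it is likely to be much slower than a single kernel over the whole array.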