memset in CUDA that allows setting values inside a kernel
I am making several cudaMemset calls in order to set my values to 0, as below:
void allocateByte(char **gStoreR, const int byte)
{
    char **cStoreR = (char **)malloc(N * sizeof(char *));
    for (int i = 0; i < N; i++) {
        char *c;
        cudaMalloc((void **)&c, byte * sizeof(char));
        cudaMemset(c, 0, byte);
        cStoreR[i] = c;
    }
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
}
However, this is proving to be very slow. Is there a memset function on the GPU, since calling it from the CPU takes a lot of time? Also, does cudaMalloc((void**)&c, byte*sizeof(char)) automatically set the bits that c points to to 0?
Answers (1)
Every cudaMemset call launches a kernel, so if N is large and byte is small, you will have a lot of kernel launch overhead slowing down the code. There is no device-side memset, so the solution would be to write a kernel which traverses the allocations and zeros the storage in a single launch, as sketched below.
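A minimal sketch of such a kernel, under the assumption that gStoreR already holds the N device pointers and that every buffer is byte bytes long (the kernel name and launch configuration are illustrative, not part of the original code):

// Hypothetical device-side zeroing kernel: each block clears one of the
// N buffers, so a single launch replaces N separate cudaMemset calls.
__global__ void zeroBuffers(char **buffers, int bytes)
{
    char *buf = buffers[blockIdx.x];            // one buffer per block
    for (int i = threadIdx.x; i < bytes; i += blockDim.x)
        buf[i] = 0;                             // threads stride over the bytes
}

// Possible launch, replacing the N cudaMemset calls in the loop:
// zeroBuffers<<<N, 256>>>(gStoreR, byte);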
As an aside, I would strongly recommend against using a structure of arrays in CUDA. It is a lot slower and much more complex to manage than achieving the same outcome using a single large block of linear memory and indexing into that memory. In your example, it would reduce the code to a single cudaMalloc call and a single cudaMemset call. On the device side, the pointer indirection, which is slow, gets replaced by a few integer operations, which are very fast. If your source material on the host is an array of structures, I would recommend using something like the excellent thrust::zip_iterator to get the data into a GPU-friendly form on the device.
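For illustration, a minimal sketch of that flat-memory approach (the function name allocateFlat and the variable names are hypothetical, not from the original code):

// Hypothetical replacement for allocateByte(): one contiguous block of
// n * bytes bytes, allocated and zeroed with a single call each.
char *allocateFlat(int n, int bytes)
{
    char *gStore = NULL;
    cudaMalloc((void **)&gStore, (size_t)n * bytes);
    cudaMemset(gStore, 0, (size_t)n * bytes);
    return gStore;
}

On the device, element j of logical row i is then gStore[i * bytes + j], so the per-row pointer lookup is replaced by a multiply and an add.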