Passing the CUDA random generator state by reference
Is the following code correct when passing the random generator state (CUDA toolkit 3.2, curand.lib) by reference in the functions CalculateValue(curandState *localState) and GetExponential(curandState *localState)?
Thanks
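#include <stdio.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>

/* CUDA_CALL and the launch dimensions are not shown in the post; these definitions are assumed. */
#define CUDA_CALL(x) do { if ((x) != cudaSuccess) { \
    printf("Error at %s:%d\n", __FILE__, __LINE__); return 1; } } while (0)
const int threadsPerBlock = 64, totalBlocks = 64;
const int totalThreads = threadsPerBlock * totalBlocks;

__device__ double GetExponential(curandState *localState) {
    /* curand_uniform_double returns a value in (0, 1], so log(u1) is finite */
    double u1 = curand_uniform_double(localState);
    /* The post omits the return; a unit-rate inverse-transform sample is assumed here */
    return -log(u1);
}
__device__ double CalculateValue(curandState *localState) {
    double x = GetExponential(localState);
    return x;
}
__global__ void RunMonteCarloKernel(curandState *state, double *results) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[i];
    results[i] = CalculateValue(&localState);
    /* Copy state back to global memory */
    state[i] = localState;
}
__global__ void setup_kernel(curandState *state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets different seed, a different sequence number, no offset */
    curand_init(i, i, 0, &state[i]);
}
int main(void) {
    double *devResults;
    curandState *devStates;
    /* Allocate space for results and prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, totalThreads * sizeof(double)));
    CUDA_CALL(cudaMalloc((void **)&devStates, totalThreads * sizeof(curandState)));
    /* Setup prng states */
    setup_kernel<<<totalBlocks, threadsPerBlock>>>(devStates);
    for (int i = 0; i < 1000; i++) {
        /* A kernel must be launched with an execution configuration */
        RunMonteCarloKernel<<<totalBlocks, threadsPerBlock>>>(devStates, devResults);
    }
    CUDA_CALL(cudaFree(devStates));
    CUDA_CALL(cudaFree(devResults));
    return 0;
}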
Comments (1)
Is there a problem? It looks OK.
You may want to check out the EstimatePiInlineP sample in the MonteCarloCURAND directory of the 3.2 SDK. It uses C++-style pass-by-reference to avoid taking the address of a local variable. You would still need to store the state back to global memory at the end of the kernel (as you do in your code).
Passing by C++ reference can help the compiler by making it clear that the function can operate on the data directly in the original registers. Taking the address of a local array on a GPU can hurt performance: if the compiler cannot be certain that all threads handle the pointer identically (i.e. perform identical operations on the pointer), it will spill the array to local memory. It'll work, but it may be slower.
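To make the reference-passing suggestion concrete, here is a minimal sketch of what it could look like. It is not copied from the SDK sample. The names GetExponentialRef, setupStates, sampleKernel and sampleExponentials, the seed 1234 and the block size of 64 are all made up for illustration, and the unit-rate inverse-transform formula is an assumption about what GetExponential is meant to compute.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Takes the generator state by C++ reference, so the caller never writes
// &localState. The cuRAND call itself still takes a pointer.
__device__ double GetExponentialRef(curandState &localState) {
    // curand_uniform_double returns a value in (0, 1], so the log is finite.
    return -log(curand_uniform_double(&localState));
}

__global__ void setupStates(curandState *state, unsigned long long seed, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Same seed, different sequence number per thread, no offset.
    if (i < n) curand_init(seed, i, 0, &state[i]);
}

__global__ void sampleKernel(curandState *state, double *results, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    curandState localState = state[i];           // per-thread working copy
    results[i] = GetExponentialRef(localState);  // passed by reference
    state[i] = localState;                       // store the advanced state back
}

// Hypothetical host helper: returns n samples, or an empty vector on a CUDA error.
std::vector<double> sampleExponentials(int n) {
    std::vector<double> host(n);
    curandState *devStates = NULL;
    double *devResults = NULL;
    const int threads = 64;  // assumed block size
    const int blocks = (n + threads - 1) / threads;
    bool ok = cudaMalloc((void **)&devStates, n * sizeof(curandState)) == cudaSuccess
           && cudaMalloc((void **)&devResults, n * sizeof(double)) == cudaSuccess;
    if (ok) {
        setupStates<<<blocks, threads>>>(devStates, 1234ULL, n);
        sampleKernel<<<blocks, threads>>>(devStates, devResults, n);
        // cudaMemcpy waits for the kernels above and reports any earlier error.
        ok = cudaMemcpy(host.data(), devResults, n * sizeof(double),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(devStates);
    cudaFree(devResults);
    if (!ok) host.clear();
    return host;
}
```

Calling sampleExponentials(256) from host code should give back 256 non-negative values. Whether the reference version is actually faster than the pointer version depends on the compiler and on inlining, so it is worth comparing the two on your own kernel.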