Generating random numbers from a Gaussian distribution in CUDA

Published 2024-10-13

I've searched a lot over the internet for a way to generate random numbers on my CUDA device, within a kernel. The numbers must come from a Gaussian distribution.

The best thing I found was from NVIDIA itself: the Wallace algorithm, which uses a uniform distribution to build a Gaussian one. But the code samples they give lack explanation, and I really need to understand how the algorithm works, especially on the device. For example, they give:

    __device__ void generateRandomNumbers_wallace(
        unsigned seed,           // Initialization seed
        float *chi2Corrections,  // Set of correction values
        float *globalPool,       // Input random number pool
        float *output )          // Output random numbers
    {
        unsigned tid = threadIdx.x;
        // Load global pool into shared memory.
        unsigned offset = __mul24(POOL_SIZE, blockIdx.x);
        for( int i = 0; i < 4; i++ )
            pool[tid + THREADS*i] = globalPool[offset + TOTAL_THREADS*i + tid];
        __syncthreads();

        const unsigned lcg_a = 241;
        const unsigned lcg_c = 59;
        const unsigned lcg_m = 256;
        const unsigned mod_mask = lcg_m - 1;
        seed = (seed + tid) & mod_mask;

        // Loop generating outputs repeatedly
        for( int loop = 0; loop < OUTPUTS_PER_RUN; loop++ )
        {
            Transform();
            unsigned intermediate_address;
            i_a = __mul24(loop, 8*TOTAL_THREADS) + 8*THREADS*blockIdx.x + threadIdx.x;
            float chi2CorrAndScale = chi2Corrections[blockIdx.x * OUTPUTS_PER_RUN + loop];
            for( i = 0; i < 4; i++ )
                output[i_a + i*THREADS] = chi2CorrAndScale * pool[tid + THREADS*i];
        }
    }

First of all, many of the declared variables aren't even used in the function! And I really don't get what the "8" is for in the second loop. I understand that the "4" in the other loops has something to do with the 4x4 orthogonal-matrix block, am I right? Could anyone give me a better idea of what is going on here?

Anyway, does anyone have any good code samples I could use? Or does anyone have another way of generating random gaussian numbers in a CUDA kernel? Code samples will be much appreciated.

Thanks!

黯然 2024-10-20 02:47:31

You could use CURAND, which is included with the CUDA Toolkit (version 3.2 and later). It'd be far simpler!

A few notes on the code you posted:

  • The Wallace generator transforms Gaussian to Gaussian (i.e. not Uniform to Gaussian)
  • CUDA code has two implicit variables: blockIdx and threadIdx - these give the block index and the thread index within a block; see the CUDA Programming Guide for more information
  • The code uses __mul24; on sm_20 and later this is actually slower than an "ordinary" 32-bit multiplication, so I would avoid it (even on older architectures, for simplicity)
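
For reference, a minimal sketch of what the CURAND device API looks like in practice (the kernel name, launch configuration, and seed below are illustrative, not from the original answer):

```cuda
#include <curand_kernel.h>
#include <cstdio>

// Each thread seeds its own CURAND state, then draws standard
// normally distributed floats with curand_normal().
__global__ void gaussianKernel(float *out, int n, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;

    curandState state;
    // Same seed, a different subsequence per thread -> independent streams.
    curand_init(seed, id, 0, &state);

    out[id] = curand_normal(&state);  // mean 0, stddev 1
}

int main()
{
    const int n = 256;
    float *d_out, h_out[n];
    cudaMalloc(&d_out, n * sizeof(float));

    gaussianKernel<<<(n + 127) / 128, 128>>>(d_out, n, 1234ULL);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("first sample: %f\n", h_out[0]);
    cudaFree(d_out);
    return 0;
}
```

If you need two values per thread, curand_normal2() returns a float2 and is cheaper than two separate curand_normal() calls.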
婉儿 2024-10-20 02:47:31

The Box-Muller method is also good.
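
As a sketch of that idea: the Box-Muller transform maps two independent uniform samples in (0,1] to two independent standard normals. A hypothetical device helper (not from the answer) might look like:

```cuda
#include <math.h>

// Box-Muller: turn two uniforms u1, u2 in (0,1] into two
// independent standard normals. u1 must be strictly positive
// so that logf(u1) is finite.
__host__ __device__ inline void boxMuller(float u1, float u2,
                                          float *z0, float *z1)
{
    float r     = sqrtf(-2.0f * logf(u1));
    float theta = 2.0f * 3.14159265358979f * u2;
    *z0 = r * cosf(theta);
    *z1 = r * sinf(theta);
}
```

The uniforms can come from any device-side generator; pairing the outputs makes the method a natural fit for two-samples-per-thread kernels.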

故人如初 2024-10-20 02:47:31

The fast Walsh-Hadamard transform is computed with patterns of addition and subtraction, so the central limit theorem applies: an array of uniform random numbers that undergoes a Walsh-Hadamard transform will have an approximately Gaussian/normal distribution. There are some slight technical details about that. The algorithm was not discovered by Wallace; I first published it in Servo Magazine around 1993/1994.
I have code for the Walsh-Hadamard transform at www.code.google.com/p/lemontree
Regards,
Sean O'Connor
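
To illustrate the addition/subtraction pattern described above, a minimal in-place fast Walsh-Hadamard transform might be sketched like this (this is not the answerer's published code):

```cuda
// In-place fast Walsh-Hadamard transform of an array whose length
// n is a power of two. Every stage only adds and subtracts pairs
// of elements, so by the central limit theorem the outputs of a
// uniform random input tend toward a Gaussian shape (after
// rescaling by 1/sqrt(n)).
__host__ __device__ inline void fwht(float *data, int n)
{
    for (int h = 1; h < n; h *= 2) {
        for (int i = 0; i < n; i += 2 * h) {
            for (int j = i; j < i + h; j++) {
                float a = data[j];
                float b = data[j + h];
                data[j]     = a + b;  // butterfly: sum
                data[j + h] = a - b;  // butterfly: difference
            }
        }
    }
}
```

Each output element is a sum of n inputs with +/-1 signs, which is exactly the setting where the central limit theorem kicks in.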
