128-bit integers on CUDA?
I just managed to install the CUDA SDK under Linux Ubuntu 10.04. My graphics card is an NVIDIA GeForce GT 425M, and I'd like to use it for some heavy computational problems.
What I wonder is: is there any way to use an unsigned 128-bit int variable? When running my program on the CPU with gcc, I was using the __uint128_t type, but using it with CUDA doesn't seem to work.
Is there anything I can do to have 128-bit integers on CUDA?
For best performance, one would want to map the 128-bit type on top of a suitable CUDA vector type, such as uint4, and implement the functionality using PTX inline assembly. The addition would look something like this:
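A sketch of what that addition can look like, assuming the 128-bit value is mapped onto a uint4 with .x holding the least significant 32-bit word (the type alias and function name are illustrative):

```cuda
// Hedged sketch: a 128-bit unsigned integer mapped onto CUDA's uint4
// vector type (x = least significant word, w = most significant), with
// the addition done via PTX's carry-propagating add instructions.
typedef uint4 my_uint128_t;

__device__ my_uint128_t add_uint128 (my_uint128_t addend, my_uint128_t augend)
{
    my_uint128_t res;
    asm ("add.cc.u32   %0, %4, %8;\n\t"   // lowest word, sets the carry flag
         "addc.cc.u32  %1, %5, %9;\n\t"   // propagate the carry upward
         "addc.cc.u32  %2, %6, %10;\n\t"
         "addc.u32     %3, %7, %11;\n\t"  // highest word, final carry dropped
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(addend.x), "r"(addend.y), "r"(addend.z), "r"(addend.w),
           "r"(augend.x), "r"(augend.y), "r"(augend.z), "r"(augend.w));
    return res;
}
```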
The multiplication can similarly be constructed using PTX inline assembly by breaking the 128-bit numbers into 32-bit chunks, computing the 64-bit partial products and adding them appropriately. Obviously this takes a bit of work. One might get reasonable performance at the C level by breaking the number into 64-bit chunks and using __umul64hi() in conjunction with regular 64-bit multiplication and some additions. This would result in the following:
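A hedged sketch of this 64-bit-chunk approach follows; the struct name is made up, and the host-side fallback for __umul64hi() is an illustrative assumption added so the routine can also be checked off-GPU:

```cpp
#include <cstdint>

// Sketch of the C-level approach described above: split the 128-bit
// operands into 64-bit halves and use __umul64hi() (the CUDA intrinsic
// returning the high 64 bits of a 64x64-bit product). The fallback below
// is an illustrative stand-in so the routine also compiles on the host.
#ifndef __CUDACC__
#define __host__
#define __device__
static std::uint64_t __umul64hi(std::uint64_t a, std::uint64_t b)
{
    return (std::uint64_t)(((unsigned __int128)a * b) >> 64);
}
#endif

struct my_uint128 { std::uint64_t lo, hi; };  // illustrative name

__host__ __device__ my_uint128 mul_uint128(my_uint128 a, my_uint128 b)
{
    my_uint128 res;
    res.lo = a.lo * b.lo;
    res.hi = __umul64hi(a.lo, b.lo)  // high half of the low partial product
           + a.lo * b.hi             // cross products: only their low
           + a.hi * b.lo;            // halves fit within the 128-bit result
    return res;
}
```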
Below is a version of the 128-bit multiplication that uses PTX inline assembly. It requires PTX 3.0, which shipped with CUDA 4.2, and the code requires a GPU with at least compute capability 2.0, i.e. a Fermi or Kepler class device. The code uses the minimal number of instructions, as sixteen 32-bit multiplies are needed to implement a 128-bit multiplication. By comparison, the variant above using CUDA intrinsics compiles to 23 instructions for an sm_20 target.
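A reconstruction of such a PTX-based multiply is sketched below (type and function names are illustrative; the carry chain should be checked against the PTX ISA before production use). Each column of the schoolbook product is accumulated with carry-propagating multiply-add instructions, for 16 32-bit multiplies in total:

```cuda
typedef uint4 my_uint128_t;  // x = least significant word

__device__ my_uint128_t mul_uint128 (my_uint128_t a, my_uint128_t b)
{
    my_uint128_t res;
    asm ("{\n\t"
         "mul.lo.u32      %0, %4, %8;    \n\t"  // r0  = lo(a0*b0)
         "mul.hi.u32      %1, %4, %8;    \n\t"  // r1  = hi(a0*b0)
         "mad.lo.cc.u32   %1, %4, %9, %1;\n\t"  // r1 += lo(a0*b1)
         "madc.hi.u32     %2, %4, %9,  0;\n\t"  // r2  = hi(a0*b1) + carry
         "mad.lo.cc.u32   %1, %5, %8, %1;\n\t"  // r1 += lo(a1*b0)
         "madc.hi.cc.u32  %2, %5, %8, %2;\n\t"  // r2 += hi(a1*b0) + carry
         "madc.hi.u32     %3, %4,%10,  0;\n\t"  // r3  = hi(a0*b2) + carry
         "mad.lo.cc.u32   %2, %4,%10, %2;\n\t"  // r2 += lo(a0*b2)
         "madc.hi.u32     %3, %5, %9, %3;\n\t"  // r3 += hi(a1*b1) + carry
         "mad.lo.cc.u32   %2, %5, %9, %2;\n\t"  // r2 += lo(a1*b1)
         "madc.hi.u32     %3, %6, %8, %3;\n\t"  // r3 += hi(a2*b0) + carry
         "mad.lo.cc.u32   %2, %6, %8, %2;\n\t"  // r2 += lo(a2*b0)
         "madc.lo.u32     %3, %4,%11, %3;\n\t"  // r3 += lo(a0*b3) + carry
         "mad.lo.u32      %3, %5,%10, %3;\n\t"  // r3 += lo(a1*b2)
         "mad.lo.u32      %3, %6, %9, %3;\n\t"  // r3 += lo(a2*b1)
         "mad.lo.u32      %3, %7, %8, %3;\n\t"  // r3 += lo(a3*b0)
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
           "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
    return res;
}
```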
CUDA doesn't support 128-bit integers natively. You can fake the operations yourself using two 64-bit integers.
Look at this post:
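As a minimal illustration of that approach (the type and function names here are made up for the example), a 128-bit addition over two 64-bit halves looks like this:

```cpp
#include <cstdint>

// Faking an unsigned 128-bit integer with two 64-bit halves, as the
// answer suggests. Detecting the carry out of the low half only needs
// an unsigned comparison, since the sum wraps around on overflow.
struct u128_fake { std::uint64_t lo, hi; };

u128_fake add128(u128_fake a, u128_fake b)
{
    u128_fake r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  // carry out of the low half
    return r;
}
```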
For posterity, note that as of CUDA 11.5, CUDA and nvcc support __int128_t in device code when the host compiler supports it (e.g., clang/gcc, but not MSVC). CUDA 11.6 added support for __int128_t in the debug tools. See:
A much-belated answer, but you could consider using this library:
https://github.com/curtisseizert/CUDA-uint128
which defines a 128-bit-sized structure, with methods and freestanding utility functions to get it to function as expected, allowing it to be used like a regular integer. Mostly.