float vs. int in CUDA

Is it better to use a float instead of an int in CUDA? Does a float decrease bank conflicts and ensure coalescing? (Or does it have nothing to do with this?)
4 answers
Bank conflicts when reading shared memory are all about the amount of data read. So, since int and float are the same size (at least I think they are on all CUDA platforms), there's no difference. Coalescing usually refers to global memory accesses, and again, this depends on the number of bytes read, not the data type.
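As a sketch of why the type doesn't matter here (the kernel names below are made up for illustration): shared-memory banks are assigned by byte address, 4 bytes per bank across 32 banks, so a 4-byte int and a 4-byte float indexed the same way hit exactly the same banks.

```cuda
// Hypothetical kernels: one thread per element, one 256-thread block.
// Bank = (byte address / 4) % 32, so both kernels below have the same
// conflict-free pattern: thread i hits bank i % 32.
__global__ void smemFloat(float *out) {
    __shared__ float tile[256];
    int t = threadIdx.x;
    tile[t] = (float)t;   // thread i -> bank i % 32: no conflict
    __syncthreads();
    out[t] = tile[t];
}

__global__ void smemInt(int *out) {
    __shared__ int tile[256];
    int t = threadIdx.x;
    tile[t] = t;          // identical byte offsets, identical banks
    __syncthreads();
    out[t] = tile[t];
}
```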
Both int and float are four bytes, so it makes no difference (if you're accessing them both the same way) which one you use, in terms of coalescing your global memory accesses or bank conflicts on shared memory accesses.

Having said that, you may get better performance with floats, since the devices are designed to crunch them as fast as possible; ints are often used for control and indexes etc., and hence have lower performance. Of course it's really more complicated than that: if you had nothing but floats, the integer hardware would sit idle, which would be a waste.
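The usual division of labor described above can be seen in a SAXPY-style kernel (a minimal sketch, not from the original answer): ints do the control and index arithmetic while the floating-point units do the data math.

```cuda
// Sketch: ints for indexing/control, floats for the data path.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // integer index arithmetic
    if (i < n)                                      // integer control
        y[i] = a * x[i] + y[i];                     // floating-point data math
}
```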
Bank conflicts and coalescing are all about memory access patterns (whether the threads within a warp all read/write to different locations with uniform stride). Thus, these concerns are independent of data type (float, int, double, etc.).

Note that data type does have an impact on computation performance: single-precision float is faster than double precision, etc. The beefy FPUs in GPUs generally mean that doing calculations in fixed point is unnecessary and may even be detrimental.
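To illustrate that it is the access pattern, not the element type, that decides coalescing (a hedged sketch with made-up kernel names): a unit-stride read coalesces, a large stride does not, and swapping float for int in either kernel would change nothing.

```cuda
// Adjacent threads read adjacent 4-byte words: one coalesced transaction
// per warp (on current hardware).
__global__ void copyCoalesced(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];              // adjacent threads -> adjacent addresses
}

// Same element type, different pattern: scattered addresses within a warp
// force multiple memory transactions, regardless of float vs. int.
__global__ void copyStrided(const float *in, float *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];     // large stride: poor coalescing
}
```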
Take a look at the "Mathematical Functions" section of the CUDA Developers Guide. Using device runtime functions (intrinsic functions) may provide better performance for various types: you can perform multiple operations in one instruction, in fewer clock cycles. You may also find that operations like x + c*y can be performed with a single function.
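One concrete instance of the "x + c*y in one function" idea is the fused multiply-add intrinsic `__fmaf_rn(a, b, c)`, which computes a*b + c in a single operation with a single rounding step (a sketch; the kernel name is made up):

```cuda
// __fmaf_rn(x, c, y) computes x*c + y as one fused multiply-add
// instruction with round-to-nearest, instead of a multiply followed
// by a separately rounded add.
__global__ void fmaExample(const float *x, const float *y,
                           float c, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __fmaf_rn(x[i], c, y[i]);  // fused: x[i]*c + y[i]
}
```

Note that nvcc will usually contract `a*b + c` into an FMA on its own; the intrinsic just makes the choice explicit.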