Local data store vs. texture cache on the Cayman architecture for scientific computing
I am trying to implement GEMM using AMD-APP-SDK 2.4 on an ATI HD 6990 card (Cayman architecture).
One of the optimizing techniques is the use of blocking/tiling.
When implementing it, is it faster to store the sub-matrices in shared local memory, or to read them through the texture cache? If possible, please give the reason as well.
Please also suggest which is easier to implement.
Thanks.
P.S. I want it for single precision only, if it matters!
Note: The size of the sub-matrix is not fixed, but I feel that the larger it is, the better. The only constraint to take into consideration is that if the unit of memory access is 128 bits (4 single-precision values), then the block size should be a multiple of 4.
Comments (1)
The Cypress chips were used in the 5800 series Radeons. The 6900 series uses the Cayman core, which has several important differences, most notably that it is a VLIW4 architecture instead of the VLIW5 configuration used in earlier cores.
As always, the only definitive way to know which method is faster is to benchmark it. In particular, since you give no information about the size of the sub-matrices, it is hard to say where they will best fit.
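For reference, the blocking scheme the question describes can be sketched in plain C on the CPU. This is only an illustration of the technique under stated assumptions, not a Cayman-tuned kernel and not a claim about which memory path is faster: the names `sgemm_naive`, `sgemm_tiled`, and the tile edge `TILE` are hypothetical, and the small `a_tile`/`b_tile` arrays merely stand in for whatever fast storage is chosen. In an OpenCL kernel the same two arrays would typically be declared `__local` and filled cooperatively by the work-group with a `barrier(CLK_LOCAL_MEM_FENCE)` between the load and compute phases.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge. Kept a multiple of 4 so that each tile row
 * is a whole number of 128-bit (4 x float) units, as the question notes. */
#define TILE 16

/* Reference: C = A * B for row-major n x n single-precision matrices. */
static void sgemm_naive(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* Tiled version: each TILE x TILE block of A and B is first copied into a
 * small staging array (the stand-in for local memory), then reused for
 * every multiply-add that needs it. Edge tiles are zero-padded, so n does
 * not have to be a multiple of TILE. */
static void sgemm_tiled(size_t n, const float *A, const float *B, float *C)
{
    float a_tile[TILE][TILE];
    float b_tile[TILE][TILE];

    assert(TILE % 4 == 0);

    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < n; i0 += TILE)
        for (size_t j0 = 0; j0 < n; j0 += TILE)
            for (size_t k0 = 0; k0 < n; k0 += TILE) {
                /* Load phase: stage one tile of A and one tile of B. */
                for (size_t r = 0; r < TILE; ++r)
                    for (size_t c = 0; c < TILE; ++c) {
                        a_tile[r][c] = (i0 + r < n && k0 + c < n)
                                     ? A[(i0 + r) * n + (k0 + c)] : 0.0f;
                        b_tile[r][c] = (k0 + r < n && j0 + c < n)
                                     ? B[(k0 + r) * n + (j0 + c)] : 0.0f;
                    }

                /* Compute phase: accumulate this tile's partial products. */
                for (size_t i = 0; i < TILE && i0 + i < n; ++i)
                    for (size_t j = 0; j < TILE && j0 + j < n; ++j) {
                        float acc = C[(i0 + i) * n + (j0 + j)];
                        for (size_t k = 0; k < TILE; ++k)
                            acc += a_tile[i][k] * b_tile[k][j];
                        C[(i0 + i) * n + (j0 + j)] = acc;
                    }
            }
}
```

Timing a kernel built this way against one that reads the same tiles through image (texture) reads, for a few values of `TILE`, is the kind of benchmark that would actually settle the question on a given card.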