OpenCL bank conflicts - lost memory / corrupted data?
I apologize in advance for the vagueness of this question.
Background:
I am attempting to write a morphological image processing function in OpenCL. I have a __local buffer which I use to store data for every pixel (each pixel is represented by a work-item, no loop unrolling yet). Also, since I am early in testing, I am only using a single work-group (8x8 pixel image so I can manually validate results).
Problem:
There are occasions when data from one, two, three, or even four pixels must be added into the pixel buffer of another. Since these are adjacent pixels in the same work-group, I am sure I am causing local memory bank conflicts. That's OK, speed isn't my top priority (yet!). However, these bank conflicts seem to be dropping data and even corrupting it. I've been very careful not to overflow or overrun the buffers.
So, my first question is: is it in fact possible that the bank conflicts are causing data corruption and loss? The OpenCL spec seems to indicate that the accesses should be serialized, reducing bandwidth, but there is no mention of data loss.
My second question is: Help! - What can I do about this?
Any guidance will be greatly appreciated - thanks!
Comments (1)
Maybe the NVIDIA whitepaper "Prefix Sum (Scan) with CUDA" can put you on the right track. It is about the all-prefix-sums algorithm, which is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm.
The all-prefix-sums operation turns a list of numbers such as [3,4,1,2] into the list of its running sums: [0,3,7,8].
I know the paper is about CUDA, but I found that the resulting kernels are very similar, as both technologies use similar concepts.
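To make the operation concrete, here is a minimal sketch in plain Python that simulates the work-efficient (up-sweep / down-sweep) scan on the host. It is not code from the whitepaper and not a kernel: the function names are made up for illustration, and the input length is assumed to be a power of two. Each iteration of an inner loop stands in for one work-item, and the boundary between two levels is where a real OpenCL kernel would need a `barrier(CLK_LOCAL_MEM_FENCE)`.

```python
def exclusive_scan_sequential(values):
    """Reference version: out[i] is the sum of values[0..i-1]."""
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def exclusive_scan_two_phase(values):
    """Host-side simulation of the up-sweep / down-sweep scan.

    Assumption for this sketch: len(values) is a power of two.
    """
    n = len(values)
    if n == 0:
        return []
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    data = list(values)  # stands in for the __local buffer

    # Up-sweep (reduce): build partial sums in place, level by level.
    stride = 1
    while stride < n:
        # Each i would be handled by a separate work-item.
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2  # a kernel would synchronize here

    # Down-sweep: clear the last element, then push sums back down.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2  # a kernel would synchronize here
    return data


if __name__ == "__main__":
    print(exclusive_scan_sequential([3, 4, 1, 2]))
    print(exclusive_scan_two_phase([3, 4, 1, 2]))
```

Within one level, no two simulated work-items write to the same element. That property is what lets the parallel version accumulate neighbouring values without several work-items updating one location at the same time.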
I hope the paper helps.
Cheers