OpenCL: does using a global pointer that points to different memory locations create branching?
I have a kernel like this:
kernel void k1(global int *a, global int *b, global int *c) {
    int i = get_local_id(0);
    global int *ptr;
    if (i == 0) ptr = &a[0];
    if (i == 1) ptr = &b[0];
    if (i == 2) ptr = &c[0];
    if (i > 2) return;
    ptr[0]++;
}
My question is: will the line ptr[0]++ create branching when it accesses memory? Or will it save time compared to doing a[0]++; b[0]++; c[0]++; within the same work-item?
Thanks for your time.
Comments (1)
No, this does not create branching, but it will break coalesced memory access at least in some workgroups.
You can also replace the if (i==0) ptr = &a[0]; statements entirely, either with ternary operators (condition ? true_action : false_action) or with bit masking. Simple if statements like these, and the ternary operator, are most likely to be converted to branchless assembly. If it weren't for the guard clause (if (i>2) return;), the entire kernel would then be branchless.
Although there is no branching, the more important point is that the arithmetic determining which memory location is accessed will break coalesced memory access in some workgroups. This matters especially with more complex pointer arithmetic than you have here, for example when a lookup table determines the pointer value. The loss of coalescing can cause a much larger slowdown than simple branching would (although branching wouldn't resolve the misaligned memory access either).
In some applications, loading more global memory in a coalesced manner can even be faster than reducing the number of global memory accesses but breaking coalescing in the process.
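As a rough illustration of the ternary replacement mentioned above, here is a minimal sketch. The helper name pick_target and the kernel name k1_ternary are made up for this example and are not from the original post. The #ifndef shim exists only so the helper can be checked with an ordinary C compiler. Whether a given compiler actually emits branchless code for this depends on the device and driver, so treat it as something to verify rather than a guarantee.

```c
/* Shim so the helper also compiles as plain C for a quick host-side check.
   An OpenCL C compiler predefines __OPENCL_VERSION__ and skips this. */
#ifndef __OPENCL_VERSION__
#define global
#endif

/* Hypothetical helper: selects the target buffer with nested ternaries
   instead of three separate if statements. Intended for i in 0..2;
   any other value falls through to c. */
global int *pick_target(int i, global int *a, global int *b, global int *c)
{
    return (i == 0) ? a : (i == 1) ? b : c;
}

#ifdef __OPENCL_VERSION__
/* Same behavior as the kernel in the question, written with the helper. */
kernel void k1_ternary(global int *a, global int *b, global int *c)
{
    int i = get_local_id(0);
    if (i > 2) return;             /* guard clause: the one remaining branch */
    pick_target(i, a, b, c)[0]++;  /* work-items 0..2 touch three unrelated addresses */
}
#endif
```

Note that this only changes how the pointer is chosen. The three active work-items still write to three unrelated buffers, so the memory access pattern, and with it the coalescing concern, is the same as in the original kernel.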