What is wrong with Metal's simdgroup_load or simdgroup_store?
OS: macOS 12.2.1
Hardware: MacBook Pro 2020, M1
Metal: 2.4
Xcode: 13.2.1
Here is my test compute kernel, which reads the input buffer with simdgroup_load and writes the output buffer with simdgroup_store:
kernel void fun(
    const device half * Src [[ buffer(0) ]],
    constant uint4 & SrcShape [[ buffer(1) ]],
    device half * Dst [[ buffer(2) ]],
    constant uint4 & DstShape [[ buffer(3) ]],
    const device half * Weight [[ buffer(4) ]],
    ushort3 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
    ushort3 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
    ushort3 threads_per_threadgroup [[ threads_per_threadgroup ]],
    ushort3 thread_position_in_grid [[ thread_position_in_grid ]])
{
    const int SrcSlices = (int)SrcShape[0];
    const int SrcHeight = (int)SrcShape[1];
    const int SrcWidth = (int)SrcShape[2];
    const int DstSlices = (int)DstShape[0];
    const int DstHeight = (int)DstShape[1];
    const int DstWidth = (int)DstShape[2];
    const int Kernel_X = 3;
    const int KernelElemNum = 3 * 3;
    const int N_Pack = 8;

    // Test only 1 thread
    if (thread_position_in_grid.z != 0 || thread_position_in_grid.y != 0 || thread_position_in_grid.x * N_Pack != 0) return;

    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);
    simdgroup_store(sgMatY, Dst);
}
It's a simple shader; however, the output buffer only receives the first 2 values from the input buffer, and the other 62 values are all zero.
Here is the result from the Xcode Metal capture (screenshot omitted).
How can I debug or fix this?
Comments (1)
Bug fixed.
Anyone who intends to use this feature can refer to the implementation in TF-Lite:
https://github.com/alpa-projects/tensorflow-alpa/blob/ee8f6612b515ada4509fa53491c5ba5b3ef8524a/tensorflow/lite/delegates/gpu/common/tasks/conv_metal_simd.cc
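The answer does not say what the fix was. One plausible explanation, offered here as an assumption rather than the poster's confirmed diagnosis, is that simdgroup_load and simdgroup_store are cooperative: the 8x8 matrix is spread across all threads of a SIMD-group, so with 64 elements and a SIMD width of 32 each thread holds 2 elements. If every thread but one returns early, only that thread's 2 elements would be written, which matches the symptom in the question. The sketch below is a toy CPU model in plain C++, not Metal code. The lane-to-element mapping (lane i owns elements 2i and 2i+1) and the name `cooperative_copy` are hypothetical and chosen only for illustration; the real hardware layout is not specified here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy CPU model, NOT the real hardware layout.
// Assumptions: a SIMD-group has 32 lanes, and lane i owns elements
// 2*i and 2*i+1 of the 8x8 matrix (hypothetical mapping).
constexpr std::size_t kLanes = 32;
constexpr std::size_t kElems = 8 * 8;
constexpr std::size_t kPerLane = kElems / kLanes;  // 2 elements per lane

// Copies src to dst the way a cooperative load + store would behave if only
// the first `active_lanes` lanes reached the calls. Elements owned by lanes
// that did not participate stay zero. Returns the number of elements written.
std::size_t cooperative_copy(const std::array<float, kElems>& src,
                             std::array<float, kElems>& dst,
                             std::size_t active_lanes) {
    assert(active_lanes <= kLanes);
    dst.fill(0.0f);
    std::size_t written = 0;
    for (std::size_t lane = 0; lane < active_lanes; ++lane) {
        for (std::size_t k = 0; k < kPerLane; ++k) {
            const std::size_t i = lane * kPerLane + k;  // element owned by this lane
            dst[i] = src[i];
            ++written;
        }
    }
    return written;
}
```

In this model, calling `cooperative_copy(src, dst, 1)` writes only 2 elements, while passing 32 copies the whole matrix. Under the same assumption, the corresponding change in the kernel would be to drop the per-thread early return (or make the condition uniform across the SIMD-group) so that every thread of the SIMD-group reaches both calls.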