What is wrong with Metal's simdgroup_load or simdgroup_store?

Posted 2025-01-18 00:33:58 · 1655 characters · 8 views · 0 comments

OS: macOS 12.2.1

Hardware: MacBook Pro 2020, M1

Metal: 2.4

Xcode: 13.2.1

Here is my test compute kernel, which reads the input buffer with simdgroup_load and writes the output buffer with simdgroup_store:

kernel void fun(
const device half * Src                 [[ buffer(0) ]],
constant uint4 & SrcShape               [[ buffer(1) ]],
device half * Dst                       [[ buffer(2) ]],
constant uint4 & DstShape               [[ buffer(3) ]],
const device half * Weight              [[ buffer(4) ]],
ushort3 threadgroup_position_in_grid    [[ threadgroup_position_in_grid ]],
ushort3 thread_position_in_threadgroup  [[ thread_position_in_threadgroup ]],
ushort3 threads_per_threadgroup         [[ threads_per_threadgroup ]],
ushort3 thread_position_in_grid         [[ thread_position_in_grid ]])
{

    const int SrcSlices = (int)SrcShape[0];
    const int SrcHeight = (int)SrcShape[1];
    const int SrcWidth  = (int)SrcShape[2];
    const int DstSlices = (int)DstShape[0];
    const int DstHeight = (int)DstShape[1];
    const int DstWidth  = (int)DstShape[2];
    const int Kernel_X = 3;
    const int KernelElemNum = 3 * 3;
    const int N_Pack = 8;

    // Test only 1 thread
    if (thread_position_in_grid.z != 0 || thread_position_in_grid.y != 0 || thread_position_in_grid.x * N_Pack != 0) return;

    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);

    simdgroup_store(sgMatY, Dst);

}

It's a simple shader; however, the output buffer only receives the first 2 values from the input buffer. The other 62 values are all zero.

Here is the result from Xcode Metal Capture
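
One plausible explanation, which the post itself does not confirm: simdgroup_load and simdgroup_store are cooperative operations, so the 8x8 matrix is spread across all 32 threads of a SIMD-group, with 64 / 32 = 2 elements per thread. If every thread except one returns early, only that thread's 2 elements would be moved, which matches the observed output. A minimal sketch under that assumption follows. The kernel name copy_tile_8x8 is hypothetical, and it assumes the host dispatches exactly one threadgroup of 32 threads and that both buffers hold at least 64 half values.

#include <metal_stdlib>
#include <metal_simdgroup_matrix>
using namespace metal;

// Hypothetical kernel: copies one 8x8 tile of half values from Src to Dst.
// Assumes a dispatch of a single threadgroup containing 32 threads.
kernel void copy_tile_8x8(
const device half * Src                 [[ buffer(0) ]],
device half * Dst                       [[ buffer(2) ]],
ushort simdgroup_index                  [[ simdgroup_index_in_threadgroup ]])
{
    // Guard on the SIMD-group index rather than the thread index.
    // The value is the same for every thread in a SIMD-group, so either
    // all 32 threads continue or all of them return together.
    if (simdgroup_index != 0) return;

    simdgroup_half8x8 sgMatY;

    // Every thread of the SIMD-group executes the load and the store.
    // The third argument is the row stride in elements (8 for a packed tile).
    simdgroup_load(sgMatY, Src, 8);
    simdgroup_store(sgMatY, Dst, 8);
}

If this assumption holds, Dst should receive all 64 values. To limit the work to a single tile, restrict the dispatch size or guard on a value that is uniform across the SIMD-group, instead of returning per thread.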