Clang built-in matrix and vector extensions: efficient matrix-vector multiplication

Posted on 2025-02-12 19:12:54

I am writing a small 3D graphics app to learn about Clang's vector and matrix extensions (matrices still seem to be under development, if I read the right version of the docs).

I am unsure how to write the most efficient code for a matrix-vector multiplication using these types. Using:

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float m4x4 __attribute__((matrix_type(4, 4)));

The doc says (regarding the indices to access the elements of a matrix):

The first specifies the number of rows, and the second specifies the number of columns.

     Column
        |
        v
Row->| M00 M01 M02 M03 |
     | M10 M11 M12 M13 |
     | M20 M21 M22 X23 |
     | M30 M31 M32 M33 |

So I understand that m[2][3] (where m is an m4x4) gives me the element marked X in the matrix above.

Then (regarding the way the elements are laid out in memory):

The elements of a value of a matrix type are laid out in column-major order without padding.

So I take it from this note that, if I could look at how the elements are stored in memory, I would see:

M00 M10 M20 M30 - M01 M11 M21 M31 - M02 M12 M22 M32 - M03 M13 X23 M33 

Do I get it right so far?
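
For reference, the layout quoted above can be checked with a small sketch. The helper names `column_major_offset` and `element_via_matrix_type` are hypothetical, and the second function is only compiled when the compiler reports the matrix extension (with Clang this needs `-fenable-matrix`; the `__has_extension(matrix_types)` check is an assumption about how your Clang version reports it).

```c
#include <stddef.h>

/* Offset of element (row, col) in a column-major matrix with `rows` rows
   and no padding: each column is stored contiguously. */
size_t column_major_offset(size_t row, size_t col, size_t rows) {
    return col * rows + row;
}

#ifndef __has_extension
#define __has_extension(x) 0 /* compilers without the feature-test macro */
#endif

#if __has_extension(matrix_types)
typedef float m4x4 __attribute__((matrix_type(4, 4)));

/* Loads 16 contiguous floats as a 4x4 matrix (stride 4 = no padding)
   and reads one element through the extension's m[row][col] syntax. */
float element_via_matrix_type(float *data, int row, int col) {
    m4x4 m = __builtin_matrix_column_major_load(data, 4, 4, 4);
    return m[row][col];
}
#endif
```

Under these assumptions, m[2][3] (the element marked X) sits at offset 3 * 4 + 2 = 14, which matches its position in the flattened listing above.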

Does the order in which we access the elements of the matrix matter? (and am I doing it right?)

Then I assume that, to be efficient in my mat-float4 multiplication, I'd need to access the elements in the order they are laid out in memory, like so:

m4x4 m; // assumed to be filled in beforehand
float4 v = {0.2, 0.3, 0.4, 1};
float4 res = {
    v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
    v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
    v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
    1 // ignore w element for now
};

Of course it's up to me to load the right values in m[0][0], m[0][1], ... using something like __builtin_matrix_column_major_load.

Am I over-complicating things, or does the order matter here? Is the code above actually better than:

float4 res = {
    v.x * m[0][0] + v.y * m[0][1] + v.z * m[0][2] + v.w * m[0][3],
    v.x * m[1][0] + v.y * m[1][1] + v.z * m[1][2] + v.w * m[1][3],
    v.x * m[2][0] + v.y * m[2][1] + v.z * m[2][2] + v.w * m[2][3],
    1 // ignore w element for now
};

(assuming I have transposed the elements before calling __builtin_matrix_column_major_load)?

Is there a better way of doing it?

Now, I understand these types are still being developed. Yet I also understand that the whole point of these types is to take advantage of SIMD instructions. If I do:

float4 a = {...};
float4 b = {...};
float4 c = a + b;

Then does adding the 4 floats of a to the respective 4 floats of b happen in a single cycle? As for the mat-float4 multiplication, because I access the elements of the float4 and m4x4 individually in my code, it seems that I wouldn't be taking advantage of any such optimization in this particular case?

So my second question: is there a better way of doing this?

  • Should I keep the matrix vectors in 4 float4 and do float4 * float4 multiplications instead?
  • I saw the post "Matrix-Vector and Matrix-Matrix multiplication using SSE", which gives an example of how to achieve mat-vector multiplication using SIMD instructions.
    It seems to pack the elements of the matrix into __m128 values and multiply them by the vector's elements using SIMD intrinsics such as _mm_add_ps and _mm_mul_ps.
  • Should I just wait for this development to be more mature?
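
The first option in the list above (keeping the matrix as four float4 columns) can be sketched as follows. This is a minimal illustration, not a claim about what is fastest: the `mat4cols` struct and `mul_cols` function are hypothetical names, and the `vector_size` typedef is only a stand-in so the sketch also builds with GCC. The idea is that M * v equals the sum of each column scaled by the matching component of v, so every operation works on a whole float4 rather than on individual elements.

```c
#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
typedef float float4 __attribute__((vector_size(16))); /* GCC stand-in (assumption) */
#endif

/* Hypothetical container: the four columns of a 4x4 matrix, one float4 each. */
typedef struct {
    float4 col[4];
} mat4cols;

/* M * v as a linear combination of columns:
   result = col0 * v[0] + col1 * v[1] + col2 * v[2] + col3 * v[3].
   Each scalar is broadcast across the vector by the compiler. */
float4 mul_cols(const mat4cols *m, float4 v) {
    return m->col[0] * v[0] + m->col[1] * v[1]
         + m->col[2] * v[2] + m->col[3] * v[3];
}
```

Whether this compiles down to better code than the element-by-element version depends on the compiler and flags, so it is worth comparing the generated assembly for both.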

Any feedback or advice would be greatly appreciated. I am doing this as an exercise to learn about these new built-in types.

Comments (1)

冰之心 2025-02-19 19:12:54

In case anyone finds this now:

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float float4x4 __attribute__((matrix_type(4, 4)));

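// Build with clang -fenable-matrix.
float4 mulmv4(float4x4 mat, float4 vec) {
    typedef float float4x1 __attribute__((matrix_type(4, 1)));
    float4 dst;
    // Reinterpret the vector as a 4x1 matrix (4 rows, 1 column, stride 4).
    float4x1 col = __builtin_matrix_column_major_load((float *)&vec, 4, 1, 4);
    // Multiply and write the 4x1 result back into a float4.
    __builtin_matrix_column_major_store(mat * col, (float *)&dst, 4);
    return dst;
}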

Cast to a column "matrix" and the product is defined. This really should be built-in, although, like you said, Clang matrix_types are WIP.

BTW: You can apply the same concept to the dot product of ext_vector_types since (AFAIK) that isn't built-in either. Dot would be multiplying a float1x4 by a float4x1 (in that order).
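
A rough sketch of that dot-product idea follows, assuming the same load builtin as above. The function name `dot4` and the `float1x1` typedef are hypothetical, and the `__has_extension(matrix_types)` check is an assumption about how Clang reports the feature under `-fenable-matrix`. When the extension is not available, the sketch falls back to an element-wise product and a scalar sum so that it still builds.

```c
#ifndef __has_extension
#define __has_extension(x) 0 /* compilers without the feature-test macro */
#endif

#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
typedef float float4 __attribute__((vector_size(16))); /* GCC stand-in (assumption) */
#endif

#if __has_extension(matrix_types)
typedef float float1x4 __attribute__((matrix_type(1, 4)));
typedef float float4x1 __attribute__((matrix_type(4, 1)));
typedef float float1x1 __attribute__((matrix_type(1, 1)));

/* Dot product as (1x4 row) * (4x1 column), which yields a 1x1 matrix. */
float dot4(float4 a, float4 b) {
    /* A 1x4 matrix has one row, so consecutive columns are 1 float apart. */
    float1x4 row = __builtin_matrix_column_major_load((float *)&a, 1, 4, 1);
    float4x1 col = __builtin_matrix_column_major_load((float *)&b, 4, 1, 4);
    float1x1 d = row * col;
    return d[0][0];
}
#else
/* Fallback without the matrix extension: multiply lanes, then sum them. */
float dot4(float4 a, float4 b) {
    float4 p = a * b;
    return p[0] + p[1] + p[2] + p[3];
}
#endif
```

The order matters: a float4x1 times a float1x4 would instead produce a 4x4 outer product.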
