Clang built-in matrix and vector extensions: efficient matrix-vector multiplication

Posted 2025-02-12 19:12:54


I am writing a small 3D graphics app to learn about Clang's vector and matrix extensions (the matrix types still seem to be under development, if I read the right version of the docs).

I am unsure how to write the most efficient code for a matrix-vector multiplication using these types. Using:

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float m4x4 __attribute__((matrix_type(4, 4)));

The doc says (regarding the indices to access the elements of a matrix):

The first specifies the number of rows, and the second specifies the number of columns.

     Column
        |
        v
Row->| M00 M01 M02 M03 |
     | M10 M11 M12 M13 |
     | M20 M21 M22 X23 |
     | M30 M31 M32 M33 |

So I understand that m[2][3] (where m is an m4x4) gives me the element marked X in the matrix above.

Then (regarding the way the elements are laid out in memory):

The elements of a value of a matrix type are laid out in column-major order without padding.

From this note I gather that, if I could look at how the elements are stored in memory, I would see:

M00 M10 M20 M30 - M01 M11 M21 M31 - M02 M12 M22 M32 - M03 M13 X23 M33
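
One way to sanity-check that layout is to compute the flat offset by hand. The sketch below is plain C and does not use the Clang matrix type at all; `col_major_index` and `at4x4` are hypothetical helper names introduced here for illustration. It assumes only what the documentation quote states: column-major order, no padding.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Clang): flat offset of element (row, col)
   in a column-major matrix with `rows` rows and no padding. */
static size_t col_major_index(size_t row, size_t col, size_t rows) {
    return col * rows + row;
}

/* Reads element (row, col) from a flat column-major 4x4 buffer. */
static float at4x4(const float *flat, size_t row, size_t col) {
    return flat[col_major_index(row, col, 4)];
}
```

With this indexing, the element marked X (row 2, column 3) sits at offset 3 * 4 + 2 = 14, which matches its position in the listing above.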

Do I get it right so far?

Does the order in which we access the elements of the matrix matter? (And am I doing it right?)

I assume that, to be efficient in my mat-float4 multiplication, I need to access the elements in the order they are laid out in memory, like this:

m4x4 m;
float4 v = {0.2, 0.3, 0.4, 1};
float4 res = {
    v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
    v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
    v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
    1 // ignore w element for now
};

Of course it's up to me to load the right values in m[0][0], m[0][1], ... using something like __builtin_matrix_column_major_load.

Am I over-complicating things, or does the order matter here? Is the expression above actually better than:

float4 res = {
    v.x * m[0][0] + v.y * m[0][1] + v.z * m[0][2] + v.w * m[0][3],
    v.x * m[1][0] + v.y * m[1][1] + v.z * m[1][2] + v.w * m[1][3],
    v.x * m[2][0] + v.y * m[2][1] + v.z * m[2][2] + v.w * m[2][3],
    1 // ignore w element for now
};

(assuming I have transposed the elements before calling something like __builtin_matrix_column_major_load)?
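
To make the relationship between the two expressions explicit, here they are in index notation, writing M for the matrix held in m and r for res. This is only a restatement of the code above, not something taken from the Clang documentation.

```latex
\[
\text{first form:}\quad r_j = \sum_{i=0}^{3} v_i \, M_{ij} = (M^{\mathsf{T}} v)_j ,
\qquad
\text{second form:}\quad r_i = \sum_{j=0}^{3} M_{ij} \, v_j = (M v)_i ,
\qquad i, j \in \{0, 1, 2\}.
\]
```

So the two forms give the same result only when the matrix used in the second is the transpose of the matrix used in the first. With column-major storage, element M_ij lives at offset 4j + i, so each sum in the first form reads four adjacent floats, while each sum in the second form reads floats that are four elements apart.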

Is there a better way of doing it?

I understand these types are still being developed. Still, as I understand it, the whole point of these types is to take advantage of SIMD instructions. If I do:

float4 a = {...};
float4 b = {...};
float4 c = a + b;

Then does adding the 4 floats of a to the respective 4 floats of b happen in a single cycle? As for the mat-float4 multiplication, because I access the elements of the float4 and m4x4 individually in my code, it seems that I wouldn't be taking advantage of any such optimization in this particular case?

So my second question: is there a better way of doing this?

Should I keep the matrix vectors in 4 float4 and do float4 * float4 multiplications instead?
I saw the post "Matrix-Vector and Matrix-Matrix multiplication using SSE", which gives an example of how to implement mat-vector multiplication using SIMD instructions.
It seems to pack the elements of the matrix into __m128 values and then multiply them by the vector's elements using SIMD intrinsics such as _mm_add_ps and _mm_mul_ps.
Should I just wait for this development to be more mature?
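
The idea of keeping the matrix as four float4 values can be sketched as follows, under the assumption that those four values are the columns of the matrix. `mat4_cols` and `mul_cols` are hypothetical names introduced here, and the `vector_size` fallback is an addition so the sketch also builds with GCC. The product M * v is written as a linear combination of the columns, so every operation is a whole-vector multiply or add rather than a per-element access.

```c
/* float4 as in the question; the vector_size fallback is an assumption added
   so the sketch also builds with GCC. */
#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Hypothetical type: a 4x4 matrix stored as four column vectors. */
typedef struct {
    float4 col[4];
} mat4_cols;

/* Computes M * v as a linear combination of the columns of M:
   each term scales one whole column by one component of v. */
static float4 mul_cols(const mat4_cols *m, float4 v) {
    return m->col[0] * v[0] + m->col[1] * v[1]
         + m->col[2] * v[2] + m->col[3] * v[3];
}
```

Whether this actually compiles down to packed SIMD multiplies and adds depends on the target and optimization flags, so it is worth inspecting the generated assembly rather than assuming it.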

Any feedback or advice would be greatly appreciated. I am doing this as an exercise to learn about these new built-in types.


冰之心 2025-02-19 19:12:54


In case anyone finds this now:

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float float4x4 __attribute__((matrix_type(4, 4)));

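// Requires Clang with -fenable-matrix.
float4 mulmv4(float4x4 mat, float4 vec) {
    typedef float float4x1 __attribute__((matrix_type(4, 1)));
    float4 dst;
    // Load the vector as a 4x1 column matrix (4 rows, 1 column, column stride 4).
    float4x1 col = __builtin_matrix_column_major_load((float *)&vec, 4, 1, 4);
    // mat * col is a 4x1 matrix; store it back into a float4.
    __builtin_matrix_column_major_store(mat * col, (float *)&dst, 4);
    return dst;
}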

Load the vector as a column "matrix" and the product is defined. This really should be built-in, although, like you said, Clang matrix_types are WIP.

BTW: You can apply the same concept to the dot product of ext_vector_types since (AFAIK) that isn't built-in either. Dot would be multiplying a float1x4 by a float4x1 (in that order).
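
A minimal sketch of that dot-product idea, assuming the arguments are plain arrays of four floats. `dot4` and the `HAVE_MATRIX_TYPES` macro are hypothetical names introduced here. The matrix path is only compiled when Clang reports the matrix extension (that is, with -fenable-matrix); the scalar loop is a fallback added so the sketch builds elsewhere.

```c
/* Detect Clang's matrix extension (enabled with -fenable-matrix). */
#if defined(__has_extension)
#if __has_extension(matrix_types)
#define HAVE_MATRIX_TYPES 1
#endif
#endif

#ifdef HAVE_MATRIX_TYPES
typedef float float1x4 __attribute__((matrix_type(1, 4)));
typedef float float4x1 __attribute__((matrix_type(4, 1)));
typedef float float1x1 __attribute__((matrix_type(1, 1)));

/* Dot product as (1x4 row) * (4x1 column), which yields a 1x1 matrix. */
static float dot4(const float *a, const float *b) {
    /* The casts only drop const; the loads never write through the pointers. */
    float1x4 row = __builtin_matrix_column_major_load((float *)a, 1, 4, 1);
    float4x1 col = __builtin_matrix_column_major_load((float *)b, 4, 1, 4);
    float1x1 prod = row * col;
    return prod[0][0];
}
#else
/* Fallback (an assumption of this sketch): plain scalar dot product. */
static float dot4(const float *a, const float *b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += a[i] * b[i];
    return sum;
}
#endif
```

For example, calling dot4 on {1, 2, 3, 4} and {5, 6, 7, 8} computes 5 + 12 + 21 + 32.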
