Is there a good double-precision small-matrix SIMD library for x86?
I'm looking for a SIMD library focused on small (4x4) matrix operations for graphics. There are lots of single-precision ones out there, but I need to support both single and double precision.
I've looked at Intel's IPP MX library, but I'd prefer something with source. I'm very interested in SSE3+ implementations of these particular operations:
- Mat4 * Mat4
- Mat4 * Vec4
- Mat4 * Array of Mat4
- Mat4 * Array of Vec4
- Mat4 inversion (nice to have)
EDIT: No "premature optimization" answers please. Anyone who has worked with small matrices knows GCC does not vectorize these as well as hand-optimized intrinsics or ASM. And in this case it's important, or I wouldn't be asking.
Comments (5)
Maybe the Eigen library?
It supports the SSE 2/3/4, ARM NEON and AltiVec instruction sets.
Eigen supports fixed-size matrices. Small fixed-size matrices can be allocated on the stack for better performance. 4x4 is a good fit for SSE, since SSE vectors are 128 bits wide: a row or a column of 4 double-precision numbers fits evenly into two 128-bit SSE vectors. This makes a SIMD implementation easy.
Another option is to code it yourself. Since your matrices are small and fit into L1 cache, you don't have to bother with the memory tiling needed for large matrices. You could use AVX for even better performance. Newer versions of GCC and Visual C++ 2010 support AVX intrinsics. An AVX vector is 256 bits wide, so it holds exactly 4 double-precision numbers.
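To illustrate the "two 128-bit vectors per column" layout described above, here is a minimal sketch of a double-precision Mat4 * Vec4 using SSE2 intrinsics. The `Mat4d` and `Vec4d` types, the `mul` and `mul_array` names, and the column-major storage are assumptions made for this example, not part of any library mentioned in this thread. It only needs SSE2 (no SSE3+ instructions), uses unaligned loads so no alignment is required, and falls back to scalar code when SSE2 is not available.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 double-precision intrinsics
#define MAT4D_USE_SSE2 1
#endif

// Hypothetical types for illustration only.
// Column-major storage: element (row, col) lives at m[col * 4 + row].
struct Mat4d { double m[16]; };
struct Vec4d { double v[4]; };

// r = M * x, computed as a linear combination of the columns of M.
inline Vec4d mul(const Mat4d& M, const Vec4d& x) {
    Vec4d r;
#ifdef MAT4D_USE_SSE2
    __m128d lo = _mm_setzero_pd();  // accumulates rows 0..1
    __m128d hi = _mm_setzero_pd();  // accumulates rows 2..3
    for (int c = 0; c < 4; ++c) {
        const __m128d s = _mm_set1_pd(x.v[c]);  // broadcast x[c]
        // Each column is 4 doubles, i.e. two 128-bit halves.
        lo = _mm_add_pd(lo, _mm_mul_pd(_mm_loadu_pd(&M.m[c * 4]), s));
        hi = _mm_add_pd(hi, _mm_mul_pd(_mm_loadu_pd(&M.m[c * 4 + 2]), s));
    }
    _mm_storeu_pd(&r.v[0], lo);
    _mm_storeu_pd(&r.v[2], hi);
#else
    // Scalar fallback with the same column-major convention.
    for (int row = 0; row < 4; ++row) {
        r.v[row] = 0.0;
        for (int c = 0; c < 4; ++c) r.v[row] += M.m[c * 4 + row] * x.v[c];
    }
#endif
    return r;
}

// Mat4 * array of Vec4: applies the same matrix to n vectors.
inline void mul_array(const Mat4d& M, const Vec4d* in, Vec4d* out,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = mul(M, in[i]);
}
```

With AVX the `lo`/`hi` pair would collapse into a single 256-bit register per column. That variant is not shown here because it needs AVX to be enabled at compile time and available on the CPU at run time.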
Not fully complete yet, but I wanted to pitch my own library - glsl-sse2.
There's a 4x4 AVX implementation here. It's written as an example application but I'm sure it wouldn't be too hard for anyone to extract the interesting parts into a shared library. Thought I'd post this despite the age of the original question for anyone alighting here in the future.
If you're using a modern compiler, you probably don't need to bother. Automatic vectorization in most compilers should be able to easily transform for loops with fixed bounds into SIMD code. GCC has had this for quite a while, and it is one of the main selling points of Intel's compiler (though you should be careful about using Intel's compiler if you might want to run on AMD chips).
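For reference, this is the kind of fixed-bound loop nest that answer has in mind: a minimal sketch of a double-precision Mat4 * Mat4. The `mat4_mul` name and the column-major layout are assumptions for illustration. Whether a given compiler actually vectorizes it depends on the compiler version and flags, and the question's author reports that hand-written intrinsics still do better. With GCC, building with `-O3 -fopt-info-vec` reports which loops were vectorized.

```cpp
// Hypothetical helper for illustration only.
// Column-major storage: element (row, col) lives at a[col * 4 + row].
// Computes out = A * B. `out` must not alias `a` or `b`.
void mat4_mul(const double* a, const double* b, double* out) {
    for (int col = 0; col < 4; ++col) {
        double acc[4] = {0.0, 0.0, 0.0, 0.0};
        for (int k = 0; k < 4; ++k) {
            const double s = b[col * 4 + k];  // B(k, col)
            // The innermost loop walks 4 contiguous doubles of column k
            // of A, which is the access pattern a vectorizer looks for.
            for (int row = 0; row < 4; ++row) {
                acc[row] += a[k * 4 + row] * s;
            }
        }
        for (int row = 0; row < 4; ++row) {
            out[col * 4 + row] = acc[row];
        }
    }
}
```

Accumulating into the local `acc` array instead of writing to `out` inside the inner loop keeps the compiler from having to reason about stores to `out` on every iteration.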