Using SSE to speed up computation - store, load, and alignment
In my project I have implemented a basic class, CVector.
This class holds a float* pointer to a raw floating-point array.
The array is allocated dynamically using the standard malloc() function.
Now I have to speed up some computations on these vectors. Unfortunately, because the memory isn't allocated using _mm_malloc(), it is not guaranteed to be 16-byte aligned.
As I understand it, I have two options:
1) Rewrite the code that allocates memory to use _mm_malloc(), and then use code like this:
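    #include <xmmintrin.h>

    // Requires v1, v2 and v3 to be 16-byte aligned.
    // Processes size/4 full groups of 4 floats; any remainder is left untouched.
    void sub(float* v1, float* v2, float* v3, int size)
    {
        __m128* p_v1 = (__m128*)v1;
        __m128* p_v2 = (__m128*)v2;
        __m128 res;
        for(int i = 0; i < size/4; ++i)
        {
            res = _mm_sub_ps(*p_v1, *p_v2);  // dereferencing a __m128* is an aligned load
            _mm_store_ps(v3, res);           // aligned store
            ++p_v1;
            ++p_v2;
            v3 += 4;
        }
    }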
2) Use the _mm_loadu_ps() intrinsic to load each __m128 from unaligned memory and then use it for the computation:
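    #include <xmmintrin.h>

    // Works for any alignment of v1, v2 and v3.
    // Processes size/4 full groups of 4 floats; any remainder is left untouched.
    void sub(float* v1, float* v2, float* v3, int size)
    {
        __m128 p_v1;
        __m128 p_v2;
        __m128 res;
        for(int i = 0; i < size/4; ++i)
        {
            p_v1 = _mm_loadu_ps(v1);         // unaligned load
            p_v2 = _mm_loadu_ps(v2);         // unaligned load
            res = _mm_sub_ps(p_v1, p_v2);
            _mm_storeu_ps(v3, res);          // unaligned store: _mm_store_ps would need v3 aligned
            v1 += 4;
            v2 += 4;
            v3 += 4;
        }
    }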
So my question is: which option is better, or faster?
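The two options do not have to be mutually exclusive. As a rough sketch only (the names sub_any and is_aligned16 are made up for illustration and are not part of CVector), a single routine could check the pointers at run time, take the aligned path when all three are 16-byte aligned, fall back to unaligned loads and stores otherwise, and finish any leftover elements with plain scalar code:

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_loadu_ps, _mm_sub_ps, ...
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
static bool is_aligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical combined routine: v3[i] = v1[i] - v2[i] for i in [0, size).
void sub_any(const float* v1, const float* v2, float* v3, int size)
{
    assert(size >= 0);
    const int vec_end = size - (size % 4);  // last index covered by full 4-float groups
    int i = 0;

    if (is_aligned16(v1) && is_aligned16(v2) && is_aligned16(v3)) {
        // Option 1: every pointer is aligned, so aligned loads/stores are safe.
        for (; i < vec_end; i += 4)
            _mm_store_ps(v3 + i, _mm_sub_ps(_mm_load_ps(v1 + i), _mm_load_ps(v2 + i)));
    } else {
        // Option 2: at least one pointer is unaligned, so use the unaligned forms.
        for (; i < vec_end; i += 4)
            _mm_storeu_ps(v3 + i, _mm_sub_ps(_mm_loadu_ps(v1 + i), _mm_loadu_ps(v2 + i)));
    }

    // Scalar tail for the last size % 4 elements.
    for (; i < size; ++i)
        v3[i] = v1[i] - v2[i];
}
```

This says nothing about which path is faster on a given CPU; that still has to be measured on the target hardware.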
Comments (2)
Reading unaligned SSE values is extraordinarily expensive. Check the Intel manuals, volume 4, chapter 2.2.5.1. The core type makes a difference: the i7 has extra hardware to make it less costly, but reading a value that straddles a CPU cache-line boundary is still 4.5 times slower than reading an aligned value. On previous architectures it is ten times slower.
That's a massive penalty, so get the memory aligned to avoid the perf hit. I've never heard of _mm_malloc; use _aligned_malloc() from the Microsoft CRT to get properly aligned memory from the heap, and release it with _aligned_free().
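To make the allocation side concrete, here is a minimal sketch of a 16-byte-aligned float buffer. The struct name AlignedFloats is hypothetical and only stands in for whatever CVector does internally. It uses the _mm_malloc()/_mm_free() pair named in the question, on the assumption that the compiler exposes them through <xmmintrin.h>; with the Microsoft CRT, _aligned_malloc()/_aligned_free() would take their place.

```cpp
#include <xmmintrin.h>  // assumed to provide _mm_malloc / _mm_free on an x86 target
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the storage inside CVector.
struct AlignedFloats {
    float* data;
    std::size_t count;

    // Allocate n floats on a 16-byte boundary, as aligned SSE loads/stores require.
    explicit AlignedFloats(std::size_t n)
        : data(static_cast<float*>(_mm_malloc(n * sizeof(float), 16))), count(n)
    {
        assert(data != nullptr || n == 0);  // debug-only check for allocation failure
    }

    // Memory from _mm_malloc must go back through _mm_free, not free().
    ~AlignedFloats() { _mm_free(data); }

    // Non-copyable, so the buffer is never freed twice.
    AlignedFloats(const AlignedFloats&) = delete;
    AlignedFloats& operator=(const AlignedFloats&) = delete;
};
```

A buffer obtained this way can be passed to code that uses _mm_load_ps() and _mm_store_ps(), as long as every access starts at a multiple of four floats from the beginning of the buffer.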
Take a look at Bullet Physics. It has been used in a handful of movies and well-known games (GTA4 and others). You can either study its highly optimized vector, matrix, and other math classes, or just use them directly. It is published under the zlib license, so you can use it as you wish. Don't reinvent the wheel: Bullet, NVIDIA PhysX, Havok, and other physics libraries are well tested and optimized by really smart people.