Speeding up computation with SSE - store, load and alignment

Posted on 2024-10-19 00:10:00

In my project I have implemented a basic class, CVector.
The class contains a float* pointer to a raw floating-point array.
This array is allocated dynamically using the standard malloc() function.

Now I need to speed up some computations that use such vectors. Unfortunately, because the memory isn't allocated with _mm_malloc(), it is not guaranteed to be 16-byte aligned.
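
Whether a particular malloc() pointer happens to be 16-byte aligned can be checked at run time. A minimal sketch; the helper name is_aligned16 is made up for illustration and is not part of the project:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary,
   which is what _mm_load_ps() and _mm_store_ps() require. */
static int is_aligned16(const void* p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

Many 64-bit C libraries do return 16-byte-aligned blocks from malloc(), but the C standard only promises alignment suitable for the standard types, so a check like this (or an aligned allocator) is the safer assumption.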

As I understand it, I have two options:

1) Rewrite the code that allocates the memory so that it uses _mm_malloc(), and then use code like this, for example:

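#include <xmmintrin.h>  // SSE intrinsics

// Requires v1, v2 and v3 to be 16-byte aligned (e.g. allocated with _mm_malloc()).
// Processes size/4 groups of four floats; any remaining size % 4 elements are left untouched.
void sub(float* v1, float* v2, float* v3, int size)
{
    __m128* p_v1 = (__m128*)v1;
    __m128* p_v2 = (__m128*)v2;
    __m128 res;

    for(int i = 0; i < size/4; ++i)
    {
        res = _mm_sub_ps(*p_v1,*p_v2);  // aligned loads through the __m128 pointers
        _mm_store_ps(v3,res);           // aligned store of four floats
        ++p_v1;
        ++p_v2;
        v3 += 4;
    }
}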
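
The allocation side of option 1 is not shown above. As a rough sketch of what it could look like, assuming a plain struct as a stand-in for CVector (FloatVec, vec_alloc and vec_free are hypothetical names, not the project's actual code):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* on GCC and Clang this also declares _mm_malloc / _mm_free */

/* Hypothetical stand-in for the storage owned by CVector. */
typedef struct {
    float* data;
    int    size;
} FloatVec;

/* Allocates n floats on a 16-byte boundary; returns 1 on success, 0 on failure. */
static int vec_alloc(FloatVec* v, int n)
{
    v->data = (float*)_mm_malloc((size_t)n * sizeof(float), 16);
    v->size = (v->data != NULL) ? n : 0;
    return v->data != NULL;
}

/* Memory obtained from _mm_malloc() must be released with _mm_free(), not free(). */
static void vec_free(FloatVec* v)
{
    _mm_free(v->data);
    v->data = NULL;
    v->size = 0;
}
```

The main cost of this option is that every place that allocates, reallocates or frees the array has to switch to the matching _mm_malloc()/_mm_free() pair.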

2) Use the _mm_loadu_ps() intrinsic to load each __m128 from unaligned memory and then use it for the computation:

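#include <xmmintrin.h>  // SSE intrinsics

// Works for any alignment of v1, v2 and v3.
// Processes size/4 groups of four floats; any remaining size % 4 elements are left untouched.
void sub(float* v1, float* v2, float* v3, int size)
{
    __m128 p_v1;
    __m128 p_v2;
    __m128 res;

    for(int i = 0; i < size/4; ++i)
    {
        p_v1 = _mm_loadu_ps(v1);      // unaligned load
        p_v2 = _mm_loadu_ps(v2);      // unaligned load
        res = _mm_sub_ps(p_v1,p_v2);
        _mm_storeu_ps(v3,res);        // unaligned store; _mm_store_ps() requires a 16-byte-aligned v3
        v1 += 4;
        v2 += 4;
        v3 += 4;
    }
}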

So my question is: which option is better or faster?
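
One way to approach this is to time both variants on the target machine. The following is only a rough sketch under stated assumptions: sub_aligned, sub_unaligned, time_sub and compare_variants are names invented here, clock() is a coarse timer, and n is expected to be a multiple of 4 with reps at least 1.

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>
#include <xmmintrin.h>

/* Aligned variant: every pointer must be 16-byte aligned. */
static void sub_aligned(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i + 4 <= n; i += 4)
        _mm_store_ps(out + i, _mm_sub_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
}

/* Unaligned variant: works for any pointer alignment. */
static void sub_unaligned(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
}

typedef void (*sub_fn)(const float*, const float*, float*, int);

/* Returns the CPU time in seconds spent on reps calls of fn. */
static double time_sub(sub_fn fn, const float* a, const float* b,
                       float* out, int n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        fn(a, b, out, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Runs both variants on the same aligned buffers and prints the timings.
   Returns 1 if the outputs match, 0 if they differ, -1 if allocation failed. */
static int compare_variants(int n, int reps)
{
    size_t bytes = (size_t)n * sizeof(float);
    float* a  = (float*)_mm_malloc(bytes, 16);
    float* b  = (float*)_mm_malloc(bytes, 16);
    float* o1 = (float*)_mm_malloc(bytes, 16);
    float* o2 = (float*)_mm_malloc(bytes, 16);
    int same = -1;

    if (a && b && o1 && o2) {
        for (int i = 0; i < n; ++i) {
            a[i] = (float)i;
            b[i] = 0.5f * (float)i;
        }
        printf("aligned:   %.6f s\n", time_sub(sub_aligned,   a, b, o1, n, reps));
        printf("unaligned: %.6f s\n", time_sub(sub_unaligned, a, b, o2, n, reps));

        same = 1;
        for (int i = 0; i < n - n % 4; ++i)
            if (o1[i] != o2[i])
                same = 0;
    }

    _mm_free(a);
    _mm_free(b);
    _mm_free(o1);
    _mm_free(o2);
    return same;
}
```

Calling, say, compare_variants(1 << 20, 100) from main() prints one timing per variant. Both variants run on aligned buffers here, so this measures the cost of the unaligned instructions themselves; to measure genuinely misaligned data, sub_unaligned could be given pointers offset by one float. An optimizing compiler may also inline or rearrange the loops, so the numbers should be read as indicative only.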
