Efficiently multiplying a large array of complex numbers by scalars in C++

I'm currently trying to most efficiently do an in-place multiplication of an array of complex numbers (memory aligned the same way the std::complex would be but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.

The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays in the 100s of millions - so, it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.

I'm thinking, however, that SIMD might yield much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.

void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (register int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}

fcomplex is defined as follows:

struct fcomplex
{
    float real;
    float imag;
};

I've tried manually unrolling the loop, as my final loop count will always be a power of 2, but the compiler is already doing that for me (I've unrolled as far as 32). I've tried a const float reference to the scalar - thinking I'd save one access - and that proved to be equal to what the compiler was already doing. I've tried the STL's std::transform, which gave close results, but still worse. I've also tried casting to std::complex and letting it use the overloaded operator for scalar * complex for the multiplication, but this ultimately produced the same results.

So, anyone with any ideas? Your time in considering this is much appreciated! The target platform is Windows and I'm using Visual Studio 2008. The product also cannot contain GPL code! Thanks so much.

4 Answers

眼眸里的那抹悲凉 2024-12-03 17:56:16

You can do this fairly easily with SSE, e.g.

void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    // Requires <xmmintrin.h> for the SSE intrinsics.
    for (int idx = start; idx < end; idx += 2)
    {
        // Load two complex values: { re0, im0, re1, im1 }.
        __m128 vc = _mm_load_ps((float *)&values[idx]);
        // Build { s0, s0, s1, s1 } (note that _mm_set_ps takes its arguments high-to-low).
        __m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
        // Scale both components of each complex value and store back in place.
        vc = _mm_mul_ps(vc, vk);
        _mm_store_ps((float *)&values[idx], vc);
    }
}

Note that values and scalar need to be 16 byte aligned.
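
If the buffers are heap-allocated, here is a minimal sketch of getting that 16-byte alignment with MSVC's `_aligned_malloc` (the `count` variable and the lack of error handling are assumptions for illustration, not part of the original answer):

#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC CRT)

// Allocate 16-byte-aligned buffers so _mm_load_ps / _mm_store_ps can be used safely.
fcomplex *values = static_cast<fcomplex *>(_aligned_malloc(count * sizeof(fcomplex), 16));
float *scalar = static_cast<float *>(_aligned_malloc(count * sizeof(float), 16));

// ... fill the buffers and run cmult_scalar_inplace(values, 0, count, scalar) ...

_aligned_free(scalar);
_aligned_free(values);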

Or you could just use the Intel ICC compiler and let it do the hard work for you.


UPDATE

Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors:

void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 4)
    {
        // Load four complex values (two per register): { re0, im0, re1, im1 } and { re2, im2, re3, im3 }.
        __m128 vc0 = _mm_load_ps((float *)&values[idx]);
        __m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
        // One aligned load picks up four scalars: { s0, s1, s2, s3 }.
        __m128 vk = _mm_load_ps(&scalar[idx]);
        // Unpack to { s0, s0, s1, s1 } and { s2, s2, s3, s3 }.
        __m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);
        __m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);
        vc0 = _mm_mul_ps(vc0, vk0);
        vc1 = _mm_mul_ps(vc1, vk1);
        _mm_store_ps((float *)&values[idx], vc0);
        _mm_store_ps((float *)&values[idx + 2], vc1);
    }
}

尘世孤行 2024-12-03 17:56:16

Your best bet will be to use an optimised BLAS which will take advantage of whatever is available on your target platform.

画离情绘悲伤 2024-12-03 17:56:16

One problem I see is that inside the function it's hard for the compiler to prove that the scalar pointer doesn't point into the middle of the complex array (in theory, scalar could alias the real or imaginary part of one of the complex values).
This effectively forces the order of evaluation.
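
One way to hand the compiler that non-aliasing guarantee is to qualify the pointers with MSVC's non-standard `__restrict` keyword (available in the asker's Visual Studio 2008); this is a sketch of the idea, not a guaranteed speedup:

// __restrict promises that values and scalar never alias, so the stores to
// values[idx] no longer have to be ordered before the next load from scalar[].
void cmult_scalar_inplace(fcomplex * __restrict values, const int start, const int end,
                          const float * __restrict scalar)
{
    for (int idx = start; idx < end; ++idx)
    {
        const float s = scalar[idx];
        values[idx].real *= s;
        values[idx].imag *= s;
    }
}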

Another problem I see is that the computation here is so simple that other factors will dominate the raw speed; if you really care about performance, the only solution in my opinion is to implement several variations and test them at runtime on the user's machine to discover which is fastest.
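
A minimal sketch of that pick-the-fastest-at-runtime idea (the `cmult_scalar_inplace_sse` name, the `clock()`-based timing, and the scratch-buffer setup are illustrative assumptions, not part of the original answer):

#include <ctime>   // clock()

// Function-pointer type matching the existing cmult_scalar_inplace signature.
typedef void (*cmult_fn)(fcomplex *, const int, const int, const float *);

// Time one variant over a representative scratch buffer (the buffer contents get modified).
static clock_t time_variant(cmult_fn fn, fcomplex *scratch, int n, const float *scalar)
{
    const clock_t t0 = clock();
    fn(scratch, 0, n, scalar);
    return clock() - t0;
}

// Called once at startup: pick whichever variant runs faster on this machine.
// cmult_scalar_inplace_sse is a hypothetical name for an SSE version like the one above.
cmult_fn pick_fastest(fcomplex *scratch, int n, const float *scalar)
{
    const clock_t t_plain = time_variant(cmult_scalar_inplace, scratch, n, scalar);
    const clock_t t_sse = time_variant(cmult_scalar_inplace_sse, scratch, n, scalar);
    return (t_sse < t_plain) ? cmult_scalar_inplace_sse : cmult_scalar_inplace;
}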

What I'd consider is using different unrolling sizes, and also playing with the alignment of scalar and values (the memory access pattern can have a big influence on caching effects).

For the problem of the unwanted serialization, one option is to look at the code generated for something like:

float r0 = values[i].real, i0 = values[i].imag, s0 = scalar[i];
float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
values[i].real = r0*s0; values[i].imag = i0*s0;
values[i+1].real = r1*s1; values[i+1].imag = i1*s1;
values[i+2].real = r2*s2; values[i+2].imag = i2*s2;

because here the optimizer has in theory a little bit more freedom.

弄潮 2024-12-03 17:56:16

Do you have access to Intel's Integrated Performance Primitives (IPP)? They have a number of functions that handle cases like this with pretty decent performance. You might have some success with your particular problem, but I would not be surprised if your compiler already does a decent job of optimizing the code.
