Fast multiplication of values in an array
Is there a fast way to multiply the values of a float array in C++, to optimize this function (where count is a multiple of 4):

void multiply(float* values, float factor, int count)
{
    for (int i = 0; i < count; i++)
    {
        *values *= factor;
        values++;
    }
}
A solution must work on Mac OS X and Windows, Intel and non-Intel. Think SSE, vectorization, compiler (gcc vs. MSVC).
If you want your code to be cross-platform, then either you're going to have to write platform-independent code, or you're going to have to write a load of #ifdefs. Have you tried some manual loop unrolling to see if it makes any difference?
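One way to structure the cross-platform dispatch the answer describes is a portable scalar loop as the default, with platform-specific paths selected by #ifdef. This is only a sketch: the SIMD bodies are left as placeholders, and the feature macros shown (__SSE__, __ALTIVEC__) are the GCC-style ones.

```cpp
// Portable fallback: works everywhere, and the compiler may vectorize it.
static void multiply_scalar(float* values, float factor, int count)
{
    for (int i = 0; i < count; i++)
        values[i] *= factor;
}

void multiply(float* values, float factor, int count)
{
#if defined(__SSE__)
    // Intel/AMD path: an SSE implementation would go here.
    multiply_scalar(values, factor, count);  // placeholder
#elif defined(__ALTIVEC__)
    // PowerPC path: an AltiVec implementation would go here.
    multiply_scalar(values, factor, count);  // placeholder
#else
    multiply_scalar(values, factor, count);
#endif
}
```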
Since you know the count is a multiple of 4, you can unroll your loop...
Disclaimer: obviously, this won't work on iPhone, iPad, Android, or their future equivalents.
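A minimal sketch of what a 4x unrolled version might look like, relying on the question's guarantee that count is a multiple of 4:

```cpp
// Process four elements per iteration; valid only because
// count is guaranteed to be a multiple of 4.
void multiply(float* values, float factor, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        values[i]     *= factor;
        values[i + 1] *= factor;
        values[i + 2] *= factor;
        values[i + 3] *= factor;
    }
}
```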
Have you thought of OpenMP?
Most modern computers have multi-core CPUs, and nearly every major compiler seems to have OpenMP built in. You gain speed at barely any cost.
See Wikipedia's article on OpenMP.
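A sketch of the OpenMP approach: a single pragma on the original loop. Compilers without OpenMP support simply ignore the pragma, so the code stays portable; enabling it needs -fopenmp on GCC/Clang or /openmp on MSVC.

```cpp
// The original loop with OpenMP work-sharing: iterations are split
// across threads. Safe here because each iteration touches a
// distinct element of the array.
void multiply(float* values, float factor, int count)
{
    #pragma omp parallel for
    for (int i = 0; i < count; i++)
        values[i] *= factor;
}
```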
The best solution is to keep it simple and let the compiler optimize it for you.
GCC knows about SSE, SSE2, AltiVec, and more.
If your code is too complex, your compiler won't be able to optimize it for every possible target.
As you mentioned, there are numerous architectures out there with SIMD extensions, and SIMD is probably your best bet when it comes to optimization. They are all, however, platform specific, and C and C++ as languages are not SIMD-friendly.
The first thing you should try, however, is enabling the SIMD-specific flags for your given build. The compiler may recognize patterns that can be optimized with SIMD.
The next thing is to write platform-specific SIMD code using compiler intrinsics or assembly where appropriate. You should, however, keep a portable non-SIMD implementation for platforms that do not have an optimized version, and use #ifdefs to enable SIMD on the platforms that support it.
Lastly, at least on ARM (I'm not sure about Intel), be aware that smaller integer and floating-point types allow a larger number of parallel operations per single SIMD instruction.
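As an illustration (not the answerer's code), here is what the intrinsics route could look like for SSE, with a scalar fallback behind an #ifdef as the answer suggests. _mm_loadu_ps/_mm_storeu_ps are used so no 16-byte alignment of the array needs to be assumed.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics (Intel/AMD only)

// Multiplies four floats per SSE instruction; count must be a multiple of 4.
void multiply_simd(float* values, float factor, int count)
{
    __m128 f = _mm_set1_ps(factor);           // broadcast factor into all four lanes
    for (int i = 0; i < count; i += 4)
    {
        __m128 v = _mm_loadu_ps(values + i);  // unaligned load of four floats
        v = _mm_mul_ps(v, f);                 // four multiplies in one instruction
        _mm_storeu_ps(values + i, v);         // store the results back
    }
}
#else
// Portable scalar fallback for non-Intel targets.
void multiply_simd(float* values, float factor, int count)
{
    for (int i = 0; i < count; i++)
        values[i] *= factor;
}
#endif
```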
I think there is not a lot you can do that makes a big difference. Maybe you can speed it up a little with OpenMP or SSE. But modern CPUs are quite fast already. In some applications memory bandwidth/latency is actually the bottleneck, and it gets worse. We already have three levels of cache and need smart prefetch algorithms to avoid huge delays. So it makes sense to think about memory access patterns as well. For example, if you implement such a multiply and an add and use them one after the other, you're basically passing over the block of memory twice. Depending on the vector's size, it might not fit into the L1 cache, in which case passing over it twice adds some extra time. This is obviously bad, and you should try to keep memory accesses "local". In this case, a single loop is likely to be faster. As a rule of thumb: try to access memory linearly, and try to access memory "locally", by which I mean try to reuse the data that is already in the L1 cache. Just an idea.
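The answer's code snippets did not survive, but the idea can be sketched like this (the function names are illustrative): two separate passes stream over the array twice, while the fused loop touches each element once, while it is still in the L1 cache.

```cpp
// Two passes: each function streams over the whole array once,
// so back-to-back calls read and write every element twice.
void multiply(float* v, float factor, int n)
{
    for (int i = 0; i < n; i++) v[i] *= factor;
}
void add(float* v, float term, int n)
{
    for (int i = 0; i < n; i++) v[i] += term;
}

// Fused: one pass, each element multiplied and added while still in cache.
void multiply_add(float* v, float factor, float term, int n)
{
    for (int i = 0; i < n; i++)
        v[i] = v[i] * factor + term;
}
```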