从一个数组中减去另一个数组的最高效方法
我有以下代码,这是我的应用程序的一部分的瓶颈。我所做的就是从另一个数组中减去数组。这两个数组都有大约 100000 个元素。我正在尝试找到一种方法来提高性能。
var
Array1, Array2 : array of integer;
.....
// Code that fills the arrays
.....
for ix := 0 to length(array1)-1
Array1[ix] := Array1[ix] - Array2[ix];
end;
有人有建议吗?
I have the following code which is the bottleneck in one part of my application. All I do is subtract on Array from another. Both of these arrays have more around 100000 elements. I'm trying to find a way to make this more performant.
var
Array1, Array2 : array of integer;
.....
// Code that fills the arrays
.....
for ix := 0 to length(array1)-1
Array1[ix] := Array1[ix] - Array2[ix];
end;
Does anybody have a suggestion?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
在多个线程上运行这个,使用这么大的数组将获得线性加速。正如他们所说,这是令人尴尬的平行。
Running this on multiple threads, with that big an array will net linear speed-up. It's embarrassingly parallel as they say.
在更多线程上运行减法听起来不错,但是 100K 整数太阳吸收不会占用大量 CPU 时间,所以可能是线程池...但是设置线程也有很多开销,因此短数组在并行线程中的生产率会比在并行线程中低只有一个(主)线程!
您是否关闭了编译器设置、溢出和范围检查?
您可以尝试使用asm rutine,它非常简单...
类似:
它可以更快...
编辑:添加了SIMD指令的过程。
该程序请求SSE CPU支持。它可以取 XMM 寄存器中的 4 个整数并立即相减。也可以使用
movdqa
代替movdqu
它更快,但您必须首先确保 16 字节对齐。您还可以像我的第一个 asm 案例一样取消 XMM 角色。 (我对速度测量很感兴趣。:))Running subtraction on more threads sounds good, but 100K integer sunstraction don't take a lot of CPU time, so maybe threadpool... However settings threads have also a lot of overhead, so short arrays will have slower productivity in parallel threads than in only one (main) thread!
Did you switch off in compiler settings, overflow and range checking?
You can try to use asm rutine, it is very simple...
Something like:
It can be much faster...
EDIT: Added procedure with SIMD instructions.
This procedure request SSE CPU support. It can take 4 integers in XMM register and subtract at once. There is also possibility to use
movdqa
insteadmovdqu
it is faster, but you must first to ensure 16 byte aligment. You can also unrole the XMM par like in my first asm case. (I'm interesting about speed measurment. :) )我对这个简单案例中的速度优化很好奇。
所以我做了 6 个简单的程序,并在数组大小 100000 时测量 CPU 滴答声和时间;
检查图片和代码上的结果以获取更多信息。
要获得 16 字节内存对齐,首先删除文件“FastMM4Options.inc”指令中的点 { $.define Align16Bytes}
!
...
最快的汇编程序,具有 8 次展开的 SIMD 指令,仅需要 68us,比 Pascal 程序快约 4 倍。
正如我们所看到的,Pascal 循环过程可能并不重要,在 2,4GHz CPU 上进行 100000 次减法时,只需要大约 277us(溢出和范围检查)。
那么这段代码不会成为瓶颈吗?
I was very curious about speed optimisation in this simple case.
So I have made 6 simple procedures and measure CPU tick and time at array size 100000;
Check results on picture and code for more information.
To get 16 byte memory alignment first delite the dot in file 'FastMM4Options.inc' directive {$.define Align16Bytes}
!
...
The fastest asm procedure with 8 times unrolled SIMD instructions takes only 68us and is about 4 time faster than Pascal procedure.
As we can see the Pascal loop procedure probably isn't critical, it takes only about 277us (Overflow and Range checking off) on 2,4GHz CPU at 100000 subtractions.
So this code can't be bottleneck?
我不是汇编专家,但我认为如果您不考虑 SIMD 指令或并行处理,则以下内容接近最佳,后者可以通过将数组的部分传递给函数来轻松完成。
喜欢
线程1:子数组(ar1[0], ar2[0], 50);
线程2: SubArray(ar1[50], ar2[50], 50);
I'm not assembly expert but I think the following are near optimal if you don't take into account SIMD instructions or parallel processing, the later can be easily accomplished by passing portions of the array to the function.
like
Thread1: SubArray(ar1[0], ar2[0], 50);
Thread2: SubArray(ar1[50], ar2[50], 50);
这不是您问题的真正答案,但我会研究是否可以在用值填充数组的某个时间进行减法。我什至可以选择考虑在内存中使用第三个数组来存储减法的结果。在现代计算中,内存的“成本”远低于对内存执行额外操作所需的时间“成本”。
理论上,当减法可以在值仍在寄存器或处理器缓存中时完成时,您至少会获得一点性能,但在实践中,您可能会偶然发现一些可以增强整个算法性能的技巧。
It's not a real answer to your question, but I would investigate if I could do the subtraction already at some time while filling the arrays with values. I would optionally even consider a third array in memory to store the result of the subtraction. In modern computing, the 'cost' of memory is considerably lower than the 'cost' of the time it takes to perform an extra action on memory.
In theory you'll gain at least a little performance when the subtraction can be done while the values are still in registers or processor cache, but in practice you just might stumble upon a few tricks that could enhance performance of the entire algorithm.