Is memcpy accelerated in some way on the iPhone?
A few days ago I was writing some code and noticed that copying a block of RAM with memcpy was much, much faster than copying it in a for loop.
I have no measurements right now (maybe I will take some later), but as I remember, the same block of RAM that took about 300 ms or more to copy in a for loop was copied by memcpy in 20 ms or less.
Is it possible that memcpy is hardware accelerated?
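Since the question has no measurements, here is a minimal sketch of how the two copies could be timed against each other. The helper names (copy_bytes_loop, copy_bytes_memcpy, time_copy_ms) are made up for illustration and are not from the original post. clock() measures CPU time at coarse resolution, so the buffer and repetition count need to be large enough to give a usable reading.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: the naive byte-by-byte copy described in the question. */
void copy_bytes_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: wraps memcpy so both copies share one signature. */
void copy_bytes_memcpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Returns the CPU time, in milliseconds, spent running `reps` copies of n bytes. */
double time_copy_ms(copy_fn fn, unsigned char *dst, const unsigned char *src,
                    size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src, n);
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy_ms once with each copy function on the same buffers gives two numbers to compare. The result depends heavily on compiler flags: with optimization enabled, a compiler may recognize the loop and replace it with a call to memcpy, in which case the gap can shrink or disappear.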
Comments (5)
Well, I can't speak about Apple's compilers, but gcc definitely treats memcpy as a builtin.
The built-in implementation of memcpy tends to be optimized pretty heavily for the platform in question, so it will usually be faster than a naive for loop. Some optimizations include copying as much as possible at a time (not single bytes but whole words, or even more if the processor supports it), some degree of loop unrolling, etc. Of course, the best optimization strategy depends on the platform, so it's usually best to stick to the built-in function.
In most cases it's written by far more experienced people than the user anyway.
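To make the two optimizations named above concrete, here is a rough sketch of a copy that moves whole 64-bit words and unrolls the loop four times. This is only an illustration, not how gcc's builtin or any particular libc implements memcpy. The function name copy_words_unrolled is made up, and taking uint64_t pointers is a simplifying assumption that sidesteps the alignment and odd-length handling a real implementation needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `nwords` 64-bit words, four per iteration.
   Assumes the two buffers do not overlap. */
void copy_words_unrolled(uint64_t *dst, const uint64_t *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;

    /* Main loop, unrolled by 4: one loop test per 32 bytes moved,
       instead of one per byte as in a naive char loop. */
    for (; i + 4 <= nwords; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Tail: the 0 to 3 words left over after the unrolled loop. */
    for (; i < nwords; i++)
        dst[i] = src[i];
}
```

The point of both tricks is the same: fewer loads, stores, and branch checks per byte copied.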
Sometimes mem-to-mem DMA is implemented in processors, so yes, if such a thing exists in the iPhone, then it's likely that memcpy() takes advantage of it. Even if it were not implemented, I'm not surprised by the 15-to-1 advantage that memcpy() seems to have over your character-by-character copy.
Moral 1: always prefer memcpy() to strcpy() if possible.
Moral 2: always prefer memmove() to memcpy(); always.
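The answer does not say why memmove() should be preferred, but one likely reason is overlap: the C standard leaves memcpy() undefined when source and destination overlap, while memmove() is required to handle that case. A small sketch, with a made-up helper name (shift_right):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shifts the first `len` bytes of buf right by `shift`.
   buf must have room for len + shift bytes. Source and destination overlap
   whenever shift < len, so memmove is required here; calling memcpy in that
   case would be undefined behavior. */
void shift_right(char *buf, size_t len, size_t shift)
{
    assert(buf != NULL);
    memmove(buf + shift, buf, len);
}
```

For example, with a 16-byte buffer holding "abcdef", calling shift_right(buf, 7, 2) moves the six letters and the terminating null two places to the right, leaving "ababcdef" in the buffer.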
The newest iPhone has SIMD instructions on the ARM chip, allowing for 4 calculations at the same time. This includes moving memory around.
Also, if you create a highly optimized memcpy, you'd typically unroll loops to a certain degree and implement it as a Duff's device.
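For readers who have not seen one, here is a sketch of a byte copy written as a Duff's device: a switch jumps into the middle of an 8-way unrolled do-while loop so that the leftover bytes are handled on the first pass. The function name copy_duff is made up for illustration. Note that the original Duff's device wrote to a single fixed output register, whereas this variant advances the destination pointer as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies n bytes using a Duff's device, unrolled by 8.
   Assumes the two buffers do not overlap. */
void copy_duff(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n == 0)
        return;

    /* Number of trips through the do-while, counting the partial first one. */
    size_t passes = (n + 7) / 8;

    /* Jump into the loop body so the first pass copies n % 8 bytes
       (or a full 8 when n is a multiple of 8). Every case falls through. */
    switch (n % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

Modern compilers usually unroll simple loops on their own, so this is mostly of historical interest; whether it helps on a given platform would have to be measured.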
It looks like the ARM CPU has instructions that can copy 48 bits per access. I'd bet the lower overhead of doing it in larger chunks is what you're seeing.