Fully optimized memcpy/memmove for the Core 2 or Core i7 architecture?
The theoretical maximum memory bandwidth of a Core 2 processor with dual-channel DDR3 memory is impressive: according to the Wikipedia article on the architecture, 10+ or 20+ gigabytes per second. However, stock memcpy() calls do not attain this. (3 GB/s is the highest I've seen on such systems.) Most likely this is because an OS vendor's memcpy() has to perform reasonably across a wide range of processor brands and lines, rather than being tuned to the characteristics of any one processor line.
My question: Is there a freely available, highly tuned version for Core 2 or Core i7 processors that can be used in a C program? I'm sure I'm not the only person who needs one, and it would be a big waste of effort for everyone to micro-optimize their own memcpy().
Comments (3)
When measuring bandwidth, did you take into account that memcpy is both a read and a write, so 3 GB/s of memory copied is actually 6 GB/s of bandwidth?
Remember, the quoted bandwidth is a theoretical maximum; real-world use will be much lower. For instance, a single page fault will drop your bandwidth to the MB/s range.
memcpy/memmove are compiler intrinsics and will usually be inlined to rep movsd (or the appropriate SSE instructions if your compiler can target that). It may be impossible to improve on this codegen, since modern CPUs handle rep instructions like this very, very well.
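To make the read-plus-write accounting above concrete, here is a minimal C sketch of a memcpy timing helper. The function names (copy_rate_gbs, bus_traffic_gbs, time_memcpy) are hypothetical, the timer is the standard clock() (which measures CPU time and is fairly coarse), and 1 GB is taken as 1e9 bytes; treat it as an illustration of the arithmetic, not a rigorous benchmark.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy rate in GB/s (1 GB = 1e9 bytes). */
static double copy_rate_gbs(size_t bytes, int repeats, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * repeats / seconds / 1e9;
}

/* Every copied byte is read once and written once, so the traffic on
 * the memory bus is at least twice the copy rate. */
static double bus_traffic_gbs(double copy_gbs)
{
    return 2.0 * copy_gbs;
}

/* Keeps the compiler from discarding the copies as dead stores. */
static volatile unsigned char sink;

/* Hypothetical helper: times `repeats` memcpy calls over buffers of
 * `bytes` bytes. Returns elapsed CPU seconds, or -1.0 on bad input or
 * allocation failure. The result can be 0.0 for very small workloads. */
static double time_memcpy(size_t bytes, int repeats)
{
    if (bytes == 0 || repeats <= 0)
        return -1.0;

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    clock_t start = clock();
    for (int i = 0; i < repeats; i++) {
        memcpy(dst, src, bytes);
        sink = dst[(size_t)i % bytes];  /* read back one byte per pass */
    }
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For a meaningful number, the buffers should be much larger than the last-level cache (for example a few hundred megabytes); otherwise the loop measures cache bandwidth rather than memory bandwidth. A measured copy rate of 3 GB/s then corresponds to roughly 6 GB/s of bus traffic.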
If you specify /arch:SSE2 to MSVC it should provide you with a tuned memcpy (at least, mine does).
Failing that, use the SSE aligned load/store intrinsics yourself to copy the memory in large chunks, employing a Duff's Device of word reads where necessary to deal with the head and tail of the data and bring it to an aligned boundary. You'll need to use the cache management intrinsics as well to get good performance.
Your limiting factor is probably cache misses and memory-controller bandwidth (the northbridge on Core 2, on-die on Core i7), rather than CPU cycles. Given that there's always going to be lots of other traffic on the memory bus, I'm usually happy to get to about 90% of theoretical memory bandwidth throughput in such operations.
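The hand-written approach described above might look roughly like the following C sketch. Several choices here are assumptions rather than anything stated in the answer: the name sse2_stream_copy is hypothetical, "cache management intrinsics" is read as software prefetch plus non-temporal (streaming) stores, the 256-byte prefetch distance is an untuned guess, and plain memcpy stands in for the Duff's Device on the unaligned head and tail. Only the destination is aligned; the source is read with unaligned loads because the two pointers may not share the same alignment. It needs an x86 target with SSE2 (the default on x86-64; use -msse2 or /arch:SSE2 on 32-bit builds).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not a drop-in memcpy replacement.
 * The source and destination regions must not overlap. */
static void *sse2_stream_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: copy single bytes until the destination is 16-byte aligned. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Body: one 64-byte cache line per iteration. */
    while (n >= 64) {
        /* Assumed prefetch distance of 256 bytes; only issued while the
         * prefetched address is still inside the source buffer. */
        if (n >= 320)
            _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA);

        __m128i a = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

        /* Non-temporal stores bypass the cache; they require the
         * 16-byte alignment established by the head copy. */
        _mm_stream_si128((__m128i *)(d + 0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        d += 64;
        s += 64;
        n -= 64;
    }

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();

    /* Tail: fewer than 64 bytes remain. */
    memcpy(d, s, n);
    return dst;
}
```

Streaming stores tend to pay off only for copies much larger than the cache, where the destination will not be read again soon; for small or soon-reused buffers the stock memcpy is likely the better choice, so it is worth measuring both on the target machine.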
You could write your own. Or try using the Intel optimizing compiler to target the architecture directly.
Intel also produces a tool called VTune (compiler and language independent) for optimizing applications.
Here's an article on optimising a game engine.