Cache bandwidth per clock cycle on modern CPUs
What is the cache access speed of modern CPUs? How many bytes can be read from or written to memory per processor clock cycle on an Intel P4, Core 2, Core i7, or AMD processor?

Please give both theoretical numbers (the width of the load/store units and their throughput in uops per cycle) and practical ones (memcpy speed tests or STREAM benchmark results), if any.

P.S. This question is about the maximum rate of load/store instructions in assembly. There is a theoretical load rate (every instruction issued per cycle is the widest possible load), but a processor may only sustain some fraction of it - that is the practical limit on loading.
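For the practical side, one quick way to get a number is a small memcpy loop over a buffer sized to fit the cache level of interest. Below is a minimal sketch, not the STREAM benchmark itself; the 16 KiB buffer size and repetition count are arbitrary choices, and it assumes a POSIX system and a GCC/Clang-style compiler (for the empty asm barrier).

```c
/* Minimal memcpy bandwidth sketch (not STREAM). Compile e.g. gcc -O2 bw.c -o bw.
 * Shrink or grow `size` below/above your L1/L2/L3 capacities to probe each level. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t size = 16 * 1024;              /* 16 KiB: should fit in L1 on most cores */
    int reps = 100000;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, size);
        __asm__ __volatile__("" ::: "memory");  /* keep the compiler from eliding the copy */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = (double)size * reps * 2.0;   /* count both the read and the write stream */
    printf("%.2f GB/s\n", bytes / secs / 1e9);
    free(src);
    free(dst);
    return 0;
}
```

Dividing the reported figure by the core clock gives bytes per cycle; counting both streams, a core that can move 16 bytes in and 16 bytes out per clock would ideally show about 32 bytes per cycle at L1 sizes.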
2 Answers
For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

128 bits = 16 bytes per clock read

AND

128 bits = 16 bytes per clock write

(Can a read and a write be combined in a single cycle? Can the L2 and L3 read and write ports be used in the same clock?)

Latency (in clock ticks) can be measured with CPU-Z's latency tool or with lmbench's lat_mem_rd; both use a long linked-list walk, which is needed to measure modern out-of-order cores like the Intel Core i7 correctly.

A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program is in its man page or here on SO.
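To illustrate why those tools use a linked-list walk: each load in the chain depends on the result of the previous one, so an out-of-order core cannot overlap the loads, and the time per step approaches the load-to-use latency of whichever cache level holds the working set. The following is a rough sketch of that idea, not lat_mem_rd's actual code; the 32 KiB working set and step count are assumptions (32 KiB is a typical L1 data cache size).

```c
/* Pointer-chasing latency sketch in the spirit of lat_mem_rd.
 * Serialized dependent loads defeat out-of-order overlap, so ns per step
 * approximates the latency of the cache level that holds the ring. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 4096;                        /* 4096 pointers * 8 B = 32 KiB working set */
    void **ring = malloc(n * sizeof *ring);
    size_t *idx = malloc(n * sizeof *idx);
    if (!ring || !idx) return 1;

    /* Shuffle the visit order so the hardware prefetcher cannot predict the walk. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        ring[idx[i]] = &ring[idx[(i + 1) % n]];   /* each slot points to the next in the permutation */

    long steps = 100 * 1000 * 1000;
    void **p = &ring[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < steps; i++)
        p = (void **)*p;                    /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per load (sink=%p)\n", ns / steps, (void *)p);  /* print p so the walk is not optimized away */
    free(ring);
    free(idx);
    return 0;
}
```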
The widest reads/writes are 128-bit (16-byte) SSE loads/stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.

I suspect there is a more specific question lurking here somewhere - what is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?
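To make the "128-bit SSE load/store" point concrete, here is a minimal copy-loop sketch using SSE2 intrinsics. It is an illustration, not a tuned memcpy: it assumes both buffers are 16-byte aligned and the size is a multiple of 16, and it omits the prefetching and non-temporal stores a real implementation would consider. On a core that can issue one 16-byte load and one 16-byte store per clock, the steady-state ceiling of such a loop is 16 bytes read plus 16 bytes written per cycle.

```c
/* 16-bytes-per-iteration copy loop using SSE2 intrinsics.
 * Assumes dst and src are 16-byte aligned and size is a multiple of 16. */
#include <emmintrin.h>   /* SSE2: _mm_load_si128 / _mm_store_si128 */
#include <stddef.h>

void copy_sse(void *dst, const void *src, size_t size) {
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < size / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* one 128-bit load  */
        _mm_store_si128(&d[i], v);          /* one 128-bit store */
    }
}
```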