Cache bandwidth per tick for modern CPUs

Posted 2024-08-22 22:21:16

How fast is cache access on modern CPUs? How many bytes can an Intel P4, Core 2, Core i7, or AMD processor read from or write to memory per clock tick?

Please answer with both theoretical numbers (the width of the load/store unit and its throughput in uops/tick) and practical numbers (even memcpy speed tests or STREAM benchmark results), if any.

P.S. This question is about the maximum rate of load/store instructions in assembly. There may be a theoretical load rate (every instruction issued per tick is the widest possible load), but the processor can sustain only part of it; that part is the practical load limit.

Comments (2)

倾城°AllureLove 2024-08-29 22:21:16

For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache. 

128 bits = 16 bytes per clock for reads
AND
128 bits = 16 bytes per clock for writes
(Can a read and a write be combined in a single cycle?)
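
To turn per-clock figures like these into an absolute bandwidth, multiply the bytes moved per clock by the core frequency. A minimal sketch in C; the function name `peak_bandwidth_gbs` and any clock frequency passed to it are illustrative assumptions, not figures from the paper:

```c
#include <assert.h>

/* Theoretical peak bandwidth in GB/s (1e9 bytes/s) of one port that
   moves `port_bits` bits per clock on a core running at `clock_ghz` GHz.
   bytes per clock * 1e9 clocks/s per GHz = GB/s. */
double peak_bandwidth_gbs(unsigned port_bits, double clock_ghz)
{
    assert(port_bits % 8 == 0);   /* port width must be whole bytes */
    assert(clock_ghz > 0.0);
    return (port_bits / 8) * clock_ghz;
}
```

For example, `peak_bandwidth_gbs(128, 3.0)` gives 48 GB/s for a single 128-bit port at an assumed 3.0 GHz. This is an upper bound for one port only; it says nothing about whether the read and write ports can both be busy in the same cycle.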

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can the L2 and L3 read and write ports both be used in a single clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.

Latency in clock ticks; some values were measured with CPU-Z's latencytool or with lmbench's lat_mem_rd. Both walk a long linked list, which is needed to measure modern out-of-order cores such as the Intel Core i7 correctly. In the memory column, "c" stands for cycles.

CPU          L1 (cycles)  L2 (cycles)  L3 (cycles)  Memory              Source
Core 2       3            15           --           66 ns               http://www.anandtech.com/show/2542/5
Core i7-xxx  4            11           39           40c+67ns            http://www.anandtech.com/show/2542/5
Itanium      1            5-6          12-17        130-1000 cycles
Itanium2     2            6-10         20           35c+160ns           http://www.7-cpu.com/cpu/Itanium2.html
AMD K8                    12                        40-70c+64ns         http://www.anandtech.com/show/2139/3
Intel P4     2            19           43           200-210 cycles      http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k  3            20                        180 cycles          --//--
AthlonFX-51  3            13                        125 cycles          --//--
POWER4       4            12-20        ??           hundreds of cycles  --//--
Haswell      4            11-12        36           36c+57ns            http://www.realworldtech.com/haswell-cpu/5/

Another good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program can be found in its man page or here on SO.
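
The linked-list walk mentioned above can be sketched roughly as follows. This is not lat_mem_rd's actual code: the names `build_chain`, `walk` and `chase_latency_ns` are made up for illustration, and timing uses the standard `clock()` function rather than cycle counters. Each load depends on the previous one, so an out-of-order core cannot overlap them.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

volatile size_t chase_sink;   /* keeps the walk from being optimized away */

/* Fill next[0..n-1] with a random permutation that forms one single cycle
   (Sattolo's algorithm), so a walk visits every slot before repeating. */
void build_chain(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;          /* j in [0, i-1] */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final slot. */
size_t walk(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];                            /* address depends on last load */
    return p;
}

/* Average nanoseconds of CPU time per dependent load for a working set of
   n slots. Returns -1.0 if the allocation fails. */
double chase_latency_ns(size_t n, size_t steps)
{
    assert(n >= 2 && steps > 0);
    size_t *next = malloc(n * sizeof *next);
    if (!next)
        return -1.0;
    build_chain(next, n);

    clock_t c0 = clock();
    chase_sink = walk(next, 0, steps);
    clock_t c1 = clock();

    free(next);
    return (double)(c1 - c0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

Calling `chase_latency_ns` with working sets that fit in L1, then L2, then L3, then only in main memory should show the latency stepping up at each level; multiplying the result by the clock frequency in GHz converts it to cycles. Note that the slots here are one `size_t` each, so several share a cache line; a more careful version would pad each node to a full line.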

无远思近则忧 2024-08-29 22:21:16

The widest reads and writes are 128-bit (16-byte) SSE loads and stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2 to 4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.

I suspect there's a more specific question lurking here somewhere: what is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?
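
If the goal is a practical number rather than a theoretical one, a rough memcpy timing loop along these lines is one way to get it. This is only a sketch: the function name `memcpy_bandwidth_gbs` is made up for illustration, timing uses the standard `clock()` function, and the result depends heavily on whether the buffers fit in L1, L2, L3 or only in main memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

volatile unsigned char copy_sink;   /* makes each copy's result observable */

/* Copy `bytes` bytes `reps` times and return the copy rate in GB/s
   (1e9 bytes copied per second of CPU time). Returns -1.0 if allocation
   fails or the run was too short to time. */
double memcpy_bandwidth_gbs(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 1, bytes);   /* touch both buffers first so that */
    memset(dst, 0, bytes);   /* page faults are not part of the timing */

    clock_t c0 = clock();
    for (int r = 0; r < reps; r++) {
        src[0] = (unsigned char)r;   /* change the input on every pass */
        memcpy(dst, src, bytes);
        copy_sink = dst[0];          /* read the output on every pass */
    }
    clock_t c1 = clock();

    free(src);
    free(dst);

    double secs = (double)(c1 - c0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return -1.0;
    return (double)bytes * (double)reps / secs / 1e9;
}
```

Dividing the returned GB/s by the core clock in GHz gives bytes copied per clock, which can be compared against the 16 bytes per clock implied by a 128-bit load or store. Since every copied byte is both read and written, the combined load plus store traffic is twice the copy rate.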
