Cache bandwidth per clock cycle on modern CPUs
What is the cache access speed of modern CPUs? How many bytes can be read from or written to memory per processor clock cycle on an Intel P4, Core 2, Core i7, or AMD processor?

Please give both theoretical numbers (the width of the load/store units and their throughput in uops per cycle) and practical ones (memcpy speed tests or STREAM benchmark results), if any.

P.S. This question is about the maximum rate of load/store instructions in assembly. There is a theoretical load rate (every instruction issued per cycle is the widest possible load), but a processor may only sustain some fraction of it - that is the practical limit on loading.
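For the practical side, one quick way to get a number is a small memcpy loop over a buffer sized to fit the cache level of interest. Below is a minimal sketch, not the STREAM benchmark itself; the 16 KiB buffer size and repetition count are arbitrary choices, and it assumes a POSIX system and a GCC/Clang-style compiler (for the empty asm barrier).

```c
/* Minimal memcpy bandwidth sketch (not STREAM). Compile e.g. gcc -O2 bw.c -o bw.
 * Shrink or grow `size` below/above your L1/L2/L3 capacities to probe each level. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t size = 16 * 1024;              /* 16 KiB: should fit in L1 on most cores */
    int reps = 100000;
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) return 1;
    memset(src, 1, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, size);
        __asm__ __volatile__("" ::: "memory");  /* keep the compiler from eliding the copy */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = (double)size * reps * 2.0;   /* count both the read and the write stream */
    printf("%.2f GB/s\n", bytes / secs / 1e9);
    free(src);
    free(dst);
    return 0;
}
```

Dividing the reported figure by the core clock gives bytes per cycle; counting both streams, a core that can move 16 bytes in and 16 bytes out per clock would ideally show about 32 bytes per cycle at L1 sizes.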
2 Answers
For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

128 bits = 16 bytes per clock read

AND

128 bits = 16 bytes per clock write

(Can a read and a write be combined in a single cycle? Can the L2 and L3 read and write ports be used in the same clock?)

Latency (in clock ticks) can be measured with CPU-Z's latency tool or with lmbench's lat_mem_rd; both use a long linked-list walk, which is needed to measure modern out-of-order cores like the Intel Core i7 correctly.

A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program is in its man page or here on SO.
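To illustrate why those tools use a linked-list walk: each load in the chain depends on the result of the previous one, so an out-of-order core cannot overlap the loads, and the time per step approaches the load-to-use latency of whichever cache level holds the working set. The following is a rough sketch of that idea, not lat_mem_rd's actual code; the 32 KiB working set and step count are assumptions (32 KiB is a typical L1 data cache size).

```c
/* Pointer-chasing latency sketch in the spirit of lat_mem_rd.
 * Serialized dependent loads defeat out-of-order overlap, so ns per step
 * approximates the latency of the cache level that holds the ring. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 4096;                        /* 4096 pointers * 8 B = 32 KiB working set */
    void **ring = malloc(n * sizeof *ring);
    size_t *idx = malloc(n * sizeof *idx);
    if (!ring || !idx) return 1;

    /* Shuffle the visit order so the hardware prefetcher cannot predict the walk. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        ring[idx[i]] = &ring[idx[(i + 1) % n]];   /* each slot points to the next in the permutation */

    long steps = 100 * 1000 * 1000;
    void **p = &ring[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < steps; i++)
        p = (void **)*p;                    /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per load (sink=%p)\n", ns / steps, (void *)p);  /* print p so the walk is not optimized away */
    free(ring);
    free(idx);
    return 0;
}
```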
The widest reads/writes are 128-bit (16-byte) SSE loads/stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.

I suspect there is a more specific question lurking here somewhere - what is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?
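To make the "128-bit SSE load/store" point concrete, here is a minimal copy-loop sketch using SSE2 intrinsics. It is an illustration, not a tuned memcpy: it assumes both buffers are 16-byte aligned and the size is a multiple of 16, and it omits the prefetching and non-temporal stores a real implementation would consider. On a core that can issue one 16-byte load and one 16-byte store per clock, the steady-state ceiling of such a loop is 16 bytes read plus 16 bytes written per cycle.

```c
/* 16-bytes-per-iteration copy loop using SSE2 intrinsics.
 * Assumes dst and src are 16-byte aligned and size is a multiple of 16. */
#include <emmintrin.h>   /* SSE2: _mm_load_si128 / _mm_store_si128 */
#include <stddef.h>

void copy_sse(void *dst, const void *src, size_t size) {
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < size / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* one 128-bit load  */
        _mm_store_si128(&d[i], v);          /* one 128-bit store */
    }
}
```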