Where is the L1 memory cache of Intel x86 processors documented?
I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure. In particular, most web sites (including Intel.com) that post processor specs do not include any reference to L1 cache. Is this because the L1 cache does not exist or is this information for some reason considered unimportant? Are there any articles or discussions about the elimination of the L1 cache?
[edit]
After running various tests and diagnostic programs (mostly those discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still haven't found a clear explanation as to why this information is so difficult to come by. My current working theory is that the details of L1 caching are now being treated as trade secrets by Intel.
Comments (7)
It is near impossible to find specs on Intel caches. When I was teaching a class on caches last year, I asked friends inside Intel (in the compiler group) and they couldn't find specs.
But wait!!! Jed, bless his soul, tells us that on Linux systems, you can squeeze lots of information out of the kernel:
This will give you associativity, set size, and a bunch of other information (but not latency).
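The command itself is missing from this copy of the answer. As a rough sketch of the same idea, and assuming the usual Linux sysfs layout under /sys/devices/system/cpu/cpuN/cache/indexM/, the attributes can be read directly. The function names (`parse_sysfs_size`, `read_cache_attr`, `print_cpu_caches`) are made up for illustration and are not part of any library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Convert a sysfs size string such as "32K" or "6144K" to bytes.
   Returns -1 if the string does not start with a non-negative number. */
long parse_sysfs_size(const char *text)
{
    char *end;
    long value = strtol(text, &end, 10);
    if (end == text || value < 0)
        return -1;
    if (*end == 'K') return value * 1024L;
    if (*end == 'M') return value * 1024L * 1024L;
    return value;
}

/* Read one attribute (e.g. "size") of one cache index of one CPU into buf.
   Returns 0 on success, -1 if the file is missing or unreadable. */
int read_cache_attr(int cpu, int index, const char *attr, char *buf, size_t len)
{
    char path[256];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cache/index%d/%s", cpu, index, attr);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (!ok)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Print the main attributes of every cache index of one CPU.
   Returns the number of cache indexes found (0 if sysfs is unavailable). */
int print_cpu_caches(int cpu)
{
    static const char *attrs[] = {
        "level", "type", "size", "ways_of_associativity",
        "number_of_sets", "coherency_line_size"
    };
    char buf[64];
    int index = 0;
    while (read_cache_attr(cpu, index, "level", buf, sizeof buf) == 0) {
        printf("index%d:", index);
        for (size_t i = 0; i < sizeof attrs / sizeof attrs[0]; i++) {
            if (read_cache_attr(cpu, index, attrs[i], buf, sizeof buf) == 0)
                printf(" %s=%s", attrs[i], buf);
        }
        printf("\n");
        index++;
    }
    return index;
}
```

Calling `print_cpu_caches(0)` prints one line per cache visible to CPU 0. As noted above, latency is not exposed this way.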
For example, I learned that although AMD advertises their 128K L1 cache, my AMD machine has a split I and D cache of 64K each.
Two suggestions, now mostly obsolete thanks to Jed:
AMD publishes a lot more information about its caches, so you can at least get some information about a modern cache. For example, last year's AMD L1 caches delivered two words per cycle (peak).
The open-source tool valgrind has all sorts of cache models inside it, and it is invaluable for profiling and understanding cache behavior. It comes with a very nice visualization tool, kcachegrind, which is part of the KDE SDK.
For example: in Q3 2008, AMD K8/K10 CPUs use 64-byte cache lines, with split L1I/L1D caches of 64kB each. L1D is 2-way associative and exclusive with L2, with a latency of 3 cycles. L2 cache is 16-way associative and its latency is about 12 cycles.
AMD Bulldozer-family CPUs use a split L1 with a 16kiB 4-way associative L1D per cluster (2 per core).
Intel CPUs have kept L1 the same for a long time (from Pentium M to Haswell to Skylake, and presumably many generations after that): Split 32kB each I and D caches, with L1D being 8-way associative. 64 byte cache lines, matching the burst-transfer size of DDR DRAM. Load-use latency is ~4 cycles.
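To make those numbers concrete, here is a small sketch of the arithmetic they imply: the number of sets is the cache size divided by (ways × line size), and a line's set is chosen by the address bits just above the line offset. The helper names are hypothetical, and the model ignores details such as virtual versus physical indexing.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of sets in a set-associative cache. */
size_t cache_num_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

/* Index of the set that the line containing addr maps to. */
size_t cache_set_index(uintptr_t addr, size_t line_bytes, size_t num_sets)
{
    return (size_t)((addr / line_bytes) % num_sets);
}
```

For a 32kB, 8-way cache with 64-byte lines, `cache_num_sets(32 * 1024, 8, 64)` gives 64 sets, so addresses 4096 bytes apart land in the same set and compete for its 8 ways.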
Also see the x86 tag wiki for links to more performance and microarchitectural data.
This Intel Manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual has a decent discussion of cache considerations.
Page 46, Section 2.2.5.1 Intel® 64 and IA-32 Architectures Optimization Reference Manual
Even MicroSlop is waking up to the need for more tools to monitor cache usage and performance, and has a GetLogicalProcessorInformation() function example (blazing new trails in ridiculously long function names in the process), which I think I'll code up.
UPDATE I: Haswell increases cache load performance 2X, according to "Inside the Tock; Haswell's Architecture".
If there were any doubt how critical it is to make the best possible use of cache, this presentation by Cliff Click, formerly of Azul, should dispel any and all doubt. In his words, "memory is the new disk!".
UPDATE II: SkyLake's significantly improved cache performance specifications.
You are looking at the consumer specifications, not the developer specifications. Here is the documentation you want. The cache sizes vary by processor family sub-models, so they typically are not in the IA-32 development manuals, but you can easily look them up on NewEgg and such.
Edit: More specifically: Chapter 10 of Volume 3A (Systems Programming Guide), Chapter 7 of the Optimization Reference Manual, and potentially something in the TLB page-caching manual, although I would assume that one is further out from the L1 than you care about.
I did some more investigating. There is a group at ETH Zurich who built a memory-performance evaluation tool which might be able to get information about the size at least (and maybe also associativity) of L1 and L2 caches. The program works by trying different read patterns experimentally and measuring the resulting throughput. A simplified version was used for the popular textbook by Bryant and O'Hallaron.
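The measurement idea can be sketched roughly as follows. This is not the ETH tool or the textbook code, only a minimal illustration with made-up function names: read a buffer with a given stride, time it with the standard `clock()`, and report throughput. Sweeping the buffer size and stride and looking for drops in throughput is what hints at cache sizes.

```c
#include <stddef.h>
#include <time.h>

/* Sum every stride-th element of data[0..n). stride must be at least 1. */
long sum_strided(const long *data, size_t n, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < n; i += stride)
        sum += data[i];
    return sum;
}

/* Time `repeats` strided passes over the buffer and return the read
   throughput in MB/s, or 0.0 if the run was too short for clock() to
   resolve. Only the bytes actually summed are counted. */
double read_throughput_mbps(const long *data, size_t n, size_t stride, int repeats)
{
    volatile long sink = 0;                 /* keeps the reads from being optimized away */
    sink += sum_strided(data, n, stride);   /* warm-up pass */

    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        sink += sum_strided(data, n, stride);
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    double elements = (double)((n + stride - 1) / stride);
    double bytes = (double)repeats * elements * (double)sizeof(long);
    return bytes / seconds / (1024.0 * 1024.0);
}
```

A real tool would use a finer timer and control for prefetching and frequency scaling, so treat any numbers from this sketch as indicative only.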
L1 caches exist on these platforms. This will almost definitely remain true until memory and front-side bus speeds exceed the speed of the CPU, which is very likely a long way off.
On Windows, you can use GetLogicalProcessorInformation to get some level of cache information (size, line size, associativity, etc.). The Ex version on Win7 will give even more data, like which cores share which cache. CPU-Z also gives this information.
Locality of reference has a major impact on the performance of some algorithms; the size and speed of the L1, L2 (and, on newer CPUs, L3) caches obviously play a large part in this. Matrix multiplication is one such algorithm.
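As an illustration of the matrix multiplication case, here is a minimal sketch, assuming square row-major matrices of doubles and using hypothetical function names. Both versions compute the same product; the second only reorders the loops so the innermost loop walks B and C along rows, touching consecutive addresses instead of jumping a full row of B on every step.

```c
#include <stddef.h>

/* C = A * B, naive i-j-k order: the inner loop reads B down a column,
   striding n doubles per step, which uses each cache line poorly. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* C = A * B, i-k-j order: the inner loop reads B and writes C along rows,
   so successive iterations touch adjacent memory. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

How much the reordering helps depends on the matrix size relative to the caches; blocking (tiling) is the usual next step for matrices much larger than L2.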
Intel Manual Vol. 2 specifies the following formula to compute cache size:
Cache size in bytes = (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
where Ways, Partitions, Line_Size and Sets are queried using cpuid with eax set to 0x04.
Providing the header file declaration x86_cache_size.h:
The implementation looks as follows:
Which on my machine works as follows:
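The header, implementation and sample output referred to above are missing from this copy. What follows is not the original author's code, only a minimal sketch of the same approach. It assumes a GCC or Clang toolchain, whose `<cpuid.h>` provides `__get_cpuid_max` and `__cpuid_count`; the function names `cache_size_from_leaf4` and `print_x86_cache_sizes` are invented here. The field positions (EBX[31:22] ways, EBX[21:12] partitions, EBX[11:0] line size, ECX sets, each stored minus one) follow the CPUID leaf 04H description in the Intel manual.

```c
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */
#endif

/* Apply (Ways+1)*(Partitions+1)*(Line_Size+1)*(Sets+1) to the raw EBX and
   ECX values returned by CPUID leaf 4 for one cache. */
uint64_t cache_size_from_leaf4(uint32_t ebx, uint32_t ecx)
{
    uint64_t ways       = ((ebx >> 22) & 0x3FF) + 1;
    uint64_t partitions = ((ebx >> 12) & 0x3FF) + 1;
    uint64_t line_size  = (ebx & 0xFFF) + 1;
    uint64_t sets       = (uint64_t)ecx + 1;
    return ways * partitions * line_size * sets;
}

/* Enumerate the caches reported by CPUID leaf 4 and print their sizes.
   Returns the number of caches printed, or -1 if leaf 4 is not available
   (non-x86 build, or a CPU that does not support the leaf). */
int print_x86_cache_sizes(void)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__get_cpuid_max(0, 0) < 4)
        return -1;
    int count = 0;
    for (uint32_t sub = 0; sub < 32; sub++) {
        uint32_t eax, ebx, ecx, edx;
        __cpuid_count(4, sub, eax, ebx, ecx, edx);
        (void)edx;
        uint32_t type = eax & 0x1F;          /* 0 means no more caches */
        if (type == 0)
            break;
        uint32_t level = (eax >> 5) & 0x7;
        printf("L%u %s cache: %llu bytes\n", (unsigned)level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               (unsigned long long)cache_size_from_leaf4(ebx, ecx));
        count++;
    }
    return count;
#else
    return -1;
#endif
}
```

On processors that do not implement leaf 4 in the Intel format, the loop may report nothing, so a zero count does not mean the caches are absent. A 32kB, 8-way cache with 64-byte lines and 64 sets corresponds to EBX = 0x01C0003F and ECX = 63.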