Where is the L1 memory cache of Intel x86 processors documented?

Posted 2024-07-16 12:47:36 · 440 words · 6 views · 0 comments

I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure. In particular, most web sites (including Intel.com) that post processor specs do not include any reference to L1 cache. Is this because the L1 cache does not exist or is this information for some reason considered unimportant? Are there any articles or discussions about the elimination of the L1 cache?

[edit]
After running various tests and diagnostic programs (mostly those discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still haven't found a clear explanation as to why this information is so difficult to come by. My current working theory is that the details of L1 caching are now being treated as trade secrets by Intel.

Comments (7)

伏妖词 2024-07-23 12:47:36

It is nearly impossible to find specs on Intel caches. When I was teaching a class on caches last year, I asked friends inside Intel (in the compiler group) and they couldn't find specs.

But wait!!! Jed, bless his soul, tells us that on Linux systems, you can squeeze lots of information out of the kernel:

grep . /sys/devices/system/cpu/cpu0/cache/index*/*

This will give you associativity, set size, and a bunch of other information (but not latency).
For example, I learned that although AMD advertises their 128K L1 cache, my AMD machine has a split I and D cache of 64K each.
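To pull the same numbers into a program rather than a shell, a minimal C sketch could read the sysfs files directly. The helper names parse_cache_size and read_cache_size are made up for illustration, and the code assumes the "size" file holds a decimal number with an optional K or M suffix (for example 32K), which is what the kernels I have seen emit.

```c
#include <stdio.h>
#include <stdlib.h>

/* Convert a sysfs "size" string such as "32K" or "8M" to bytes.
   Returns 0 if the string does not start with a number.
   Assumption: only the K and M suffixes need handling. */
unsigned long parse_cache_size(const char *text)
{
    char *end = NULL;
    unsigned long value = strtoul(text, &end, 10);

    if (end == text)
        return 0;
    if (*end == 'K')
        value *= 1024UL;
    else if (*end == 'M')
        value *= 1024UL * 1024UL;
    return value;
}

/* Read the size in bytes of cache entry `index` of CPU `cpu` from sysfs.
   Returns 0 if the file is missing or unreadable (e.g. not on Linux). */
unsigned long read_cache_size(int cpu, int index)
{
    char path[128];
    char line[64];
    unsigned long size = 0;
    FILE *f;

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cache/index%d/size", cpu, index);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fgets(line, sizeof line, f) != NULL)
        size = parse_cache_size(line);
    fclose(f);
    return size;
}
```

Calling read_cache_size(0, 0) would usually report the first cache listed for CPU 0, but which index corresponds to L1D, L1I, L2 and so on is up to the kernel, so check the neighbouring "level" and "type" files before trusting the mapping.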


Two suggestions which are now mostly obsolete thanks to Jed:

  • AMD publishes a lot more information about its caches, so you can at least get some information about a modern cache. For example, last year's AMD L1 caches delivered two words per cycle (peak).

  • The open-source tool valgrind has all sorts of cache models inside it, and it is invaluable for profiling and understanding cache behavior. It comes with a very nice visualization tool kcachegrind which is part of the KDE SDK.


For example: as of Q3 2008, AMD K8/K10 CPUs use 64-byte cache lines, with split L1I and L1D caches of 64kB each. L1D is 2-way associative and exclusive with L2, with a latency of 3 cycles. The L2 cache is 16-way associative, with a latency of about 12 cycles.

AMD Bulldozer-family CPUs use a split L1 with a 16kiB 4-way associative L1D per cluster (2 per core).

Intel CPUs have kept L1 the same for a long time (from Pentium M to Haswell to Skylake, and presumably many generations after that): Split 32kB each I and D caches, with L1D being 8-way associative. 64 byte cache lines, matching the burst-transfer size of DDR DRAM. Load-use latency is ~4 cycles.
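To make those numbers concrete, here is a small C sketch of the arithmetic that turns size, associativity and line size into a set count and a set index. The function names are hypothetical, and the index calculation assumes plain modulo indexing on the line number, which is a simplification of what real hardware may do.

```c
#include <stdint.h>

/* Number of sets = total size / (ways * line size). */
unsigned int cache_num_sets(unsigned int size_bytes, unsigned int ways,
                            unsigned int line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

/* Set that an address maps to, assuming simple modulo indexing:
   drop the offset within the line, then wrap around the set count. */
unsigned int cache_set_index(uintptr_t addr, unsigned int line_bytes,
                             unsigned int num_sets)
{
    return (unsigned int)((addr / line_bytes) % num_sets);
}
```

With the Intel figures above (32kB, 8-way, 64-byte lines) this gives 64 sets, so under this model addresses exactly 4096 bytes apart land in the same set and only 8 of them can be resident in L1D at once. The AMD K8/K10 figures (64kB, 2-way, 64-byte lines) give 512 sets.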

Also see the tag wiki for links to more performance and microarchitectural data.

木緿 2024-07-23 12:47:36

This Intel Manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual has a decent discussion of cache considerations.

Page 46, Section 2.2.5.1 Intel® 64 and IA-32 Architectures Optimization Reference Manual

Even MicroSlop is waking up to the need for more tools to monitor cache usage and performance, and provides a GetLogicalProcessorInformation() function example (blazing new trails in ridiculously long function names in the process), which I think I'll code up.

UPDATE I: Haswell doubles cache load performance, according to "Inside the Tock; Haswell's Architecture".

If there were any doubt how critical it is to make the best possible use of cache, this presentation by Cliff Click, formerly of Azul, should dispel any and all doubt. In his words, "memory is the new disk!".

Haswell’s URS (Unified Reservation Station)

UPDATE II: Skylake significantly improves the cache performance specifications.

SkyLake Cache Specifications

哀由 2024-07-23 12:47:36

You are looking at the consumer specifications, not the developer specifications. Here is the documentation you want. The cache sizes vary by processor family sub-models, so they typically are not in the IA-32 development manuals, but you can easily look them up on NewEgg and such.

Edit: More specifically: Chapter 10 of Volume 3A (Systems Programming Guide), Chapter 7 of the Optimization Reference Manual, and potentially something in the TLB page-caching manual, although I would assume that one is further out from the L1 than you care about.

恏ㄋ傷疤忘ㄋ疼 2024-07-23 12:47:36

I did some more investigating. There is a group at ETH Zurich who built a memory-performance evaluation tool which might be able to get information about the size at least (and maybe also associativity) of L1 and L2 caches. The program works by trying different read patterns experimentally and measuring the resulting throughput. A simplified version was used for the popular textbook by Bryant and O'Hallaron.
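The measuring idea can be sketched in a few lines of C. This is not the ETH Zurich tool or the textbook's code, just a rough illustration under my own assumptions: the names sweep and read_throughput_mbs are invented, timing uses the coarse standard clock() call, and nothing is done about prefetchers or frequency scaling.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Sum every `stride`-th element of `data`, repeated `passes` times.
   The pointer is volatile so the compiler has to perform every read. */
long sweep(const volatile long *data, size_t count, size_t stride, int passes)
{
    long sum = 0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < count; i += stride)
            sum += data[i];
    return sum;
}

/* Read throughput in MB/s for a working set of `bytes` bytes.
   Returns a negative value on bad arguments or allocation failure,
   and 0 if the run was too short for clock() to resolve. */
double read_throughput_mbs(size_t bytes, size_t stride, int passes)
{
    size_t count = bytes / sizeof(long);
    if (count == 0 || stride == 0)
        return -1.0;

    long *data = malloc(count * sizeof(long));
    if (data == NULL)
        return -1.0;
    for (size_t i = 0; i < count; i++)
        data[i] = (long)i;

    (void)sweep(data, count, stride, 1);        /* warm the cache */
    clock_t t0 = clock();
    (void)sweep(data, count, stride, passes);   /* timed run */
    clock_t t1 = clock();
    free(data);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    double touched = (double)passes
                   * (double)((count + stride - 1) / stride)
                   * (double)sizeof(long);
    return touched / secs / 1e6;
}
```

Running read_throughput_mbs over working sets from a few kB up to tens of MB, with enough passes that each run takes a measurable time, should show throughput dropping in steps near each cache capacity. Treat the step positions as estimates only.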

南薇 2024-07-23 12:47:36

L1 caches exist on these platforms. This will almost definitely remain true until memory and front-side bus speeds exceed the speed of the CPU, which is very likely a long way off.

On Windows, you can use GetLogicalProcessorInformation to get some level of cache information (size, line size, associativity, etc.). The Ex version on Win7 gives even more data, such as which cores share which cache. CPU-Z also gives this information.

后来的我们 2024-07-23 12:47:36

Locality of reference has a major impact on the performance of some algorithms; the size and speed of the L1, L2 (and, on newer CPUs, L3) caches obviously play a large part in this. Matrix multiplication is one such algorithm.
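As an illustration of the matrix multiplication case, the two C routines below compute the same product and differ only in loop order. The names matmul_ijk and matmul_ikj are just labels for this sketch, and the matrices are assumed to be square, row-major arrays of double with C zeroed by the caller.

```c
#include <stddef.h>

/* C += A * B, textbook i-j-k order.
   The inner loop walks B down a column, a stride of n doubles per step,
   so for large n nearly every access to B touches a new cache line. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Same product in i-k-j order.
   The inner loop walks B and C along a row, so consecutive accesses
   fall in the same cache line and locality is much better. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

The results are identical; how much faster the second order runs depends on n relative to the cache sizes, so it is worth timing both on the machine you care about.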

浅黛梨妆こ 2024-07-23 12:47:36

Intel Manual Vol. 2 specifies the following formula to compute cache size:

This Cache Size in Bytes

= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)

= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)

Where the Ways, Partitions, Line_Size and Sets are queried using cpuid with eax set to 0x04 and ecx set to the index of the cache to describe.

Given the header file declaration

x86_cache_size.h:

unsigned int get_cache_size(unsigned int cache_index);

The implementation (NASM syntax, System V x86-64 calling convention) looks as follows:

;1st argument (edi) - CPUID leaf 4 sub-leaf index, not the cache level
;(typically 0 = L1D, 1 = L1I, 2 = L2, 3 = L3 on Intel parts)
global get_cache_size
section .text
get_cache_size:
    push rbx            ;rbx is callee-saved and cpuid overwrites ebx
    ;sub-leaf index to be used with the CPUID instruction
    mov ecx, edi
    ;leaf 4: deterministic cache parameters
    mov eax, 0x04
    cpuid

    ;cache line size: EBX[11:0] + 1
    mov eax, ebx
    and eax, 0xfff
    inc eax

    ;partitions: EBX[21:12] + 1
    shr ebx, 12
    mov edx, ebx
    and edx, 0x3ff
    inc edx
    mul edx

    ;ways of associativity: EBX[31:22] + 1
    shr ebx, 10
    mov edx, ebx
    and edx, 0x3ff
    inc edx
    mul edx

    ;number of sets: ECX + 1
    inc ecx
    mul ecx

    pop rbx

    ret

Which on my machine works as follows:

#include <stdio.h>
#include "x86_cache_size.h"

int main(void){
    //index 1 is usually the L1 instruction cache; index 0 would be L1 data
    unsigned int L1_cache_size = get_cache_size(1);
    unsigned int L2_cache_size = get_cache_size(2);
    unsigned int L3_cache_size = get_cache_size(3);
    //L1 size = 32768, L2 size = 262144, L3 size = 8388608
    printf("L1 size = %u, L2 size = %u, L3 size = %u\n", L1_cache_size, L2_cache_size, L3_cache_size);
    return 0;
}
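
As a cross-check on the formula above, the bit-field decoding can also be sketched in plain C. The function names are invented for this example. The second function relies on __get_cpuid_count from the <cpuid.h> header shipped with GCC and Clang, so it is guarded to x86 builds with those compilers; other toolchains would need their own intrinsic.

```c
#include <stdint.h>

/* Decode the CPUID leaf 4 size formula from raw EBX and ECX values. */
uint64_t cache_size_from_regs(uint32_t ebx, uint32_t ecx)
{
    uint64_t line_size  = (ebx & 0xFFFu) + 1;           /* EBX[11:0]  + 1 */
    uint64_t partitions = ((ebx >> 12) & 0x3FFu) + 1;   /* EBX[21:12] + 1 */
    uint64_t ways       = ((ebx >> 22) & 0x3FFu) + 1;   /* EBX[31:22] + 1 */
    uint64_t sets       = (uint64_t)ecx + 1;            /* ECX        + 1 */
    return ways * partitions * line_size * sets;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Query sub-leaf `index` of CPUID leaf 4.
   Returns 0 if the leaf is unsupported or the entry describes no cache. */
uint64_t cache_size_by_index(uint32_t index)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid_count(4, index, &eax, &ebx, &ecx, &edx))
        return 0;
    if ((eax & 0x1Fu) == 0)     /* EAX[4:0] == 0: no more caches */
        return 0;
    return cache_size_from_regs(ebx, ecx);
}
#endif
```

For an 8-way cache with 64-byte lines, one partition and 64 sets, EBX is 0x01C0003F and ECX is 63, which decodes to 32768 bytes and matches the L1 figure printed above. Checking EAX[4:0] also avoids reporting a bogus size for an index past the last cache.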