Does a larger binary, containing parts of code that are not executed at the time, affect the use of level-2 CPU cache?
It appears that CPUs run significantly faster if their L2 is not filled. Will a programmer be better off to code something that will eventually be smaller in binary, even if parts of that code are not executed all the time? Say, parts of code that are only turned on in a config file.
The truth is somewhat more complex, I'll try to outline it for you.
If you look at the memory hierarchy in a modern PC with a multi-core processor you will find that there are six levels:

1. The L1 caches (on an AMD K10) for every core (latency say three clks)
2. The L2 cache for every core (latency say 10)
3. The L3 cache shared by all cores (latency say 30)
4. System RAM (latency say 100)
5. Devices (latency at least 300 cycles, up to 1 us if an old PCI card is using all 32 clocks available when bus-mastering with clocking at 33 MHz - on a 3 GHz processor that means 3000 clock cycles)
6. Synchronization (see below)
Don't take the cycle counts as exact; they're meant to give you a feel for the possible penalties incurred when executing code.
I use synchronization as a memory level because sometimes you need to synchronize memory too and that costs time.
The language you use will have a great impact on performance. A program written in C, C++ or Fortran will be smaller and execute faster than one written for an interpreter or a byte-code virtual machine, such as Basic, C# or Java. C and Fortran will also give you better control when organizing your data areas and how the program accesses them. Certain features of OO languages (C++, C# and Java), such as encapsulation and the use of standard classes, will result in larger generated code.
How code is written also has a great impact on performance - though some uninformed individuals will say that compilers are so good these days that it isn't necessary to write good source code. Great code will mean great performance and Garbage In will always result in Garbage Out.
In the context of your question writing small is usually better for performance than not caring. If you are used to coding efficiently (small/fast code) then you'll do it regardless of whether you're writing seldom- or often-used sequences.
The cache will most likely not have your entire program loaded (though it might) but rather numerous 32- or 64-byte chunks ("cache lines") fetched from 32- or 64-byte-aligned addresses in your code and data. The more often the information in one of these chunks is accessed, the longer it will keep the cache line it is sitting in. If the core wants a chunk that's not in L1 it will search for it all the way down to RAM if necessary, incurring penalty clock cycles while doing so.
So in general small, tight and inline code sequences will execute faster because they impact the cache(s) less. Code that makes a lot of calls to other code areas will have a greater impact on the cache, as will code with unoptimized jumps. Divisions are extremely detrimental, but only to the execution of the core in question. Apparently AMD is much better at them than Intel (http://gmplib.org/~tege/x86-timing.pdf).
There is also the issue of data organization. Here it is also better to have often-used data residing in a physically small area, such that one cache line fetch will bring in several often-used variables instead of just one per fetch (which is the norm).
When accessing arrays of data or data structures try to make sure that you access them from lower to higher memory addresses. Again, accessing all over the place will have a negative impact on the caches.
Finally there is the technique of giving data pre-fetch hints to the processor so that it may direct the caches to begin fetching data as far as possible before the data will actually be used.
To have a reasonable chance of understanding these things so that you may put them to use at a practical level, it will be necessary for you to test different constructs and time them, preferably with the rdtsc counter (lots of info about it on Stack Overflow) or by using a profiler.