Designing code to fit in the CPU cache?

Posted 2024-08-13 04:35:20

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.

If you know of any good links explaining CPU caching, then point me in that direction.

Comments (7)

乄_柒ぐ汐 2024-08-20 04:35:21

Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.

久伴你 2024-08-20 04:35:21

If I were you, I would make sure I know which parts of code are hotspots, which I define as

  • a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
  • that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)

If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.

孤独陪着我 2024-08-20 04:35:20

At least with a typical desktop CPU, you can't really specify much about cache usage directly. You can still try to write cache-friendly code though. On the code side, this often means unrolling loops (for just one obvious example) is rarely useful -- it expands the code, and a modern CPU typically minimizes the overhead of looping. You can generally do more on the data side, to improve locality of reference, protect against false sharing (e.g. two frequently-used pieces of data that will try to use the same part of the cache, while other parts remain unused).
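On the "load the variables into cache" part of the question: you can't demand residency, but GCC and Clang do expose an advisory prefetch hint (MSVC has _mm_prefetch). A minimal sketch, with the lookahead distance being nothing more than a guess you'd have to tune; for simple linear scans the hardware prefetcher usually does this on its own, so measure before keeping it:

#include <cstddef>
#include <vector>

// Sum an array while hinting that data several elements ahead should be
// pulled toward the cache.  __builtin_prefetch is advisory only.
double sum_with_prefetch(const std::vector<double>& v) {
    constexpr std::size_t lookahead = 16;   // tuning guess, in elements
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i + lookahead < v.size())
            __builtin_prefetch(&v[i + lookahead], 0 /*read*/, 3 /*high locality*/);
        total += v[i];
    }
    return total;
}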

Edit (to make some points a bit more explicit):

A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)

In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).

In most cases, a cache is organized as a set of "lines". The contents of a cache are normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
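A quick way to see the line-at-a-time behaviour is to walk the same matrix in two different orders. This is a generic C++ illustration (not from the answer): both functions do the same arithmetic, but the column-order walk uses only one element out of every cache line it drags in.

#include <cstddef>
#include <vector>

// N x N matrix of doubles stored row-major in one contiguous block.

// Stride-1 walk: consecutive addresses, so every byte of each fetched
// cache line gets used.
double sum_row_order(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            s += m[r * n + c];
    return s;
}

// Stride-n walk: each access lands on a different cache line (for large n),
// so only one double out of every line fetched is actually used.
double sum_column_order(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            s += m[r * n + c];
    return s;
}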

This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.

This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input into progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.

This also has implications for how you write code. If you have a loop like:

for i = 0 to whatever
   step1(data);
   step2(data);
   step3(data);
end for

You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):

for i = 0 to whatever
    step1(data);
    step2(data);
end for

for i = 0 to whatever
    step3(data);
end for
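A concrete C++ sketch of that split (usually called loop fission); the Particle type and the three step functions are placeholders of my own, not anything from the question:

#include <vector>

// Placeholder type and steps; imagine step3 compiling to far more code
// than step1 and step2 combined.
struct Particle { double x = 0, y = 0, vx = 0, vy = 0; };

inline void step1(Particle& p) { p.vx *= 0.99; }
inline void step2(Particle& p) { p.vy *= 0.99; }
inline void step3(Particle& p) { p.x += p.vx; p.y += p.vy; }

// Fused loop: fine as long as all three steps fit in the instruction cache
// at the same time.
void simulate_fused(std::vector<Particle>& ps) {
    for (auto& p : ps) { step1(p); step2(p); step3(p); }
}

// Split loops (fission): each pass runs a smaller body, at the cost of
// streaming the Particle data through the data cache twice.
void simulate_split(std::vector<Particle>& ps) {
    for (auto& p : ps) { step1(p); step2(p); }
    for (auto& p : ps) { step3(p); }
}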

Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache unfriendly. My own experience is that in synthetic benchmarks that tend to do really small amounts of processing on really large amounts of data, you gain a lot from loop unrolling. In more practical code where you tend to do more processing on an individual piece of data, you gain a lot less--and overflowing the cache leading to a serious performance loss isn't particularly rare at all.
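For reference, a manually unrolled loop looks roughly like this; whether it wins or loses is exactly the trade-off described above, so benchmark both forms on your own data:

#include <cstddef>

// Body repeated four times per iteration: fewer compare/branch/increment
// instructions per element, but a larger code footprint.
double sum_unrolled(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];   // leftover elements
    return (s0 + s1) + (s2 + s3);
}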

The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
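To make the pointer-overhead point concrete (generic example, 64-bit pointers assumed): a doubly-linked node carrying one double spends twice as many bytes on links as on payload, while a contiguous array spends none and uses every byte of each line it fetches.

#include <list>
#include <vector>

// Each std::list node carries two 8-byte link pointers (plus allocator
// overhead) next to its 8-byte payload, and nodes can be scattered, so a
// traversal may take a cache miss on nearly every element.
double sum_list(const std::list<double>& xs) {
    double s = 0.0;
    for (double x : xs) s += x;
    return s;
}

// The vector stores only payload, contiguously: a 64-byte line holds
// eight doubles and a linear scan uses all of them.
double sum_vector(const std::vector<double>& xs) {
    double s = 0.0;
    for (double x : xs) s += x;
    return s;
}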

In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory on tiny processors that have been (mostly) obsolete for decades work out pretty well on modern processors. The intent is now to fit more data into the cache instead of main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache any more, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads and still come out ahead.
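A sketch of that trade, with made-up types: a bounded quantity stored as a scaled 8-bit value costs a multiply to unpack, but packs eight times as many elements into each cache line as a double would.

#include <cstdint>
#include <vector>

// A probability in [0,1] quantized to 1/255 steps: 1 byte instead of 8.
// The multiply on unpack is cheap next to a cache or memory miss.
inline double unpack(std::uint8_t q) { return q * (1.0 / 255.0); }
inline std::uint8_t pack(double p)   { return static_cast<std::uint8_t>(p * 255.0 + 0.5); }

double average(const std::vector<std::uint8_t>& packed) {
    double s = 0.0;
    for (std::uint8_t q : packed) s += unpack(q);
    return packed.empty() ? 0.0 : s / static_cast<double>(packed.size());
}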

Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.

Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.
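The usual remedy for false sharing is to pad or align per-thread data so no two threads' hot variables share a line. A minimal C++17 sketch, assuming the common 64-byte line size (std::hardware_destructive_interference_size can be used instead where the library provides it):

#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Falsely shared: eight adjacent counters easily fit in one 64-byte line,
// so cores bumping *different* counters still fight over that line.
struct CountersBad {
    std::atomic<long> c[8];
};

// Padded/aligned: each counter owns a whole line, so no ping-pong.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

void bump(std::vector<PaddedCounter>& counters, std::size_t idx, long iters) {
    for (long i = 0; i < iters; ++i)
        counters[idx].value.fetch_add(1, std::memory_order_relaxed);
}

// Usage sketch: one padded counter per worker thread.
void run_two_workers() {
    std::vector<PaddedCounter> counters(2);
    std::thread a(bump, std::ref(counters), std::size_t{0}, 1000000L);
    std::thread b(bump, std::ref(counters), std::size_t{1}, 1000000L);
    a.join();
    b.join();
}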

樱桃奶球 2024-08-20 04:35:20

Here's a link to a really good paper on caches/memory optimization by Christer Ericson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.

只是偏爱你 2024-08-20 04:35:20

A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessy covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.

I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.

在梵高的星空下 2024-08-20 04:35:20

UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.

A Crash Course In Modern Hardware by Cliff Click @ Azul

--- we now return you to your regularly scheduled program ---

Sometimes an example is better than a description of how to do something. In that spirit, here's a particularly successful example of how I changed some code to make better use of on-chip caches. This was done some time ago on a 486 CPU and later migrated to a 1st-generation Pentium CPU. The effect on performance was similar.

Example: Subscript Mapping

Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.

I had a double float vector that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values, but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline (thus the long tails, which could serve as subscripts for the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot-spot" in the code.

I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints all had the same value everywhere more than 128 below zero, and again everywhere more than 128 above zero, so except for the middle 256 values they all pointed to either the first or the last value in the double vector.

This shrunk the storage requirement to 2k for the doubles and 1,250 bytes for the 8-bit subscripts, cutting 10,000 bytes down to 3,298. Since the program spent 90% or more of its time in this inner loop, the 2 vectors never got pushed out of the 8k data cache. The program immediately doubled its performance. This code got hit ~100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
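A speculative reconstruction of that scheme, just to make the shape of it concrete - the names, the 256-entry window, and the position of the "interesting" region (LO below) are my guesses, not the original code:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Speculative sketch of the subscript-mapping trick.  'curve' keeps only
// the 256 interesting values; 'remap' translates any raw subscript
// (0..RAW_LEN-1) into an index into 'curve', collapsing the long tails
// onto the first/last entries.  Storage: 256*8 + 1250*1 = 3,298 bytes.
constexpr std::size_t RAW_LEN   = 1250;  // original curve length
constexpr std::size_t CURVE_LEN = 256;   // compressed curve length
constexpr std::size_t LO        = 500;   // assumed start of the interesting region

std::vector<std::uint8_t> build_remap() {
    std::vector<std::uint8_t> remap(RAW_LEN);
    for (std::size_t i = 0; i < RAW_LEN; ++i) {
        if (i < LO)
            remap[i] = 0;                                        // left tail
        else if (i >= LO + CURVE_LEN)
            remap[i] = static_cast<std::uint8_t>(CURVE_LEN - 1); // right tail
        else
            remap[i] = static_cast<std::uint8_t>(i - LO);        // interesting part
    }
    return remap;
}

// Hot-loop lookup: no branches, just two reads from small, cache-resident tables.
inline double lookup(const std::vector<std::uint8_t>& remap,
                     const std::array<double, CURVE_LEN>& curve,
                     std::size_t raw_index) {
    return curve[remap[raw_index]];
}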

Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.

I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache(s). This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. That is: you don't want instructions hogging the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80s, that's pretty much always been true.

Another favorite technique I use to favor the data cache, and savage the instruction cache, is to use a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. Masking off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, etc., requires the CPU to use masks and shifts on the host integers the bits live in; this shrinks the data, effectively increasing cache and bus capacity, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.
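A generic sketch of the bit-int idea: a date packed into one 32-bit word with bit-fields, paid for by the mask-and-shift instructions the compiler emits on every access (exact layout is implementation-defined, so the static_assert is only an expectation):

#include <cstdint>

// 4 bits of month (1-12), 9 bits of day-of-year (1-366), 12 bits of year
// offset: 25 bits of payload in one 32-bit word instead of three full ints.
// Every read pays a mask/shift, which is the trade described above.
struct PackedDate {
    std::uint32_t month       : 4;
    std::uint32_t day_of_year : 9;
    std::uint32_t year_off    : 12;   // years since some chosen epoch
};

static_assert(sizeof(PackedDate) == 4, "expected the fields to share one 32-bit word");

inline int month_of(const PackedDate& d) { return static_cast<int>(d.month); }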

夜司空 2024-08-20 04:35:20

Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone - the introduction of dedicated bit manipulation instructions in the new Intel Haswell microprocessor.

It became painfully obvious, when I wrote some code here on StackOverflow to reverse the bits in a 4096-bit array, that 30+ years after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and I hope that will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.

Haswell's new Bit Manipulation Instructions
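Until then, the practical workaround is to pack the bits yourself, e.g. with std::bitset (or std::vector<bool>), trading a shift-and-mask per access for an 8x denser footprint. A naive sketch along the lines of that bit-reversal exercise:

#include <bitset>
#include <cstddef>

// 4096 flags in 512 bytes (eight 64-byte lines) instead of 4096 bytes of bool.
// Reversing them swaps one pair of bits per step; each access costs a shift
// and a mask, but the whole structure stays cache resident.
void reverse_bits(std::bitset<4096>& bits) {
    for (std::size_t i = 0, j = bits.size() - 1; i < j; ++i, --j) {
        bool tmp = bits[i];
        bits[i] = bits[j];
        bits[j] = tmp;
    }
}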

UPDATE: 12/29/2013

I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. A timer fires every millisecond; it adds in the sum of the most recent slice's resource requests and subtracts out the 1,000th time slice's requests, i.e. the resource requests that are now 1,000 milliseconds old.

The Head and Tail vectors were right next to each other in memory, except when first the Head, and then the Tail, wrapped and started back at the beginning of the array. The (rolling) Summary slice, however, was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.

Thinking about this and studying the code, a few particulars caught my attention.

  1. The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.

  2. When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect

  3. The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
    The Head overwrote the Tail, thereby occupying the same memory location
    The new Tail occupied the next 512 memory locations, or wrapped

  4. The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.

  5. Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.

I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.

In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.

Leading the way to the end of the ring(really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. IE: I only needed a conditional test for 1:3 of the pointers to correctly wrap them.

The initial design had used byte ints to maximize cache usage; however, I was able to relax this constraint - satisfying the user's request to handle higher resource counts per user per millisecond - and use unsigned shorts while STILL doubling performance, because even with 3 adjacent vectors of 512 unsigned shorts, the 32K L1 data cache could easily hold the required 3,720 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped was one of the 3 separated by any significant "step" in the 8MB L3 cache.

The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.
