如何确定矩阵循环的行/列步幅以获得这些缓存命中率？

发布于 2025-01-12 03:51:51 字数 735 浏览 2 评论 0原文

给定 CPU：
L1缓存：4路，块大小= 32字节，缓存大小= 64KB，LRU（缓存替换策略）。
L2缓存：2路，块大小= 32字节，缓存大小= 512KB，LRU（缓存替换策略）。

数组：int32_t A[2048][128]; 因此元素大小为 4 个字节。

for(int k=0, k<50, k++)
   for(int i=0; i<2048; i+=Di)
       for(int j=0; j<128; j+=Dj)  
           S += A[i][j]^k;

对于常量 Di 和 Dj 的哪些值，我可以获得 L1 命中率 = 75% 和 L2 命中率大于 90%？

我尝试执行以下操作：
我知道对于 L1 缓存，地址是：offset = 5 位，set = 9 位。
因此，在 2^14 B = 2^12 元素之后，我们到达相同的集合（但以其他方式）。
在 L2 缓存中：offset = 5 位，set = 13 位。
因此，在 2^18 B = 2^16 个元素之后，我们到达相同的集合（但以其他方式）。因为我们希望 L1 命中率 = 3/4 。所以我们将采用Dj=2。因此我们将得到任意 4 个元素：未击中、击中、击中、击中。（L1 的任何块都可以包含 8 个元素）。

但我不知道如何确定 Di 的值。

原文

Given CPU with:
L1 cache: 4-ways, block size = 32 bytes , cache size = 64KB , LRU (Cache replacement policy).
L2 cache: 2-ways, block size = 32 bytes , cache size = 512KB , LRU (Cache replacement policy).

Array: int32_t A[2048][128]; so element size is 4 bytes.

for(int k=0, k<50, k++)
   for(int i=0; i<2048; i+=Di)
       for(int j=0; j<128; j+=Dj)  
           S += A[i][j]^k;

For which values of the constants Di and Dj can I get L1 hit rate = 75% and L2 hit rate bigger than 90%?

I tried to do the following:
I know that for L1 cache the address is: offset = 5 bits, set = 9 bits.
And hence, after 2^14 B = 2^12 elements we arrive to same set (but in other way).
and in L2 cache: offset = 5 bits, set = 13 bits.
And hence, after 2^18 B = 2^16 elements we arrive to same set (but in other way).
And because that we want L1 hit rate = 3/4 . So we will take Dj=2. And so we will get for any 4 elements: miss, hit, hit, hit. (any block of L1 can contain 8 elements).

But I don't know how to decide value for Di.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

﹏雨一样淡蓝的深情 2025-01-19 03:51:51

理论背景

为了决定我们想要读取的内存行/列（步长）以获得所需的缓存命中/未命中率，我们需要了解 L1 和 L2 缓存如何工作以及输入参数的含义是什么我们是针对这个问题给出的。在这个问题中，我们已经有了最终的计算结果，但我将快速解释我们是如何得出这个数字的：

缓存是按块和集合组织的。每个集合可以包含n个块，因此我们将其称为“n路集合关联”缓存。在示例中，我们有 4 路和 2 路。块中字节的地址构造如下：

Cache address = INDEX[0:N] + OFFSET[0:M]

其中 INDEX 表示集合的索引，有 N 位，偏移量由块的大小确定，有 M 位。

从L1缓存开始，我们说这个缓存是“4路”的。这意味着缓存被组织成 32 字节的组，每个组可以映射到与其关联的 4 个不同的 L2 块。由于 L1 有 64KB，每个块都是 32 位，因此我们将有 2,048 个块 (64K/32)，或每组 512 个块 (2,048/4)。集合的地址由索引确定，因此我们的 L1 索引有 9 位（N = 9）。块中的字节由偏移量决定，由于我们有 32 个字节 (2^5)，因此我们将有 5 位 (M=5) 用于偏移量。

L2 缓存是“2 路”，512KB，32 位块，因此块数将为 16384 (512KB/32)，或 8,192 组 (16,384/2)，因此 N=13。由于 L2 块是 32 字节，因此偏移量将为 5 位 (M=5)。

解决方案

首先，我们假设缓存专用于 A 数组，即变量 k、i、j< /code> 和 S 存储在寄存器中，因此对它们的读写不会引起内存操作。

当执行指令 S += A[i][j]^k; 时，处理器需要从 i< 上的 A 中获取元素/code> 行和 j 列。假设我们刚开始执行，缓存是空的，因此当我们尝试访问A[i][j]时，CPU会尝试从L1加载值。由于 L1 是空的，因此 L1 将会“未命中”，这将导致 CPU 尝试从 L2 加载数据。 L2 也是空的，因此 L2 也“未命中”，CPU 会将整个块（32 字节）从内存移动到 L2，然后移动到 L1。

由于“A”是 32 位（4 字节），因此该块包含 8 个元素。如果后续读取都来自同一个块，则该块将已经在 L1 缓存中，因此我们将“命中”。这将为我们提供 7/8 的命中率 (87.5%)，这超出了我们的要求，因此为了降低比率并使其达到 3/4 (75%)，我们需要确保从 A 读取每个第二个元素（假设 A 与块大小边界对齐）。为此，我们需要选择 Dj 为 2。

为了确保保留 75%，我们需要确保 L1 将被覆盖，否则在下一个周期我们将找到这些值在缓存中，这将导致我们在后续周期中的命中率为 100%。只要我们确保在循环中读取超过 2,048 个块，这就不会成为问题，这意味着我们需要读取超过 8,192 (16,384/2) 个元素。为了确保这一点，我们将使用Di。

因此，根据我们目前所知，我们有：

   (max(i)/ Di) * (max(j)/Dj) * size(A[0][0]) / block size  > 2048   :>
   ((2048 * 128/2 * 4) / 32) / Di > 2048                             :> 
   16384 / Di > 2048                                                 :>
   Di < 16384/2048                                                   :>
   Di < 8                                                            :>
   1 <= Di < 8

这意味着 Di 需要是 1 到 7 之间的数字，以确保 L1 命中率保持在 75%。

此外，Di数的选择应保证L2命中率大于90%。

正如我们提到的，当 L1 为“未命中”时，CPU 将尝试从 L2 获取数据，正如我们之前所看到的，只有当 L1 位于块的第一个元素时才会发生这种情况。由于我们要求L2命中率> 90%，我们尽量让它接近100%。 L2 可以容纳 131,072 个元素，因此我们尝试在第一次运行时用元素填充缓存，然后在随后的 49 次运行（循环 k）中，我们将为每个元素命中缓存。这将为我们提供大约 98% 的命中率（超过 90%）。考虑到我们将从缓存中读取 128/Dj 元素，并且我们想要填充所有 13,1072 个元素，我们有以下约束：

    max(i) / Di = 131072 / max(j) / Dj  :>
    max(i) / Di = 131072 / 128 / 2      :>
    2048 / Di = 2048                    :>
    Di = 1

因此，如果我们选择 Di如果为 1，Dj 为 2，则 L1 上的缓存命中率为 75%，L2 上的命中率为 98%（这是所需的 90% 以上）。

事实上，由于外部循环（循环 k），任何小于 7 的 Di 数字都将为我们提供 98% 的缓存命中率，因为它将确保我们读取超过 2,048 个元素且小于或等于 131,072 个元素。

对真实 CPU 的更正

需要理解的重要一点是，当我们从包含数据的内存段中的数组“A”中获取数据时，我们还需要从内存中的代码段中获取指令。考虑到这一点，在真实的处理器中，对于为指令获取的数据量，高速缓存命中率将较低。

为了解决这个问题，我们假设将为代码保留一组块（一个“路”）。根据这一假设，我们的命中率可以高于所需的比率，但我们将确保该比率不低于所需的比率。

有了这个假设，我们将有 L1 而不是 2,048 个块和 512 个集合，我们将有 1,536 个组织在 512 个集合（3 路缓存）中，可用于“A”，而对于 L2，我们将有 1,536 个块，而不是组织在 8,192 个集合中的 16,384 个块，我们将有 8,192 个块组织在 8,192 个集合（或直接映射的缓存）中，可用于“A”。校正后，我们可以使用与上述相同的方法来查找 Di 和 Dj 的值。

校正后，Dj 的值不会改变，仍为 2，但 Di 将在 1 到 5 之间。

Theoretical background

To decide the rows/columns (strides) of memory that we want to read in order to get the desired cache hit/miss ratio, we need to understand how the L1 and L2 cache work and what is the meaning of the input parameters we are given for this problem. In the question we already have the final calculation, but I will quickly explain how we had arrived to this numbers:

The cache is organised in blocks and sets. Each set can contain n number of blocks, so we call this "n-way set associative" cache. In the example we have 4-way and 2-way. The address of a byte in the block is constructed as follows:

Cache address = INDEX[0:N] + OFFSET[0:M]

where INDEX is representing the index of the set and has N bits and offset is determined by the size of the block and has M bits.

Starting from L1 cache, we are saying that this cache is "4-way". This means that the cache is organised in to sets of 32 bytes and each set can have a mapping to 4 different L2 blocks associated with it. Since we have 64KB for the L1, and each block is 32bit, we will have 2,048 blocks (64K/32), or 512 blocks per set (2,048/4). The address of a set is determined by the index, so our L1 index has 9 bits (N = 9). The byte in the block is determined by the offset, and since we have 32 bytes (2^5) we will have 5 bits (M=5) for the offset.

L2 cache is "2-way" with 512KB with 32bit blocks, so number of blocks will be 16384 (512KB/32), or 8,192 sets (16,384/2) therefore N=13. Since the L2 block is 32 bytes the offset will be 5 bits (M=5).

Solution

For a start we will assume that the cache is dedicated for the A array, meaning the variable k, i, j and S are stored in registers, so read/write for them will not cause memory operations.

When the instruction S += A[i][j]^k; is executed, the processor will need to fetch the element from the A on i row and j column. Assuming that we just started the execution, the cache is empty, therefore when we try to access A[i][j], the CPU will try to load the value from L1. Since L1 is empty, we will have "miss" for L1 and which will cause the CPU to try to load the data from L2. L2 is also empty, so we have "miss" for L2 as well and the CPU will move the whole block (32 bytes) from the memory to L2 and then to L1.

Since "A" is 32 bit (4 bytes), the block contains 8 elements. If the subsequent reads are all from the same block, the block will already be in the L1 cache, so we will have "hit". This will give us 7/8 hit ratio (87.5%) which is more than what we are looking for, so to lower the ration and to make it 3/4 (75%) we need to assure we read each second element from A (assuming A is aligned to the block size boundary). To do so, we need to select Dj to be 2.

To assure we keep the 75%, we need to make sure that the L1 will be overwritten, otherwise on the next cycle we will find the values in the cache which will cause our hit ratio to be 100% on the subsequent cycles. This will not be a problem as long as we assure that we are reading more than 2,048 blocks in the cycle, which means that we need to read more than 8,192 (16,384/2) elements. And to assure this we will use Di.

So, from what we know at this moment, we have:

   (max(i)/ Di) * (max(j)/Dj) * size(A[0][0]) / block size  > 2048   :>
   ((2048 * 128/2 * 4) / 32) / Di > 2048                             :> 
   16384 / Di > 2048                                                 :>
   Di < 16384/2048                                                   :>
   Di < 8                                                            :>
   1 <= Di < 8

This means that Di needs to be a number between 1 and 7 to assure L1 hit ratio will remain 75%.

In addition, the Di number should be selected in a way to that will assure the L2 hit ratio is more than 90%.

As we mentioned that when L1 is "miss", the CPU will try to fetch the data from L2, and as we saw before, this will happen only when L1 is at the first element of the block. Since we have the requirement for the L2 hit ratio to be > 90%, let's try to make it as close as possible to 100%. The L2 can fit 131,072 elements, so let's try to fill the cache with elements on the first run and then on the subsequent 49 runs (the loop k) we will have cache hit for each element. This will give us around 98% hit ratio (which is more than 90%). Considering that we will read 128/Dj elements from the cache, and we want to fill all 13,1072 elements, we have the following constraint:

    max(i) / Di = 131072 / max(j) / Dj  :>
    max(i) / Di = 131072 / 128 / 2      :>
    2048 / Di = 2048                    :>
    Di = 1

Therefore, if we select Di to be 1 and Dj to be 2, we will have 75% cache hit on L1 and we will have 98% hit on L2 (which is more than 90% required).

In fact, thanks to the outer loop (the loop k), any number for Di that is less than 7 will give us the 98% of cache hit ratio since it will assure that we read more than 2,048 elements and less or equal to 131,072 elements.

Corrections for a real CPU

Important thing to understand is that, while we are fetching data from the array "A" from the memory segment containing the data, we also need to fetch the instructions from the code segment in memory. Considering this, in a real processor, the cache hit ratio will be lower for the amount of data that will be fetched for the instructions.

To address this, we will assume that one set of blocks (one "way") will be reserved for the code. With this assumption our hit ratio can be higher than the desired ratio, but we will assure that the ratio is not lower than the desired one.

With this assumption we will have for L1 instead of 2,048 blocks and 512 sets, we will have 1,536 organised in 512 sets (3-way cache) that can be used for "A", and for L2 instead of 16,384 blocks organised in 8,192 sets, we will have 8,192 blocks organised in 8,192 sets (or directly mapped cache) that we can use for "A". After the correction we can use the same approach as above to find values for Di and Dj.

After the correction the value of Dj will not be changed, it will still be 2, however the Di will be between 1 and 5.

回复收藏 0 原文

~没有更多了~