冗余写入时的缓存行为

发布于 2024-09-11 05:48:43 字数 564 浏览 2 评论 0原文

编辑 - 我想我问的问题太长了,所以我把它说得非常具体。

问题:如果某个内存位置位于 L1 缓存中且未标记为脏。假设它的值是 X。如果您尝试将 X 写入同一位置会发生什么?有没有CPU会发现这样的写入是多余的并跳过它?

例如,是否有一种优化可以比较两个值并丢弃冗余写回主内存?具体来说主流处理器是如何处理这个问题的呢?当值是像 0 这样的特殊值时怎么办?如果即使对于像 0 这样的特殊值也没有这样的优化,有什么原因吗?

动机:我们有一个可以轻松放入缓存的缓冲区。多个线程可以通过在它们之间回收来潜在地使用它。每次使用都涉及写入缓冲区中的n 个位置(不一定是连续的)。回收仅仅意味着将所有值设置为 0。每次我们回收时,size-n 位置已经是 0。对我来说(直观地),避免如此多的冗余写回将使回收过程更快,并且因此问题出现了。

在代码中执行此操作没有意义,因为分支指令本身可能会导致不必要的缓存未命中(if (buf[i]) {...} )

Edit - I guess the question I asked was too long so I'm making it very specific.

Question: If a memory location is in the L1 cache and not marked dirty. Suppose it has a value X. What happens if you try to write X to the same location? Is there any CPU that would see that such a write is redundant and skip it?

For example is there an optimization which compares the two values and discards a redundant write back to the main memory? Specifically how do mainstream processors handle this? What about when the value is a special value like 0? If there's no such optimization even for a special value like 0, is there a reason?

Motivation: We have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply implies setting all values to 0. Each time we recycle, size-n locations are already 0. To me it seems (intuitively) that avoiding so many redundant write backs would make the recycling process faster and hence the question.

Doing this in code wouldn't make sense, since branch instruction itself might cause an unnecessary cache miss (if (buf[i]) {...} )

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

青朷 2024-09-18 05:48:44

您建议的硬件优化不会减少延迟。考虑最低级别的操作:

  1. 该位置的旧值从缓存加载到 CPU(假设它已经在缓存中)。
  2. 比较旧值和新值。
  3. 如果新旧值不同,则将新值写入缓存。否则它会被忽略。

步骤 1 实际上可能比步骤 2 和 3 花费更长的时间。这是因为只有步骤 1 中的旧值被带入 CPU 后,步骤 2 和 3 才能开始。如果用软件实现的话,情况也是一样的。

考虑一下我们是否只是将新值写入缓存,而不检查旧值。它实际上比上面提到的三步过程更快,原因有两个。首先,无需等待旧值。其次,CPU 可以简单地调度输出缓冲区中的写操作。输出缓冲区可以同时执行高速缓存写入,而 ALU 可以开始处理其他事情。

到目前为止,唯一涉及的延迟是 CPU 和缓存之间的延迟,而不是缓存和主内存之间的延迟。


在现代微处理器中,情况更加复杂,因为它们的缓存被组织成缓存行。当将字节值写入高速缓存行时,必须加载完整的高速缓存行,因为未重写的高速缓存行的其他部分必须保留其旧值。

http://blogs.amd.com/developer/tag/sse4a/

  • 缓存命中:数据从缓存行读取到目标寄存器
  • 缓存未命中:数据从内存移至缓存,并读入目标寄存器

  • 缓存命中:数据从寄存器移动到缓存行
  • Cache miss:缓存行被取入缓存,寄存器中的数据被移至缓存行

Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:

  1. The old value at the location is loaded from the cache to the CPU (assuming it is already in the cache).
  2. The old and new values are compared.
  3. If the old and new values are different, the new value is written to the cache. Otherwise it is ignored.

Step 1 may actually take longer time than steps 2 and 3. It is because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if it was implemented in software.

Consider if we simply write the new values to the cache, without checking the old value. It is actually faster than the three-step process mentioned above, for two reasons. Firstly, there is no need to wait for the old value. Secondly, the CPU can simply schedule the write operation in an output buffer. The output buffer can perform the cache write simutaneously while the ALU can start working on something else.

So far, the only latencies involved are that of between the CPU and the cache, not between the cache and the main memory.


The situation is more complicated in modern-day microprocessors, because their cache is organized into cache-lines. When a byte value is written to a cache-line, the complete cache-line has to be loaded because the other part of the cache-line that is not rewritten has to keep its old values.

http://blogs.amd.com/developer/tag/sse4a/

Read

  • Cache hit: Data is read from the cache line to the target register
  • Cache miss: Data is moved from memory to the cache, and read into the target register

Write

  • Cache hit: Data is moved from the register to the cache line
  • Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line
﹏雨一样淡蓝的深情 2024-09-18 05:48:44

这不是您最初关于计算机体系结构问题的答案,但可能与您的目标相关。

在本讨论中,所有数组索引都从零开始。


假设 n 远小于 size,请更改您的算法,以便节省两部分information:

  1. 一个size的数组,
  2. 一个n的数组,以及一个计数器,用于模拟集合容器。允许重复值。

每次将非零值写入全尺寸数组中的索引k时,将值k插入到集合容器中。

当需要清除全长数组时,获取集合容器中存储的每个值(其中将包含k等),并将全长数组中的每个相应索引设置为零。


也可以使用类似的技术,称为两级直方图或基数直方图。

存储两条信息:

  1. size 的数组
  2. ceil(size / M) 的布尔数组,其中 M 是基数。 ceil 是上限函数。

每次将非零值写入全尺寸数组中的索引 k 时,数组中的元素 floor(k / M)布尔数组应该被标记。

假设 bool_array[j] 已被标记。这对应于完整的 j*M(j+1)*M-1 的范围- 大小数组。

当需要清除全长数组时,扫描布尔数组中是否有任何标记的元素,并且其在全长数组中的相应范围应该被清除。

This is not an answer to your original question on computer-architecture, but might be relevant to your goal.

In this discussion, all array index starts with zero.


Assuming n is much smaller than size, change your algorithm so that it saves two pieces of information:

  1. An array of size
  2. An array of n, and a counter, used to emulate a set container. Duplicate values allowed.

Every time a non-zero value is written to the index k in the full-size array, insert the value k to the set container.

When the full-size array needs to be cleared, get each value stored in the set container (which will contain k, among others), and set each corresponding index in the full-size array to zero.


A similar technique, known as a two-level histogram or radix histogram, can also be used.

Two pieces of information are stored:

  1. An array of size
  2. An boolean array of ceil(size / M), where M is the radix. ceil is the ceiling function.

Every time a non-zero value is written to index k in the full-size array, the element floor(k / M) in the boolean array should be marked.

Let's say, bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M-1 in the full-size array.

When the full-size array needs to be cleared, scan the boolean array for any marked elements, and its corresponding range in the full-size array should be cleared.

极致的悲 2024-09-18 05:48:43

我不知道有任何处理器可以进行您所描述的优化 - 消除对干净缓存行的写入不会改变值 - 但这是一个好问题,一个好主意,英雄所见略同等等。

我写了一篇很大的回复,然后我想起来:这在文献中叫做“沉默的商店”。请参阅“Silent Stores for Free”,K. Lepak 和 M Lipasti,UWisc,MICRO-33,2000。

无论如何,在我的回复中我描述了一些实施问题。

顺便说一下,类似这样的话题经常在 USEnet 新闻组 comp.arch 中讨论。

我还在我的 wiki 上写了关于它们的内容,http://comp-arch.net

I am not aware of any processor that does the optimization you describe - eliminating writes to clean cache lines that would not change the value - but it's a good question, a good idea, great minds think alike and all that.

I wrote a great big reply, and then I remembered: this is called "Silent Stores" in the literature. See "Silent Stores for Free", K. Lepak and M Lipasti, UWisc, MICRO-33, 2000.

Anyway, in my reply I described some of the implementation issues.

By the way, topics like this are often discussed in the USEnet newsgroup comp.arch.

I also write about them on my wiki, http://comp-arch.net

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文