Cache behavior on redundant writes
Edit - I guess the question I asked was too long, so I'm making it very specific.
Question: Suppose a memory location is in the L1 cache, is not marked dirty, and holds a value X. What happens if you try to write X to the same location? Is there any CPU that would notice that such a write is redundant and skip it?
For example, is there an optimization which compares the two values and discards a redundant write back to main memory? Specifically, how do mainstream processors handle this? What about when the value is a special value like 0? If there's no such optimization even for a special value like 0, is there a reason?
Motivation: We have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling it amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply means setting all values to 0. Each time we recycle, size - n of the locations are already 0. It seems to me (intuitively) that avoiding so many redundant write-backs would make the recycling process faster, hence the question.
Doing this in code wouldn't make sense, since the branch itself might cause an unnecessary cache miss (if (buf[i]) {...}).
3 Answers
Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:

1. Read the old value from the cache.
2. Compare the old value with the new value.
3. If they are equal, skip the write; otherwise, perform it.
Step 1 may actually take longer than steps 2 and 3, because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if it were implemented in software.
Consider what happens if we simply write the new value to the cache without checking the old value. It is actually faster than the three-step process above, for two reasons. First, there is no need to wait for the old value. Second, the CPU can simply schedule the write operation in an output buffer. The output buffer can perform the cache write simultaneously while the ALU starts working on something else.
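To make the dependency concrete, here is a minimal sketch of the software analogue (the function names are mine, not from any real API): in store_checked the store cannot begin until the load of the old value completes, while store_plain has no such dependency and can be handed straight to the store buffer.

```cpp
#include <cassert>
#include <cstddef>

// Software analogue of the proposed "check before write" (hypothetical names).
// The conditional store depends on the load of the old value finishing first,
// which is exactly the serialization described above.
inline void store_checked(int* buf, std::size_t i, int x) {
    if (buf[i] != x)   // step 1: read the old value -- everything else waits on it
        buf[i] = x;    // steps 2-3: compare, then write only if different
}

inline void store_plain(int* buf, std::size_t i, int x) {
    buf[i] = x;        // unconditional write; no read of the old value needed
}
```

Both functions leave the buffer in the same final state; the difference is purely in the dependency chain the CPU has to resolve.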
So far, the only latencies involved are that of between the CPU and the cache, not between the cache and the main memory.
The situation is more complicated in modern microprocessors, because their caches are organized into cache lines. When a byte value is written to a cache line, the complete cache line has to be loaded, because the parts of the cache line that are not rewritten have to keep their old values.
http://blogs.amd.com/developer/tag/sse4a/
This is not an answer to your original question on computer-architecture, but might be relevant to your goal.
In this discussion, all array indices start at zero.
Assuming n is much smaller than size, change your algorithm so that it keeps two pieces of information: the full-size array itself, and a set container holding the indices of the entries that have been written.

Every time a non-zero value is written to index k in the full-size array, insert the value k into the set container.

When the full-size array needs to be cleared, get each value stored in the set container (which will contain k, among others), and set each corresponding index in the full-size array to zero.
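A minimal sketch of this index-set technique, assuming a single-threaded buffer (the struct and member names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Sketch of the index-set technique: the buffer remembers which indices were
// written, so recycling touches only those n locations instead of all `size`.
struct RecyclableBuffer {
    std::vector<int> data;
    std::unordered_set<std::size_t> dirty;  // the "set container" of written indices

    explicit RecyclableBuffer(std::size_t size) : data(size, 0) {}

    void write(std::size_t k, int value) {
        if (value != 0) dirty.insert(k);    // record index k on every non-zero write
        data[k] = value;
    }

    void recycle() {                        // clear only the indices actually written
        for (std::size_t k : dirty) data[k] = 0;
        dirty.clear();
    }
};
```

The trade-off is the per-write cost of maintaining the set; this only pays off when n is much smaller than size, as the answer assumes.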
A similar technique, known as a two-level histogram or radix histogram, can also be used.
Two pieces of information are stored: the full-size array of size elements, and a boolean array of ceil(size / M) elements, where M is the radix and ceil is the ceiling function.

Every time a non-zero value is written to index k in the full-size array, element floor(k / M) of the boolean array should be marked. Let's say bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M-1 in the full-size array. When the full-size array needs to be cleared, scan the boolean array for any marked elements, and clear each marked element's corresponding range in the full-size array.
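A minimal sketch of the two-level (radix histogram) variant, again with hypothetical names and an arbitrarily chosen radix M = 64:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the two-level ("radix histogram") technique: one boolean flag per
// block of M elements marks which blocks may contain non-zero values.
struct RadixClearBuffer {
    static constexpr std::size_t M = 64;            // the radix (block size); a tuning choice
    std::vector<int> data;
    std::vector<bool> block_dirty;                  // ceil(size / M) flags

    explicit RadixClearBuffer(std::size_t size)
        : data(size, 0), block_dirty((size + M - 1) / M, false) {}

    void write(std::size_t k, int value) {
        if (value != 0) block_dirty[k / M] = true;  // mark block floor(k / M)
        data[k] = value;
    }

    void recycle() {
        for (std::size_t j = 0; j < block_dirty.size(); ++j) {
            if (!block_dirty[j]) continue;          // block j covers [j*M, (j+1)*M - 1]
            std::size_t lo = j * M;
            std::size_t hi = std::min(lo + M, data.size());
            std::fill(data.begin() + lo, data.begin() + hi, 0);
            block_dirty[j] = false;
        }
    }
};
```

Compared with the set container, this bounds the bookkeeping to ceil(size / M) bits and clears whole contiguous blocks, at the cost of zeroing up to M elements per marked block even if only one was written.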
I am not aware of any processor that does the optimization you describe - eliminating writes to clean cache lines that would not change the value - but it's a good question, a good idea, great minds think alike and all that.
I wrote a great big reply, and then I remembered: this is called "silent stores" in the literature. See "Silent Stores for Free", K. Lepak and M. Lipasti, UWisc, MICRO-33, 2000.
Anyway, in my reply I described some of the implementation issues.
By the way, topics like this are often discussed in the Usenet newsgroup comp.arch.
I also write about them on my wiki, http://comp-arch.net