Scattered write speed vs. scattered read speed on modern Intel or AMD CPUs?
I'm thinking of optimizing a program by taking a linear array and writing each element to an arbitrary location (random-like from the perspective of the CPU) in another array. I am only doing simple writes and not reading the elements back.
I understand that a scattered read for a classical CPU can be quite slow, as each access will cause a cache miss and thus a processor wait. But I was thinking that a scattered write could technically be fast because the processor isn't waiting for a result, so it may not have to wait for the transaction to complete.
I am unfortunately unfamiliar with all the details of the classical CPU memory architecture and thus there may be some complications that may cause this also to be quite slow.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have a linear array from which I read arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thought is that I can invert this operation into a scattered write for a significant speed benefit.)
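For concreteness, the two access patterns being compared can be sketched as below. The function names, element type, and index array are illustrative, not from the question:

```c
/* Sketch of the two access patterns discussed above.
   Names (gather/scatter) and types are illustrative. */
#include <stddef.h>

/* Scattered read (gather): dst is walked linearly, src is hit at
   arbitrary indices -- each src access can miss the cache. */
void gather(int *dst, const int *src, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Scattered write (scatter): src is walked linearly, dst is hit at
   arbitrary indices. Note each dst access can still miss the cache,
   because an ordinary write miss fetches the whole destination cache
   line first (write-allocate). */
void scatter(int *dst, const int *src, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}
```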
2 Answers
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
I must admit, this sounds kind of hardcore. But I take the risk and answer anyway.
Is it possible to divide the output into pages, and read/scan the input multiple times? On each pass through the input, you only process (or output) the data that belongs to a limited number of destination pages. This way you only get cache misses at the start of each pass.
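The simplest variant of this multi-pass idea, one destination page per pass, can be sketched as follows. The names (`PAGE`, `scatter_paged`) and the 4 KiB page size are my assumptions, not from the answer:

```c
/* Minimal sketch of the multi-pass scatter described above: scan the
   input once per destination "page" and only emit elements landing in
   the current page, so destination misses stay within one page per
   pass. PAGE size and function name are illustrative assumptions. */
#include <stddef.h>

enum { PAGE = 4096 / sizeof(int) };  /* destination elements per 4 KiB page */

void scatter_paged(int *dst, const int *src, const size_t *idx,
                   size_t n, size_t dst_len) {
    for (size_t base = 0; base < dst_len; base += PAGE) {
        size_t end = base + PAGE;
        /* One sequential pass over the input per destination page;
           the input streams through the cache and prefetches well. */
        for (size_t i = 0; i < n; i++)
            if (idx[i] >= base && idx[i] < end)
                dst[idx[i]] = src[i];
    }
}
```

In practice you would process several pages per pass (sized so the working set fits in cache) rather than one, or first radix-partition the (index, value) pairs into per-page buckets so the input is scanned only once.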