Optimizing ARM cache usage for different arrays
I want to port a small piece of code to an ARM Cortex-A8 processor. Both the L1 cache and the L2 cache are very limited. There are 3 arrays in my program. Two of them are accessed sequentially (sizes: Array A: 6 MB, Array B: 3 MB), while the access pattern for the third array (Array C: 3 MB) is unpredictable. The calculations are not very heavy, but there are huge cache misses when accessing Array C. One solution I thought of would be to allocate more cache (L2) space for Array C and less for Arrays A and B, but I could not find any way to achieve this. I went through ARM's preload engine but could not find anything useful.
2 Answers
It would be a good idea to split the cache and allocate each array in a different part of it.
Unfortunately that is not possible. The caches of the Cortex-A8 are just not that flexible. The good old StrongARM had a secondary cache for exactly this splitting purpose, but it is not available anymore. We have L1 and L2 caches instead (overall a good change, imho).
However, there is one thing you can do:
The NEON SIMD unit of the Cortex-A8 lags behind the general-purpose (ARM) pipeline by around 10 processor cycles. With clever programming you can issue cache prefetches from the ARM pipeline but do the actual accesses via NEON. The delay between the two pipelines gives the cache some time to complete the prefetches, so your average cache-miss penalty will be lower.
The drawback is that you must never move the result of a calculation back from NEON to the ARM unit. Since NEON lags behind, this causes a full CPU pipeline flush, which is almost as costly as a cache miss, if not more so.
The difference in performance can be significant. As a rough guess, I would expect a speed improvement of somewhere between 20% and 30%.
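A minimal sketch of that pattern, assuming GCC or Clang targeting ARMv7-A with NEON enabled (e.g. -mfpu=neon). The function, the scaled-copy workload, and the prefetch distance of 64 floats are all made up for illustration; the point is only that the prefetch is issued from the ARM integer pipeline while the loads, the multiply, and the store stay on the NEON side, and no result is ever moved back into an ARM register.

```c
#include <arm_neon.h>
#include <stddef.h>

/* Scale src[] into dst[] four floats at a time (tail handling omitted). */
void scaled_copy(float *dst, const float *src, size_t n, float scale)
{
    const float32x4_t vscale = vdupq_n_f32(scale);

    for (size_t i = 0; i + 4 <= n; i += 4) {
        /* Prefetch a few cache lines ahead on the ARM side (compiles to PLD).
         * The distance of 64 floats (256 bytes) would need tuning. */
        __builtin_prefetch(&src[i + 64], 0, 0);

        /* Data movement and arithmetic stay on the NEON pipeline; the
         * result is stored straight back to memory and never read into
         * an ARM register, so no cross-pipeline stall is triggered. */
        float32x4_t v = vld1q_f32(&src[i]);
        vst1q_f32(&dst[i], vmulq_f32(v, vscale));
    }
}
```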
From what I could find via Google, it looks like ARMv7 (the version of the ISA that the Cortex-A8 implements) has cache-flush capability, though I couldn't find a clear reference on how to use it; perhaps you can do better if you spend more time on it than the minute or two I spent typing "ARM cache flush" into a search box and reading the results.
In any case, you should be able to achieve an approximation of what you want by periodically issuing "flush" instructions to flush out the parts of A and B that you know you no longer need.
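As a rough sketch of what such flushing might look like (my addition, not from the answer): on ARMv7-A the relevant operation is the CP15 clean-and-invalidate-by-address maintenance operation (DCCIMVAC), which is privileged, so a loop like the one below would have to run in kernel context, e.g. from a small kernel module. The function name is hypothetical, and 64 bytes matches the Cortex-A8 cache line size.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* Cortex-A8 L1/L2 data cache line size */

/* Clean and invalidate [start, start+len) from the data caches, so lines
 * belonging to already-consumed regions of A or B stop occupying L2.
 * Must run in a privileged mode: CP15 access is undefined in user mode. */
static void dcache_clean_invalidate_range(void *start, size_t len)
{
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)start + len;

    for (; addr < end; addr += CACHE_LINE) {
        /* DCCIMVAC: clean and invalidate data cache line by MVA to PoC. */
        __asm__ volatile("mcr p15, 0, %0, c7, c14, 1" : : "r"(addr) : "memory");
    }
    /* Wait for the maintenance operations to complete. */
    __asm__ volatile("dsb" ::: "memory");
}
```

From user space on Linux you would normally have to go through the kernel for this, since unprivileged code cannot execute CP15 maintenance operations directly.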