How fast is an L1 cache hit, compared with other ARM instructions?

Posted 2025-02-06 16:07:34


The newer ARM Architecture Reference Manuals don't give instruction timings any more. (Instruction timings were given, at least for the early ARM2 and ARM3 chips).

I know that cache misses result in external memory accesses that are very slow, compared with, say, data instructions like ADD x0, x1, x2 or BIC x0, x1, x2.

But how fast is an L1 cache hit?

If the answer is "it depends ..." what would be a rough guess (ballpark) figure?

Cache enabled (obviously). "Flat" memory mapping (i.e. virtual address = physical address).

I suppose the answer also depends on the precise hardware being used. And that one should simply write test cases and measure the specific timings one's interested in...

I'm interested in the ARMv8 Raspberry Pi models -- which I don't possess. (I'm using QEMU).

I'd also be interested in any other timings, say, relative to:

ADD x0, xzr, xzr         ; == 1

FADD d0, d1, d2          ; floating-point (ADD on d-registers would be a SIMD integer add)

LDR x0, [x2]             ; L1 cache hit
LDR x0, [x2]             ; L1 cache miss, L2 cache hit
LDR x0, [x2]             ; L1 cache miss, L2 cache miss

LDP x0, x1, [x2]         ; L1 cache hit
LDP x0, x1, [x2]         ; L1 cache miss, L2 cache hit
LDP x0, x1, [x2]         ; L1 cache miss, L2 cache miss

Basically, what I really want to know is "when is it faster to load a value from memory rather than compute it? (on a Raspberry Pi 4B)"
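As a concrete instance of the load-vs-compute question: an 8-bit popcount can either be fetched from a 256-entry table (one L1 load when the table is warm) or recomputed with a short chain of ALU ops. A sketch of both variants (function names are my own):

```c
#include <stdint.h>

/* 256-entry table: a warm lookup costs one dependent L1 load. */
static uint8_t popcount_table[256];

static void build_table(void) {
    for (int i = 0; i < 256; i++) {
        int c = 0;
        for (int b = i; b; b >>= 1)
            c += b & 1;
        popcount_table[i] = (uint8_t)c;
    }
}

static uint8_t popcount_lookup(uint8_t x) {
    return popcount_table[x];         /* one load, hopefully an L1 hit */
}

/* SWAR recomputation: three dependent add/mask steps, no memory traffic. */
static uint8_t popcount_compute(uint8_t x) {
    x = (x & 0x55) + ((x >> 1) & 0x55);   /* sum bit pairs   */
    x = (x & 0x33) + ((x >> 2) & 0x33);   /* sum nibbles     */
    x = (x & 0x0F) + (x >> 4);            /* sum both halves */
    return x;
}
```

If the latency figures cited in the answer below-left-of-4-cycles territory hold, the computed version is at least competitive here; a table only pays off when it stays hot in cache and replaces a longer dependent computation.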

There's the page Approximate cost to access various caches and main memory? but that refers to Intel chips.


Comments (1)

故事未完 2025-02-13 16:07:34


I found https://developer.arm.com/documentation/uan0016/a/ (the Cortex-A72 Software Optimization Guide; the Pi 4B uses Cortex-A72 cores), from which it appears that an LDR hitting L1 cache has a latency of 4 and a throughput of 1, while a basic ALU op has a latency of 1 and a throughput of 2.
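Taking those numbers at face value gives a back-of-envelope break-even rule: recomputing wins while the dependent ALU chain is shorter than the load-to-use latency. A toy model of that arithmetic (latency-only; it ignores throughput, issue width, and assumes the table line is already in L1):

```c
/* Cortex-A72 figures from the optimization guide: ~4-cycle L1 load-to-use
   latency, 1-cycle simple ALU ops. Critical-path latencies only. */
enum { L1_LOAD_LATENCY = 4, ALU_LATENCY = 1 };

/* Cycles to recompute a value as a chain of n dependent ALU ops. */
static int compute_cost(int n_dependent_ops) {
    return n_dependent_ops * ALU_LATENCY;
}

/* Cycles to fetch it instead, assuming an L1 hit. */
static int lookup_cost(void) {
    return L1_LOAD_LATENCY;
}
```

So on this model a lookup only beats recomputation once more than four dependent ALU ops would be replaced; an L2 hit or a DRAM access pushes the break-even far higher.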
