Proper use of the ARM PLD instruction (ARM11)

Posted 2024-11-16 10:41:03 · 257 words · 3 views · 0 comments

The ARM ARM doesn't actually say much about the proper usage of this instruction, but I've seen it used elsewhere and gather that it takes an address as a hint for where the next value will be read from.

My question is: given a 256-byte tight copy loop of ldm/stm instructions (say r4-r11 x 8), would it be better to prefetch each cache line before the copy, to prefetch in between each instruction pair, or not to prefetch at all, since the memcpy in question isn't reading and writing the same area of memory? I'm pretty sure my cache line size is 64 bytes, but it may be 32 bytes; I'm awaiting confirmation on that before writing the final code.
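
Since the prefetch stride depends on the cache line size, one way to check it from user space is sketched below. This is not ARM assembly but plain C, and it assumes a Linux system whose C library exposes the `_SC_LEVEL1_DCACHE_LINESIZE` query (a glibc extension that may report 0 on some ARM targets). The helper name `dcache_line_size` and the fallback parameter are hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: returns the L1 data cache line size in bytes,
   or `fallback` when the C library cannot report it. */
static size_t dcache_line_size(size_t fallback)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 when the value is unknown. */
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return (size_t)n;
#endif
    return fallback;
}
```

Calling `dcache_line_size(32)` would use 32 bytes only when the library has no answer; on a bare-metal target the value would have to come from the core's documentation instead.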

Comments (3)

荭秂 2024-11-23 10:41:03

From the Cortex-A Series Programmer's Guide, chapter 17.4 (NB: some details might be different for ARM11):

Best performance for memcpy() is achieved using LDM of a whole cache line and then writing these values with an STM of a whole cache line. Alignment of the stores is more important than alignment of the loads. The PLD instruction should be used where possible. There are four PLD slots in the load/store unit. A PLD instruction takes precedence over the automatic pre-fetcher and has no cost in terms of the integer pipeline performance. The exact timing of PLD instructions for best memcpy() can vary slightly between systems, but PLD to an address three cache lines ahead of the currently copying line is a useful starting point.
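
The "three cache lines ahead" advice can be sketched in C as follows. This is only an illustration, not the guide's code: it assumes a GCC or Clang toolchain (whose `__builtin_prefetch` is typically lowered to PLD on ARM targets that support it), and both the 32-byte line size and the function name `copy_with_prefetch` are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Assumed values: adjust the line size to the actual core. */
enum { PF_LINE_BYTES = 32, PF_LINES_AHEAD = 3 };

/* Copies n bytes one assumed cache line at a time, hinting the line
   PF_LINES_AHEAD lines in front of the one currently being copied. */
static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t dist = (size_t)PF_LINES_AHEAD * PF_LINE_BYTES;

    while (n >= PF_LINE_BYTES) {
        /* Only hint addresses that are still inside the source buffer. */
        if (n > dist)
            __builtin_prefetch(s + dist, 0, 3);  /* 0 = read, 3 = keep in cache */
        memcpy(d, s, PF_LINE_BYTES);             /* one whole line per step */
        d += PF_LINE_BYTES;
        s += PF_LINE_BYTES;
        n -= PF_LINE_BYTES;
    }
    memcpy(d, s, n);                             /* copy the remaining tail */
}
```

The prefetch distance is kept as a named constant because, as the quote says, the best value varies between systems.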

随心而道 2024-11-23 10:41:03

An example of a reasonably generic copy loop that makes use of cacheline-sized LDM/STM blocks and/or PLD where available can be found in the Linux kernel, arch/arm/lib/copy_page.S. That implements what Igor mentions above, regarding the use of preloads, and illustrates the blocking.

Note that on ARMv7 (where the cacheline size is usually 64 bytes) it's not possible to LDM a full cacheline as a single op: a 64-byte line needs 16 registers, and only 14 are usable since SP/PC can't be touched for this. So you might have to use two or four LDM/STM pairs.
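
The register arithmetic behind "two or four pairs" can be written out as a small C helper. The function name `ldm_stm_pairs` is hypothetical; it only assumes 4 bytes per register.

```c
#include <stddef.h>

/* Hypothetical helper: number of LDM/STM pairs needed to move one cache
   line when each LDM/STM transfers `regs` 32-bit registers. */
static size_t ldm_stm_pairs(size_t line_bytes, size_t regs)
{
    size_t bytes_per_pair = regs * 4;  /* 4 bytes per register */
    return (line_bytes + bytes_per_pair - 1) / bytes_per_pair;  /* round up */
}
```

With a 64-byte line, 8 registers per transfer give 2 pairs and 4 registers give 4 pairs; a 32-byte line fits in a single 8-register pair.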

土豪我们做朋友吧 2024-11-23 10:41:03

To really get the "fastest" possible ARM asm code, you will need to test different approaches on your system. As far as an ldm/stm loop goes, this one seems to work the best for me:

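  // Use non-conflicting register r12 to avoid waiting for r6 in pld
  // r6 = source, r10 = destination, r11 = word count (non-zero multiple of 16, or bne never exits)

  pld [r6, #0]                                // preload the first source line
  add r12, r6, #32                            // r12 = source + 32, base for the plds below

1:
  ldm r6!, {r0, r1, r2, r3, r4, r5, r8, r9}   // load 8 words (32 bytes), advance r6
  pld   [r12, #32]                            // preload 64 bytes past this iteration's start
  stm r10!, {r0, r1, r2, r3, r4, r5, r8, r9}  // store 8 words, advance r10
  subs r11, r11, #16                          // 16 words (64 bytes) per iteration; sets flags for bne
  ldm r6!, {r0, r1, r2, r3, r4, r5, r8, r9}   // load the second 32 bytes
  pld   [r12, #64]                            // preload 96 bytes past this iteration's start
  stm r10!, {r0, r1, r2, r3, r4, r5, r8, r9}  // store the second 32 bytes
  add r12, r6, #32                            // re-base r12; plain add keeps the flags from subs
  bne 1b                                      // repeat until r11 reaches zero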

The block above assumes that you have already set up r6 (source), r10 (destination) and r11, and that the loop counts r11 down in words, not bytes. I have tested this on a Cortex-A9 (iPad2) and it seems to give quite good results on that processor. But be careful, because on a Cortex-A8 (iPhone4) a NEON loop seems to be faster than ldm/stm, at least for larger copies.
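
Since the right prefetch distance has to be found by measurement, a small timing harness along these lines may help. It is a sketch in C rather than assembly, assuming a GCC or Clang toolchain for `__builtin_prefetch`; the 32-byte line size and the names `copy_ahead` and `time_copy` are assumptions, and `clock()` measures CPU time with limited resolution.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

enum { BENCH_LINE_BYTES = 32 };  /* assumed cache line size in bytes */

/* Copies n bytes line by line, prefetching `ahead` lines in front of the
   load pointer; ahead == 0 disables prefetching. */
static void copy_ahead(unsigned char *d, const unsigned char *s,
                       size_t n, size_t ahead)
{
    const size_t dist = ahead * BENCH_LINE_BYTES;

    while (n >= BENCH_LINE_BYTES) {
        if (dist != 0 && n > dist)            /* stay inside the source buffer */
            __builtin_prefetch(s + dist, 0, 3);
        memcpy(d, s, BENCH_LINE_BYTES);
        d += BENCH_LINE_BYTES;
        s += BENCH_LINE_BYTES;
        n -= BENCH_LINE_BYTES;
    }
    memcpy(d, s, n);                          /* remaining tail */
}

/* Returns the CPU seconds spent on `reps` copies at the given distance. */
static double time_copy(unsigned char *d, const unsigned char *s,
                        size_t n, size_t ahead, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        copy_ahead(d, s, n, ahead);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Running `time_copy` for `ahead` values such as 0, 1, 2 and 3 on buffers larger than the cache, and repeating each run several times, gives a rough comparison; the same harness shape could wrap an assembly routine instead of `copy_ahead`.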
