Emulating LDREX/STREX (load/store exclusive) on the Cortex-M0

Posted 2024-11-02 20:57:52

In the Cortex-M3 instruction set, there exists a family of LDREX/STREX instructions such that if a location is read with an LDREX instruction, a following STREX instruction can write to that address only if the address is known to have been untouched. Typically, the effect is that the STREX will succeed if no interrupts ("exceptions" in ARM parlance) have occurred since the LDREX, but fail otherwise.

What's the most practical way to simulate such behavior in the Cortex M0? I would like to write C code for the M3 and have it portable to the M0. On the M3, one can say something like:

__inline void do_inc(unsigned int *dat)
{
  while(__strex(__ldrex(dat)+1,dat)) {}
}

to perform an atomic increment. The only ways I can think of to achieve similar functionality on the Cortex-M0 would be to either:

Have "ldrex" disable exceptions and have "strex" and "clrex" re-enable them, with the requirement that every "ldrex" must be followed soon thereafter by either a "strex" or a "clrex".
Have "ldrex", "strex", and "clrex" be very small routines in RAM, with one instruction of the "strex" routine being patched to either "str r1,[r2]" or "mov r0,#1". Have the "ldrex" routine plug a "str" instruction into the "strex" routine, and have the "clrex" routine plug "mov r0,#1" there. Have all exceptions that might invalidate a "ldrex" sequence call "clrex".

Depending upon how the ldrex/strex functions are used, disabling interrupts might work reasonably, but it seems icky to change the semantics of "load-exclusive" so as to cause bad side-effects if it's abandoned. The code-patching idea seems like it would achieve the desired semantics, but it seems clunky.
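
For concreteness, the first option might look roughly like the sketch below. The names m0_ldrex, m0_strex, m0_clrex and the M0_IRQ_OFF/M0_IRQ_ON macros are made up for illustration; the sketch assumes GCC-style inline assembly on an ARMv6-M target and falls back to a plain variable elsewhere so it can be compiled on a host. It also assumes interrupts are enabled on entry, which is exactly the fragility described above: an abandoned m0_ldrex leaves interrupts masked.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_6M__)
/* Real target: set/clear PRIMASK. */
#define M0_IRQ_OFF() __asm__ __volatile__("cpsid i" ::: "memory")
#define M0_IRQ_ON()  __asm__ __volatile__("cpsie i" ::: "memory")
#else
/* Host stand-in (assumption, for off-target builds only). */
static int m0_irq_masked;
#define M0_IRQ_OFF() (m0_irq_masked = 1)
#define M0_IRQ_ON()  (m0_irq_masked = 0)
#endif

/* Hypothetical stand-in for __ldrex: mask interrupts, then load. */
static inline uint32_t m0_ldrex(volatile uint32_t *addr) {
  M0_IRQ_OFF();
  return *addr;
}

/* Hypothetical stand-in for __strex: the store cannot be disturbed by an
 * interrupt, so it always succeeds and returns 0. */
static inline int m0_strex(uint32_t value, volatile uint32_t *addr) {
  *addr = value;
  M0_IRQ_ON();
  return 0;
}

/* Hypothetical stand-in for __clrex: give up the "reservation". */
static inline void m0_clrex(void) {
  M0_IRQ_ON();
}

/* Same shape as the M3 version of do_inc. */
static inline void do_inc(volatile uint32_t *dat) {
  while (m0_strex(m0_ldrex(dat) + 1, dat)) {}
}
```

With this shape the M3 source compiles unchanged apart from the renamed primitives, but the loop never actually retries on the M0.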

(BTW, side question: I wonder why STREX on the M3 stores the success/failure indication to a register rather than simply setting a flag? Its actual operation requires four extra bits in the opcode, requires that a register be available to hold the success/failure indication, and requires that a "cmp r0,#0" be used to determine if it succeeded. Was it expected that compilers wouldn't be able to handle a STREX intrinsic sensibly if they didn't get the result in a register? Getting Carry into a register takes two short instructions.)

如痴如狂 2024-11-09 20:57:52

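~~Hmm... you still have SWP, but it's a less powerful atomic instruction.~~

Disabling interrupts will certainly work, though. :-)

Edit:

There is no SWP on the -m0, sorry supercat.

OK, so it seems you're left with only disabling interrupts.
You can use this gcc-compilable inline asm as a guide to how to disable them and restore them properly:
http://repo.or.cz/w/cbaos.git/blob/HEAD:/arch/arm-cortex-m0/include/lock.h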
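
As a rough idea of what "disable and properly restore" can mean, here is a minimal sketch. It is not copied from the linked lock.h; the names irq_save, irq_restore, shared_counter and counter_add are invented for illustration, and it assumes GCC-style inline assembly on ARMv6-M, with a host stand-in otherwise. The point is that the previous PRIMASK value is put back rather than interrupts being enabled unconditionally, so critical sections can nest.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_6M__)
/* Read PRIMASK, then mask interrupts; returns the previous PRIMASK value. */
static inline uint32_t irq_save(void) {
  uint32_t primask;
  __asm__ __volatile__("mrs %0, primask\n\t"
                       "cpsid i"
                       : "=r" (primask) : : "memory");
  return primask;
}

/* Put PRIMASK back to what irq_save() returned. */
static inline void irq_restore(uint32_t primask) {
  __asm__ __volatile__("msr primask, %0" : : "r" (primask) : "memory");
}
#else
/* Host stand-in so the sketch compiles off-target: 1 means "masked". */
static uint32_t sim_primask;
static inline uint32_t irq_save(void) {
  uint32_t old = sim_primask;
  sim_primask = 1;
  return old;
}
static inline void irq_restore(uint32_t primask) { sim_primask = primask; }
#endif

static volatile uint32_t shared_counter;

/* Read-modify-write that an interrupt handler cannot split in two. */
static inline void counter_add(uint32_t n) {
  uint32_t state = irq_save();
  shared_counter += n;
  irq_restore(state);
}
```

Calling counter_add() from inside another irq_save()/irq_restore() pair leaves interrupts masked until the outer restore, which is the behaviour a plain disable/enable pair would get wrong.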

花开浅夏 2024-11-09 20:57:52

The Cortex-M3 was designed for heavily low-latency, low-jitter multitasking, i.e. its interrupt controller cooperates with the core in order to keep guarantees on the number of cycles from interrupt triggering to interrupt handling. The ldrex/strex pair was implemented as a way to cooperate with all that (by "all that" I mean interrupt masking and other details such as atomic bit setting via bit-band aliases), as otherwise a single-core, non-MMU, non-cache µC would have little use for it. If it didn't implement it, though, a low-priority task would have to hold a lock, and that could generate small priority inversions, with latency and jitter that a hard real-time system can't cope with, at least not within the order of magnitude allowed by the "retry" semantics that a failed ldrex/strex has.

On a side note, and speaking strictly in terms of timing and jitter, the Cortex-M0 has a more traditional interrupt timing profile (i.e. it will not abort instructions on the core when an interrupt arrives), so it is subject to far more jitter and latency. On this matter (again, strictly timing), it's more comparable to older models (e.g. the ARM7TDMI), which also lack atomic load/modify/store as well as atomic increments and decrements and other low-latency cooperative instructions, and so need to disable/enable interrupts more often.

I use something like this on the Cortex-M3:

#include <stdint.h>

#define unlikely(x) __builtin_expect((long)(x), 0)

/* Load-linked: read *addr and mark the address for exclusive access. */
static inline int atomic_LL(volatile void *addr) {
  int dest;

  __asm__ __volatile__("ldrex %0, [%1]" :
          "=r" (dest) : "r" (addr) : "memory");
  return dest;
}

/* Store-conditional: returns 0 if the store happened, 1 if it was refused. */
static inline int atomic_SC(volatile void *addr, int32_t value) {
  int dest;

  __asm__ __volatile__("strex %0, %2, [%1]" :
          "=&r" (dest) : "r" (addr), "r" (value) : "memory");
  return dest;
}

/**
 * atomic Compare And Swap
 * @param addr Address
 * @param expected Expected value in *addr
 * @param store Value to be stored, if (*addr == expected).
 * @return 0  ok, 1 failure.
 */
static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  int ret;

  do {
    /* Give up as soon as the current value is not the expected one. */
    if (unlikely(atomic_LL(addr) != expected))
      return 1;
    /* Retry only when the exclusive store itself was refused. */
  } while (unlikely((ret = atomic_SC(addr, store))));
  return ret;
}

In other words, it wraps ldrex/strex as the well-known Load-Linked and Store-Conditional primitives, and builds Compare-and-Swap semantics on top of them.

If your code does fine with only compare-and-swap, you can implement it for the Cortex-M0 like this:

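/* __interrupt_disable()/__interrupt_enable() stand for whatever your
 * toolchain provides (e.g. CMSIS __disable_irq()/__enable_irq()). */
static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  volatile int32_t *p = (volatile int32_t *)addr;
  int ret = 1;

  __interrupt_disable();
  if (*p == expected) {
    *p = store;
    ret = 0;
  }
  /* Note: this re-enables interrupts unconditionally; save and restore
   * PRIMASK instead if it may be called with interrupts already masked. */
  __interrupt_enable();
  return ret;
}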

That's the most used pattern because some architectures originally only had it (x86 comes to mind).
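
To illustrate how code written against compare-and-swap alone stays portable, here is a hedged sketch of an atomic add built as a CAS retry loop. The helper atomic_add_return and the CAS_IRQ_DISABLE/CAS_IRQ_ENABLE macros are hypothetical names; the macros default to no-ops so the sketch compiles on a host, and on a real Cortex-M0 they would have to map to the platform's interrupt masking. On an M3 the same atomic_add_return could sit on top of an ldrex/strex-based atomic_CAS instead.

```c
#include <stdint.h>

/* Placeholders (assumption): map these to real interrupt masking on target. */
#ifndef CAS_IRQ_DISABLE
#define CAS_IRQ_DISABLE() ((void)0)
#define CAS_IRQ_ENABLE()  ((void)0)
#endif

/* Interrupt-masking compare-and-swap: 0 = stored, 1 = value did not match. */
static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  volatile int32_t *p = (volatile int32_t *)addr;
  int ret = 1;

  CAS_IRQ_DISABLE();
  if (*p == expected) {
    *p = store;
    ret = 0;
  }
  CAS_IRQ_ENABLE();
  return ret;
}

/* Hypothetical helper: add delta to *addr atomically, return the new value. */
static inline int32_t atomic_add_return(volatile int32_t *addr, int32_t delta) {
  int32_t old, next;

  do {
    old = *addr;                                       /* snapshot */
    next = (int32_t)((uint32_t)old + (uint32_t)delta); /* wraps, no signed overflow */
  } while (atomic_CAS(addr, old, next));               /* retry if *addr changed */
  return next;
}
```

The loop only repeats when something changed the value between the snapshot and the swap, so the retry cost is paid by the interrupted code rather than by the interrupt.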

Implementing an emulation of the LL/SC pattern by CAS seems ugly from where I stand, especially when the SC is more than a few instructions away from the LL. Although that is very common, ARM doesn't recommend it, especially in the Cortex-M3 case: since any interrupt will make strex fail, if you start taking too long between ldrex and strex your code will spend a lot of time in a loop retrying strex, which could be interpreted as abusing the pattern and defeating its own purpose.

As for your side question, in the Cortex-M3 case strex returns its status in a register because the semantics were already defined by higher-level architectures (strex/ldrex exist in multi-core ARMs that were implemented before ARMv7-M was defined, and after it, where the cache controllers actually check the ldrex/strex addresses, i.e. strex only fails when the cache controller can't prove that the data line the load/store touches was unmodified).

If I were to speculate, I'd say the original design has these semantics because in the early days this kind of atomic was designed with libraries in mind: you'd return success/failure from functions implemented in assembler, these would need to respect the ABI, and most ABIs (all that I know of) use a register or the stack, not the flags, to return values.

Also, compilers are better at register coloring than at coping with clobbered flags when some other instruction uses them. Consider a complex operation that generates flags, with an ldrex/strex sequence in the middle of it, and a following operation that needs the flags: the compiler would have to move the flags to a register, requiring extra instruction(s) anyway.

夏夜暖风 2024-11-09 20:57:52

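Although the official ARMv6-M specification strongly recommends treating the HardFault exception as fatal, and halting or resetting the chip without leaving handler context, you can emulate instructions that are missing on Cortex-M0(+) cores in the HardFault handler and then return to just after the faulting instruction.

The example code provided with m0FaultDispatch (ab)uses this to emulate other missing instructions (integer division). Unless you are very careful and know every possible cause of a hard fault on your chip, this kind of emulation can hide other, legitimate causes of hard faults and let your code carry on into uncharted territory.

And no emulation will match the performance expected of LDREX/STREX on an ARMv7-M chip.

Edit: Emulating the exclusive monitor requires either wrapping all other exceptions using the MPU handler (a.k.a. HardFault), some more ordinary form of trampoline code, or adding explicit support to every interrupt handler.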
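
The last of those options (explicit support in every interrupt handler) could be sketched roughly as follows. This is an illustration under stated assumptions, not the m0FaultDispatch approach: sw_ldrex, sw_strex, sw_clrex and the MON_IRQ_OFF/MON_IRQ_ON macros are invented names, the inline assembly assumes GCC on ARMv6-M (no-ops on a host), and sw_strex assumes it is called with interrupts enabled.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_6M__)
#define MON_IRQ_OFF() __asm__ __volatile__("cpsid i" ::: "memory")
#define MON_IRQ_ON()  __asm__ __volatile__("cpsie i" ::: "memory")
#else
/* Host build (assumption): nothing can preempt us, so these are no-ops. */
#define MON_IRQ_OFF() ((void)0)
#define MON_IRQ_ON()  ((void)0)
#endif

/* Software "exclusive monitor": 1 while a reservation is held. */
static volatile int sw_monitor_armed;

/* Arm the monitor first, then load, so an interrupt that lands between the
 * two steps clears the flag and forces the later store to fail. */
static inline uint32_t sw_ldrex(volatile uint32_t *addr) {
  sw_monitor_armed = 1;
  return *addr;
}

/* Every interrupt handler (or a common trampoline) must call this on entry. */
static inline void sw_clrex(void) {
  sw_monitor_armed = 0;
}

/* Returns 0 if the store happened, 1 if the reservation was lost. Interrupts
 * are masked only for the check-and-store itself. */
static inline int sw_strex(uint32_t value, volatile uint32_t *addr) {
  int failed = 1;

  MON_IRQ_OFF();
  if (sw_monitor_armed) {
    *addr = value;
    sw_monitor_armed = 0;
    failed = 0;
  }
  MON_IRQ_ON();
  return failed;
}
```

Unlike the variant that masks interrupts from the load onwards, this keeps the retry semantics of the M3 version, at the price of touching every handler.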

骄兵必败 2024-11-09 20:57:52

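STREX/LDREX are for multi-core processors accessing shared items in memory that is shared across the cores. ARM did an unusually bad job of documenting this; you have to read between the lines in the AMBA/AXI specs and the ARM and TRM documents to figure it out.

How it works: if you have a core that supports STREX/LDREX and a memory controller that supports exclusive access, then when the memory controller sees the pair of exclusive operations with no other core accessing that memory in between, it returns EXOKAY rather than OKAY. The ARM documentation tells chip designers that on a uniprocessor (multi-core feature not implemented) you don't have to support EXOKAY and can just return OKAY. From the software's point of view that breaks the LDREX/STREX pair for any access that reaches that logic (the software spins in an infinite loop because it never gets success back). The L1 cache does support it, though, so it feels like it works.

For uniprocessors, and for cases where you are not accessing memory shared across cores, use SWP.

The -m0 supports neither ldrex/strex nor swp, but what do those basically get you? They simply give you an access that cannot be interfered with by another access of your own. To keep from stepping on yourself, just disable interrupts for the duration, which is how we have done atomic accesses since the dark ages. If you want protection from both yourself and a peripheral, and you have a peripheral that can interfere, there is no way around that, and even a swap may not help.

So just disable interrupts around the critical section.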
