LWARX 和 STWCX 的 x86 等效项
我正在寻找 LWARX 和 STWCX 的等效项(如 PowerPC 处理器上的那样)或在 x86 平台上实现类似功能的方法。 另外,哪里是了解此类内容的最佳地点(即有关无锁/无等待编程的好文章/网站/论坛)。
编辑
我想我可能需要提供更多细节,因为假设我只是在寻找 CAS(比较和交换)操作。 我想做的是实现一个带有智能指针的无锁引用计数系统,可以由多个线程访问和更改。 我基本上需要一种在 x86 处理器上实现以下功能的方法。
int* IncrementAndRetrieve(int **ptr) { int val; int *pval; do { // fetch the pointer to the value pval = *ptr; // if its NULL, then just return NULL, the smart pointer // will then become NULL as well if(pval == NULL) return NULL; // Grab the reference count val = lwarx(pval); // make sure the pointer we grabbed the value from // is still the same one referred to by 'ptr' if(pval != *ptr) continue; // Increment the reference count via 'stwcx' if any other threads // have done anything that could potentially break then it should // fail and try again } while(!stwcx(pval, val + 1)); return pval; }
我确实需要一些能够相当准确地模仿 LWARX 和 STWCX 的东西来实现这一目标(我无法找到一种方法来使用迄今为止为 x86 找到的 CompareExchange、交换或添加功能来实现此目的)。
谢谢
I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr) { int val; int *pval; do { // fetch the pointer to the value pval = *ptr; // if its NULL, then just return NULL, the smart pointer // will then become NULL as well if(pval == NULL) return NULL; // Grab the reference count val = lwarx(pval); // make sure the pointer we grabbed the value from // is still the same one referred to by 'ptr' if(pval != *ptr) continue; // Increment the reference count via 'stwcx' if any other threads // have done anything that could potentially break then it should // fail and try again } while(!stwcx(pval, val + 1)); return pval; }
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
正如 Michael 提到的,您可能正在寻找的是 cmpxchg 指令。
需要指出的是,完成此操作的 PPC 方法称为 加载链接/ Store Conditional (LL/SC),而 x86 架构则使用 比较和交换(CAS)。 LL/SC 的语义比 CAS 更强,因为对条件地址处的值的任何更改都将导致存储失败,即使其他更改将该值替换为与负载条件相同的值。 另一方面,CAS 在这种情况下会成功。 这称为 ABA 问题(有关更多信息,请参阅 CAS 链接)。
如果您需要 x86 架构上更强的语义,可以使用 x86 双宽比较和交换 (DWCAS) 指令
cmpxchg8b
或cmpxchg16b
来近似它x86_64。 这允许您一次原子地交换两个连续的“自然大小”单词,而不仅仅是通常的单词。 基本思想是两个词之一包含感兴趣的值,另一个包含始终递增的“突变计数”。 尽管这在技术上并不能消除问题,但突变计数器在尝试之间回绕的可能性非常低,因此对于大多数用途而言,它是一个合理的替代品。As Michael mentioned, what you're probably looking for is the
cmpxchg
instruction.It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction
cmpxchg8b
, orcmpxchg16b
under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.x86 并不像 PPC 那样直接支持“乐观并发”——相反,x86 对并发的支持基于“锁前缀”,请参阅 此处。 (一些所谓的“原子”指令,例如 XCHG,实际上是通过内在地断言 LOCK 前缀来获得其原子性,无论汇编代码程序员是否实际对其进行了编码)。 从外交角度来说,这并不完全是“防弹”的(事实上,我想说,这很容易发生事故;-)。
x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
您可能正在寻找 cmpxchg 系列指令。
您需要在这些指令之前加上锁定指令才能获得等效的行为。
看看这里 快速了解可用内容。
您可能会得到与此类似的结果:
您应该阅读 本文...
编辑
针对更新后的问题,您是否想做类似的事情提升shared_ptr? 如果是这样,请查看该代码和该目录中的文件 - 它们肯定会帮助您入门。
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock instruction to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
如果您使用 64 位并限制自己使用 1tb 堆,则可以将计数器打包到 24 个未使用的最高位中。 如果您有字对齐指针,则底部 5 位也可用。
if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.
不知道 LWARX 和 STWCX 是否会使整个缓存线无效,CAS 和 DCAS 会这样做。 这意味着,除非您愿意丢弃大量内存(每个独立的“可锁定”指针 64 字节),否则如果您真的将软件推向压力,您将不会看到太大的改进。 到目前为止,我看到的最好的结果是人们有意识地放弃 64b,围绕它规划他们的结构(打包不会成为争用主题的内容),使所有内容都在 64b 边界上对齐,并使用显式的读写数据屏障。 缓存行失效可能会花费大约 20 到 100 个周期,这使其成为比锁避免更大的实际性能问题。
此外,您还必须规划不同的内存分配策略来管理受控泄漏(如果您可以将代码划分为逻辑“请求处理” - 一个请求“泄漏”,然后在最后释放其所有内存块)或数据分配管理这样,一个处于争用状态的结构永远不会接收到由同一结构/集合的元素释放的内存(以防止 ABA)。 其中一些可能非常违反直觉,但要么就是这样,要么为 GC 付出了代价。
Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.
Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.
您尝试做的事情不会按照您期望的方式进行。 上面实现的内容可以使用 InterlockedIncrement 函数(Win32 函数;程序集:XADD)来完成。
您的代码没有执行您认为的操作的原因是另一个线程仍然可以更改第二次读取 *ptr 和 stwcx 之间的值,而不会使 stwcx 无效。
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.