Ordering and barriers: what is the x86 equivalent of PowerPC's lwsync?

Posted on 2024-09-14 16:33:56


My code is simple, as below. I found rmb and wmb for read and write barriers, but no general one. lwsync is available on PowerPC, but what is the replacement on x86? Thanks in advance.

#define barrier() __asm__ volatile ("lwsync")
...
    lock();
    if (!pInst)
    {
        T* temp = new T;
        barrier();
        pInst = temp;
    }
    unlock();
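One way to make barrier() portable is to select the instruction per architecture at compile time. The sketch below is illustrative (the publish() helper and variable names are not from the original code); it assumes GCC-style inline asm and uses mfence as the x86 stand-in for lwsync:

```c
/* Sketch: an architecture-conditional barrier() for the pattern above.
   Assumes GCC-style inline asm; names below are illustrative. */
#if defined(__x86_64__) || defined(__i386__)
#define barrier() __asm__ __volatile__ ("mfence" ::: "memory")
#elif defined(__powerpc__)
#define barrier() __asm__ __volatile__ ("lwsync" ::: "memory")
#else
#define barrier() __sync_synchronize()  /* GCC full-barrier builtin */
#endif

int payload;
int published;

/* Publish the payload before the flag, as in the pInst = temp pattern. */
void publish(int v) {
    payload = v;
    barrier();        /* order the payload store before the flag store */
    published = 1;
}
```

Note that on x86 the store-store ordering this pattern needs is already guaranteed by the hardware, so a compiler-only barrier (`__asm__ volatile ("" ::: "memory")`) would arguably suffice for the publish; mfence is shown as the general-purpose full barrier.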


Comments (3)

阳光的暖冬 2024-09-21 16:33:56


rmb() and wmb() are Linux kernel macros. There is also mb() for a full barrier.

The corresponding x86 instructions are lfence, sfence, and mfence, IIRC.
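A hedged sketch of how such macros might be defined with GCC inline asm follows (illustrative, not the kernel's actual definitions; a builtin fallback is included so the sketch also compiles off x86):

```c
#if defined(__x86_64__) || defined(__i386__)
#define mb()  __asm__ __volatile__ ("mfence" ::: "memory") /* full barrier        */
#define rmb() __asm__ __volatile__ ("lfence" ::: "memory") /* load/read barrier   */
#define wmb() __asm__ __volatile__ ("sfence" ::: "memory") /* store/write barrier */
#else
/* Fallback: a GCC builtin full barrier stands in for all three. */
#define mb()  __sync_synchronize()
#define rmb() __sync_synchronize()
#define wmb() __sync_synchronize()
#endif

/* Tiny demo exercising all three barriers between increments. */
int barrier_demo(void) {
    int n = 0;
    n++; mb();
    n++; rmb();
    n++; wmb();
    return n;
}
```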

家住魔仙堡 2024-09-21 16:33:56


There's a particular file in the Cilk runtime you might find interesting, cilk-sysdep.h, which contains the system-specific mappings for memory barriers. I've extracted a small section relevant to your question about x86 (i.e. i386):

    file:-- cilk-sysdep.h (the numbers on the LHS are actually line numbers)

    252      * We use an xchg instruction to serialize memory accesses, as can
    253      * be done according to the Intel Architecture Software Developer's
    254      * Manual, Volume 3: System Programming Guide
    255      * (http://www.intel.com/design/pro/manuals/243192.htm), page 7-6,
    256      * "For the P6 family processors, locked operations serialize all
    257      * outstanding load and store operations (that is, wait for them to
    258      * complete)."  The xchg instruction is a locked operation by
    259      * default.  Note that the recommended memory barrier is the cpuid
    260      * instruction, which is really slow (~70 cycles).  In contrast,
    261      * xchg is only about 23 cycles (plus a few per write buffer
    262      * entry?).  Still slow, but the best I can find.  -KHR 
    263      *
    264      * Bradley also timed "mfence", and on a Pentium IV xchgl is still quite a bit faster
    265      *   mfence appears to take about 125 ns on a 2.5GHZ P4
    266      *   xchgl  appears to take about  90 ns on a 2.5GHZ P4
    267      * However on an opteron, the performance of mfence and xchgl are both *MUCH MUCH   BETTER*.
    268      *   mfence takes 8ns on a 1.5GHZ AMD64 (maybe this is an 801)
    269      *   sfence takes 5ns
    270      *   lfence takes 3ns
    271      *   xchgl  takes 14ns
    272      * see mfence-benchmark.c
    273      */
    274     int x=0, y;
    275     __asm__ volatile ("xchgl %0,%1" :"=r" (x) :"m" (y), "0" (x) :"memory");
    276    }

What I liked about this is the fact that xchgl appears to be faster :) though you should really implement both and measure.
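The excerpt's xchgl idiom can be wrapped into a reusable full barrier roughly as follows (a sketch; xchg_barrier is an illustrative name, and the implicitly lock-prefixed exchange serializes loads and stores as the quoted comment describes):

```c
/* Full memory barrier built on xchg, per the cilk-sysdep.h idiom. */
static inline void xchg_barrier(void) {
#if defined(__x86_64__) || defined(__i386__)
    int x = 0, y = 0;
    /* xchg with a memory operand carries an implicit LOCK prefix,
       which serializes outstanding loads and stores. */
    __asm__ __volatile__ ("xchgl %0,%1" : "=r"(x) : "m"(y), "0"(x) : "memory");
#else
    __sync_synchronize();  /* fallback so the sketch compiles off x86 */
#endif
}

/* Tiny demo: a computation split across the barrier. */
int demo(void) {
    int a = 1;
    xchg_barrier();
    return a + 1;
}
```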

小红帽 2024-09-21 16:33:56


You don't say exactly what lock and unlock are in this code. I'm presuming they are mutex operations. On PowerPC a mutex-acquire function will use an isync (without which the hardware may evaluate your if (!pInst) before the lock()), and the unlock() will have an lwsync (or sync if your mutex implementation is ancient).

So, presuming all your accesses (both read and write) to pInst are guarded by your lock and unlock methods, your barrier use is redundant. The unlock will have a sufficient barrier to ensure that the pInst store is visible before the unlock operation completes (so that it will be visible after any subsequent lock acquire, presuming the same lock is used).

On x86 and x64 your lock() will use some form of LOCK prefixed instruction, which automatically has bidirectional fencing behaviour.

Your unlock on x86 and x64 only has to be a store instruction (unless you use some of the special string instructions within your CS, in which case you'll need an SFENCE).

The manual:

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

has good information on all the fences as well as the effects of the LOCK prefix (and when that is implied).

ps. In your unlock code you'll also have to have something that enforces compiler ordering (so if it is just a store of zero, you'll also need something like the GCC-style __asm__ volatile ("" ::: "memory")).
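A hedged sketch of such an x86 unlock (lock_word and my_unlock are illustrative names, assuming GCC): the release is just a plain store, preceded by a compiler-only barrier so the compiler cannot sink earlier critical-section stores past it:

```c
static volatile int lock_word = 1;  /* 1 = held, 0 = free (illustrative) */
int shared_data;

static inline void my_unlock(void) {
    /* Compiler barrier only: no fence instruction is needed on x86 here,
       because stores are not reordered with earlier stores. */
    __asm__ __volatile__ ("" ::: "memory");
    lock_word = 0;  /* plain store releases the lock */
}
```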
