CMPXCHG16B 正确吗？

发布于 2024-10-14 20:46:05 字数 1254 浏览 7 评论 0原文

尽管我不确定为什么，但这似乎并不完全正确。建议会很好，因为 CMPXCHG16B 的文档非常少（我没有任何英特尔手册...）

template<>
inline bool cas(volatile types::uint128_t *src, types::uint128_t cmp, types::uint128_t with)
{
    /*
    Description:
     The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers 
     with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, 
     and the RCX:RBX value is copied to the memory location. 
     Otherwise, the ZF flag is cleared, and the memory value is copied to RDX:RAX.
     */
    uint64_t * cmpP = (uint64_t*)&cmp;
    uint64_t * withP = (uint64_t*)&with;
    unsigned char result = 0;
    __asm__ __volatile__ (
    "LOCK; CMPXCHG16B %1\n\t"
    "SETZ %b0\n\t"
    : "=q"(result)  /* output */ 
    : "m"(*src), /* input */
      //what to compare against
      "rax"( ((uint64_t) (cmpP[1])) ), //lower bits
      "rdx"( ((uint64_t) (cmpP[0])) ),//upper bits
      //what to replace it with if it was equal
      "rbx"( ((uint64_t) (withP[1])) ), //lower bits
      "rcx"( ((uint64_t) (withP[0]) ) )//upper bits
    : "memory", "cc", "rax", "rdx", "rbx","rcx" /* clobbered items */
    );
    return result;
}

当运行一个示例时，我得到 0，而它应该是 1。有什么想法吗？

原文

This doesn't exactly seem to be right although I am unsure why.
Advice would be great as the documentation for CMPXCHG16B is pretty minimal (I don't own any intel manuals...)

template<>
inline bool cas(volatile types::uint128_t *src, types::uint128_t cmp, types::uint128_t with)
{
    /*
    Description:
     The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers 
     with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, 
     and the RCX:RBX value is copied to the memory location. 
     Otherwise, the ZF flag is cleared, and the memory value is copied to RDX:RAX.
     */
    uint64_t * cmpP = (uint64_t*)&cmp;
    uint64_t * withP = (uint64_t*)&with;
    unsigned char result = 0;
    __asm__ __volatile__ (
    "LOCK; CMPXCHG16B %1\n\t"
    "SETZ %b0\n\t"
    : "=q"(result)  /* output */ 
    : "m"(*src), /* input */
      //what to compare against
      "rax"( ((uint64_t) (cmpP[1])) ), //lower bits
      "rdx"( ((uint64_t) (cmpP[0])) ),//upper bits
      //what to replace it with if it was equal
      "rbx"( ((uint64_t) (withP[1])) ), //lower bits
      "rcx"( ((uint64_t) (withP[0]) ) )//upper bits
    : "memory", "cc", "rax", "rdx", "rbx","rcx" /* clobbered items */
    );
    return result;
}

When running with an example I am getting 0 when it should be 1. Any ideas?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

猫性小仙女 2024-10-21 20:46:05

注意到一些问题，

（1）主要问题是约束，“rax”并不像看起来那样，而是第一个字符“r”让 gcc 使用任何寄存器。

(2) 不确定您的存储类型如何::uint128_t，但假设 x86 平台使用标准小端，则高位和低位双字也会交换。

(3) 获取某个地址并将其转换为其他地址可能会违反别名规则。取决于您的 types::uint128_t 的定义方式，这是否是一个问题（如果它是两个 uint64_t 的结构则很好）。假设不违反别名规则，带有 -O2 的 GCC 将进行优化。

(4) *src 实际上应该被标记为输出，而不是指定内存破坏器。但这实际上更多的是性能问题而不是正确性问题。类似地，rbx 和 rcx 不需要指定为 clobbered。

这是一个有效的版本，

#include <stdint.h>

namespace types
{
    // alternative: union with  unsigned __int128
    struct uint128_t
    {
        uint64_t lo;
        uint64_t hi;
    }
    __attribute__ (( __aligned__( 16 ) ));
}

template< class T > inline bool cas( volatile T * src, T cmp, T with );

template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp, types::uint128_t with )
{
    // cmp can be by reference so the caller's value is updated on failure.

    // suggestion: use __sync_bool_compare_and_swap and compile with -mcx16 instead of inline asm
    bool result;
    __asm__ __volatile__
    (
        "lock cmpxchg16b %1\n\t"
        "setz %0"       // on gcc6 and later, use a flag output constraint instead
        : "=q" ( result )
        , "+m" ( *src )
        , "+d" ( cmp.hi )
        , "+a" ( cmp.lo )
        : "c" ( with.hi )
        , "b" ( with.lo )
        : "cc", "memory" // compile-time memory barrier.  Omit if you want memory_order_relaxed compile-time ordering.
    );
    return result;
}

int main()
{
    using namespace types;
    uint128_t test = { 0xdecafbad, 0xfeedbeef };
    uint128_t cmp = test;
    uint128_t with = { 0x55555555, 0xaaaaaaaa };
    return ! cas( & test, cmp, with );
}

Noticed a few issues,

(1) The main problem is the constraints, "rax" doesn't do what it looks like, rather the first character "r" lets gcc use any register.

(2) Not sure how your storing types::uint128_t, but assuming the standard little endian for x86 platforms, then the high and low dwords are also swapped around.

(3) Taking the address of something and casting it to something else can break aliasing rules. Depends on how your types::uint128_t is defined as to wether or not this is an issue (fine if it is a struct of two uint64_t's). GCC with -O2 will optimize assuming aliasing rules are not violated.

(4) *src should really be marked as an output, rather than specifying memory clobber. but this is really more of a performance rather than correctness issue. similarly rbx and rcx do not need to specified as clobbered.

Here is a a version that works,

#include <stdint.h>

namespace types
{
    // alternative: union with  unsigned __int128
    struct uint128_t
    {
        uint64_t lo;
        uint64_t hi;
    }
    __attribute__ (( __aligned__( 16 ) ));
}

template< class T > inline bool cas( volatile T * src, T cmp, T with );

template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp, types::uint128_t with )
{
    // cmp can be by reference so the caller's value is updated on failure.

    // suggestion: use __sync_bool_compare_and_swap and compile with -mcx16 instead of inline asm
    bool result;
    __asm__ __volatile__
    (
        "lock cmpxchg16b %1\n\t"
        "setz %0"       // on gcc6 and later, use a flag output constraint instead
        : "=q" ( result )
        , "+m" ( *src )
        , "+d" ( cmp.hi )
        , "+a" ( cmp.lo )
        : "c" ( with.hi )
        , "b" ( with.lo )
        : "cc", "memory" // compile-time memory barrier.  Omit if you want memory_order_relaxed compile-time ordering.
    );
    return result;
}

int main()
{
    using namespace types;
    uint128_t test = { 0xdecafbad, 0xfeedbeef };
    uint128_t cmp = test;
    uint128_t with = { 0x55555555, 0xaaaaaaaa };
    return ! cas( & test, cmp, with );
}

回复收藏 0 原文

一向肩并 2024-10-21 20:46:05

所有英特尔文档均免费提供：英特尔® 64 和 IA-32 架构软件开发人员手册。

回复收藏 0 原文

茶花眉 2024-10-21 20:46:05

值得注意的是，如果您使用 GCC，则不需要使用内联汇编来获取此指令。您可以使用 __sync 函数之一，例如：

template<>
inline bool cas(volatile types::uint128_t *src,
                types::uint128_t cmp,
                types::uint128_t with)
{
    return __sync_bool_compare_and_swap(src, cmp, with);
}

Microsoft 对于 VC++ 有类似的函数：

__int64 exchhi = __int64(with >> 64);
__int64 exchlo = (__int64)(with);

return _InterlockedCompareExchange128(a, exchhi, exchlo, &cmp) != 0;

It's good to note that if you're using GCC, you don't need to use inline asm to get at this instruction. You can use one of the __sync functions, like:

template<>
inline bool cas(volatile types::uint128_t *src,
                types::uint128_t cmp,
                types::uint128_t with)
{
    return __sync_bool_compare_and_swap(src, cmp, with);
}

Microsoft has a similar function for VC++:

__int64 exchhi = __int64(with >> 64);
__int64 exchlo = (__int64)(with);

return _InterlockedCompareExchange128(a, exchhi, exchlo, &cmp) != 0;

回复收藏 0 原文

爱人如己 2024-10-21 20:46:05

以下是比较的一些替代方案：

内联汇编，例如 @luke h 的答案.
__sync_bool_compare_and_swap()：GNU 扩展< /a>，仅限 gcc/clang/ICC，已弃用，编译器将至少使用 -mcx16
发出 CMPXCHG16B 指令的伪函数
atomic_compare_exchange_weak() / strong：C11 伪函数，执行 C++11 中 atomic 的操作。对于 GNU，这不会在 gcc 7 及更高版本中发出 CMPXCHG16B，而是调用 libatomic（因此必须链接到）。动态链接的 libatomic 将根据 CPU 的功能来决定使用哪个版本的函数，并且在 CPU 具有 CMPXCHG16B 功能的机器上，它将使用该版本。< /p>
显然 clang 将仍内联 CMPXCHG16B 对于 atomic_compare_exchange_weak() 或 strong

我还没有尝试过机器语言，但看看(2) 的反汇编，它看起来很完美，我不知道 (1) 如何能打败它。（我对 x86 知之甚少，但编写了很多 6502 程序。）此外，如果可以避免的话，有充分的建议永远不要使用汇编，并且至少可以使用 gcc/clang 来避免它。所以我可以把（1）从清单上划掉。

这是 gcc 版本 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC) 中 (2) 的代码：

Thread 2 "mybinary" hit Breakpoint 1, MyFunc() at myfile.c:586
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407262 <MyFunc+904>:       49 89 c2        mov    %rax,%r10
   0x0000000000407265 <MyFunc+907>:       49 89 d3        mov    %rdx,%r11
(gdb) n
587                                                  was, new ) ) {
=> 0x0000000000407268 <MyFunc+910>:       48 8b 45 a0     mov    -0x60(%rbp),%rax
   0x000000000040726c <MyFunc+914>:       48 8b 55 a8     mov    -0x58(%rbp),%rdx
(gdb) n
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407270 <MyFunc+918>:       48 c7 c6 00 d3 42 00    mov    $0x42d300,%rsi
   0x0000000000407277 <MyFunc+925>:       4c 89 d3        mov    %r10,%rbx
   0x000000000040727a <MyFunc+928>:       4c 89 d9        mov    %r11,%rcx
   0x000000000040727d <MyFunc+931>:       f0 48 0f c7 8e 70 04 00 00      lock cmpxchg16b 0x470(%rsi)
   0x0000000000407286 <MyFunc+940>:       0f 94 c0        sete   %al

然后在现实世界的算法上对 (2) 和 (3) 进行锤子测试，我没有看到真正的性能差异。即使在理论上，(3) 也只具有一个额外函数调用的开销以及 libatomic 包装函数中的一些工作，包括关于 CAS 是否成功的分支。

（使用惰性动态链接，第一次调用 libatomic 函数实际上会运行一个 init 函数，该函数使用 CPUID 来检查您的 CPU 是否有 cmpxchg16b。然后它将更新 PLT 存根的 GOT 函数指针跳过，因此以后的调用将直接转到使用 lock cmpxchg16b 的 libat_compare_exchange_16_i1 ，名称中的 i1 后缀来自 GCC 用于函数多版本控制的 ifunc 机制；如果您在不支持 cmpxchg16b 支持的 CPU 上运行它，它将把共享库函数解析为使用锁定的版本。）

在我的真实锤子测试中，该函数调用开销会损失在无锁机制所保护的功能所占用的 CPU 量中。因此，我认为没有理由使用特定于编译器的 __sync 函数，并且在启动时不推荐使用这些函数。

下面是在我的 Fedora 31 上单步执行程序集时为每个 .compare_exchange_weak() 调用的 libatomic 包装器的程序集。如果使用 -fno-plt 编译，callq *__atomic_compare_exchange_16@GOTPCREL(%rip) 将内联到调用者中，避免 PLT 并在程序启动时而不是在第一次调用时尽早运行 CPU 检测。

Thread 2 "tsquark" hit Breakpoint 2, 0x0000000000403210 in 
__atomic_compare_exchange_16@plt ()
=> 0x0000000000403210 <__atomic_compare_exchange_16@plt+0>:     ff 25 f2 8e 02 00       jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
(gdb) disas
Dump of assembler code for function __atomic_compare_exchange_16@plt:
=> 0x0000000000403210 <+0>:     jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
   0x0000000000403216 <+6>:     pushq  $0x1e
   0x000000000040321b <+11>:    jmpq   0x403020
End of assembler dump.
(gdb) s
Single stepping until exit from function __atomic_compare_exchange_16@plt,
...

0x00007ffff7fab250 in libat_compare_exchange_16_i1 () from /lib64/libatomic.so.1
=> 0x00007ffff7fab250 <libat_compare_exchange_16_i1+0>: f3 0f 1e fa     endbr64
(gdb) disas
Dump of assembler code for function libat_compare_exchange_16_i1:
=> 0x00007ffff7fab250 <+0>:     endbr64
   0x00007ffff7fab254 <+4>:     mov    (%rsi),%r8
   0x00007ffff7fab257 <+7>:     mov    0x8(%rsi),%r9
   0x00007ffff7fab25b <+11>:    push   %rbx
   0x00007ffff7fab25c <+12>:    mov    %rdx,%rbx
   0x00007ffff7fab25f <+15>:    mov    %r8,%rax
   0x00007ffff7fab262 <+18>:    mov    %r9,%rdx
   0x00007ffff7fab265 <+21>:    lock cmpxchg16b (%rdi)
   0x00007ffff7fab26a <+26>:    mov    %r9,%rcx
   0x00007ffff7fab26d <+29>:    xor    %rax,%r8
   0x00007ffff7fab270 <+32>:    mov    $0x1,%r9d
   0x00007ffff7fab276 <+38>:    xor    %rdx,%rcx
   0x00007ffff7fab279 <+41>:    or     %r8,%rcx
   0x00007ffff7fab27c <+44>:    je     0x7ffff7fab288 <libat_compare_exchange_16_i1+56>
   0x00007ffff7fab27e <+46>:    mov    %rax,(%rsi)
   0x00007ffff7fab281 <+49>:    xor    %r9d,%r9d
   0x00007ffff7fab284 <+52>:    mov    %rdx,0x8(%rsi)
   0x00007ffff7fab288 <+56>:    mov    %r9d,%eax
   0x00007ffff7fab28b <+59>:    pop    %rbx
   0x00007ffff7fab28c <+60>:    retq
End of assembler dump.

我发现使用 (2) 的唯一好处是，如果您的计算机没有附带 libatomic（旧版 Red Hat 的情况也是如此），并且您无法请求系统管理员提供此信息或者不想指望他们安装正确的产品。我亲自下载了一个源代码并错误地构建了它，因此 16 字节交换最终使用了互斥锁：灾难。

我没试过（4）。或者更确切地说，我开始但是 gcc 在没有评论的情况下通过的代码有太多警告/错误，以至于我无法在预算时间内编译它。

请注意，虽然选项 2、3 和 4 看起来像是相同的代码或几乎相同的代码应该可以工作，但事实上，所有三个选项都有显着不同的检查和警告，即使您的三个选项之一编译良好并且在 -Wall，如果您尝试其他选项之一，您可能会收到更多警告或错误。 __sync* 伪函数没有详细记录。（事实上，文档只提到了 1/2/4/8 字节，而不是它们适用于 16 字节。同时，它们的工作方式“有点”像函数模板，但你看不到模板，而且它们似乎对是否第一个和第二个参数类型实际上是相同的类型，但 atomic_* 却不是。）简而言之，这并不是您可能猜想的比较 2、3 和 4 的 3 分钟工作。

Here are some alternatives compared:

Inline assembly, such as @luke h's answer.
__sync_bool_compare_and_swap(): a GNU extension, gcc/clang/ICC-only, deprecated, pseudo-function that the compiler will issue a CMPXCHG16B instruction for at least with -mcx16
atomic_compare_exchange_weak() / strong: the C11 pseudo-function that does what atomic<> does in C++11. For GNU, this will NOT issue a CMPXCHG16B in gcc 7 and later, but instead calls libatomic (which must therefore be linked to). Dynamically linked libatomic will decide what version of functions to use based on what the CPU is capable of, and on machines where the CPU is capable of CMPXCHG16B it will use that.
apparently clang will still inline CMPXCHG16B for atomic_compare_exchange_weak() or strong

I haven't tried the machine language, but looking at the disassembly of (2), it looks perfect and I don't see how (1) could beat it. (I have little knowledge of x86 but programmed a lot of 6502.) Further, there's ample advice never to use assembly if you can avoid it, and it can be avoided at least with gcc/clang. So I can cross (1) off the list.

Here's the code for (2) in gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC):

Thread 2 "mybinary" hit Breakpoint 1, MyFunc() at myfile.c:586
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407262 <MyFunc+904>:       49 89 c2        mov    %rax,%r10
   0x0000000000407265 <MyFunc+907>:       49 89 d3        mov    %rdx,%r11
(gdb) n
587                                                  was, new ) ) {
=> 0x0000000000407268 <MyFunc+910>:       48 8b 45 a0     mov    -0x60(%rbp),%rax
   0x000000000040726c <MyFunc+914>:       48 8b 55 a8     mov    -0x58(%rbp),%rdx
(gdb) n
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407270 <MyFunc+918>:       48 c7 c6 00 d3 42 00    mov    $0x42d300,%rsi
   0x0000000000407277 <MyFunc+925>:       4c 89 d3        mov    %r10,%rbx
   0x000000000040727a <MyFunc+928>:       4c 89 d9        mov    %r11,%rcx
   0x000000000040727d <MyFunc+931>:       f0 48 0f c7 8e 70 04 00 00      lock cmpxchg16b 0x470(%rsi)
   0x0000000000407286 <MyFunc+940>:       0f 94 c0        sete   %al

Then doing hammer tests of (2) and (3) on real-world algos, I am seeing no real performance difference. Even in theory, (3) only has the overhead of one additional function call and some work in the libatomic wrapper function, including a branch on whether the CAS succeeded or not.

(With lazy dynamic linking, the first call into the libatomic function will actually run an init function that uses CPUID to check if your CPU has cmpxchg16b. Then it will update the GOT function-pointer that the PLT stub jumps through, so future calls will go straight to libat_compare_exchange_16_i1 that uses lock cmpxchg16b. The i1 suffix in the name is from GCC's ifunc mechanism for function multi-versioning; if you ran it on a CPU without cmpxchg16b support, it would resolve the shared library function to a version that used locking.)

In my real-world-ish hammer tests, that function call overhead is lost in the amount of CPU taken by the functionality the lock-free mechanism is protecting. Therefore, I don't see a reason to use the __sync, functions, that are compiler-specific and deprecated to boot.

Here's the assembly for the libatomic wrapper that gets called for each .compare_exchange_weak(), from single-stepping through the assembly on my Fedora 31. If compiled with -fno-plt, a callq *__atomic_compare_exchange_16@GOTPCREL(%rip) will get inlined into the caller, avoiding the PLT and running the CPU-detection early, at program startup time instead of on the first call.

Thread 2 "tsquark" hit Breakpoint 2, 0x0000000000403210 in 
__atomic_compare_exchange_16@plt ()
=> 0x0000000000403210 <__atomic_compare_exchange_16@plt+0>:     ff 25 f2 8e 02 00       jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
(gdb) disas
Dump of assembler code for function __atomic_compare_exchange_16@plt:
=> 0x0000000000403210 <+0>:     jmpq   *0x28ef2(%rip)        # 0x42c108 <[email protected]>
   0x0000000000403216 <+6>:     pushq  $0x1e
   0x000000000040321b <+11>:    jmpq   0x403020
End of assembler dump.
(gdb) s
Single stepping until exit from function __atomic_compare_exchange_16@plt,
...

0x00007ffff7fab250 in libat_compare_exchange_16_i1 () from /lib64/libatomic.so.1
=> 0x00007ffff7fab250 <libat_compare_exchange_16_i1+0>: f3 0f 1e fa     endbr64
(gdb) disas
Dump of assembler code for function libat_compare_exchange_16_i1:
=> 0x00007ffff7fab250 <+0>:     endbr64
   0x00007ffff7fab254 <+4>:     mov    (%rsi),%r8
   0x00007ffff7fab257 <+7>:     mov    0x8(%rsi),%r9
   0x00007ffff7fab25b <+11>:    push   %rbx
   0x00007ffff7fab25c <+12>:    mov    %rdx,%rbx
   0x00007ffff7fab25f <+15>:    mov    %r8,%rax
   0x00007ffff7fab262 <+18>:    mov    %r9,%rdx
   0x00007ffff7fab265 <+21>:    lock cmpxchg16b (%rdi)
   0x00007ffff7fab26a <+26>:    mov    %r9,%rcx
   0x00007ffff7fab26d <+29>:    xor    %rax,%r8
   0x00007ffff7fab270 <+32>:    mov    $0x1,%r9d
   0x00007ffff7fab276 <+38>:    xor    %rdx,%rcx
   0x00007ffff7fab279 <+41>:    or     %r8,%rcx
   0x00007ffff7fab27c <+44>:    je     0x7ffff7fab288 <libat_compare_exchange_16_i1+56>
   0x00007ffff7fab27e <+46>:    mov    %rax,(%rsi)
   0x00007ffff7fab281 <+49>:    xor    %r9d,%r9d
   0x00007ffff7fab284 <+52>:    mov    %rdx,0x8(%rsi)
   0x00007ffff7fab288 <+56>:    mov    %r9d,%eax
   0x00007ffff7fab28b <+59>:    pop    %rbx
   0x00007ffff7fab28c <+60>:    retq
End of assembler dump.

The only benefit I've found to using (2) is if your machine didn't ship with a libatomic (true of older Red Hat's) and you don't have the ability to request sysadmins provide this or don't want to count on them installing the right one. I personally downloaded one in source code and built it incorrectly, so that 16-byte swaps ended up using a mutex: disaster.

I haven't tried (4). Or rather I started to but so many warnings/errors on code that gcc passed without comment that I wasn't able to get it compiled in the budgeted time.

Note that while options 2, 3, and 4 seem like the same code or nearly the same code should work, in fact all three have substantially different checks and warnings and even if you have one of the three compiling fine and without warning in -Wall, you may get a lot more warnings or errors if you try one of the other options. The __sync* pseudo-functions aren't well-documented. (In fact the documentation only mentions 1/2/4/8 bytes, not that they work for 16 bytes. Meanwhile, they work "kind of" like function templates but you don't see the template and they seem finicky about whether the first and second arg types actually are the same type in a way that atomic_* aren't.) In short it wasn't the 3 minute job to compare 2, 3, and 4 that you might guess.

回复收藏 0 原文

狼性发作 2024-10-21 20:46:05

我将其编译为 g++，并进行了细微的更改（删除 cmpxchg16b 指令中的 oword ptr）。 ~~但它似乎没有按要求覆盖内存，尽管我可能是错的。~~ [查看更新] 下面给出代码，然后是输出。

#include <stdint.h>
#include <stdio.h>

namespace types
{
  struct uint128_t
  {
    uint64_t lo;
    uint64_t hi;
  }
  __attribute__ (( __aligned__( 16 ) ));
 }

 template< class T > inline bool cas( volatile T * src, T cmp, T with );

 template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp,  types::uint128_t with )
 {
   bool result;
   __asm__ __volatile__
   (
    "lock cmpxchg16b %1\n\t"
    "setz %0"
    : "=q" ( result )
    , "+m" ( *src )
    , "+d" ( cmp.hi )
    , "+a" ( cmp.lo )
    : "c" ( with.hi )
    , "b" ( with.lo )
    : "cc"
   );
   return result;
}

void print_dlong(char* address) {

  char* byte_array = address;
  int i = 0;
  while (i < 4) {
     printf("%02X",(int)byte_array[i]);
     i++;
  }

  printf("\n");
  printf("\n");

}

int main()
{
  using namespace types;
  uint128_t test = { 0xdecafbad, 0xfeedbeef };
  uint128_t cmp = test;
  uint128_t with = { 0x55555555, 0xaaaaaaaa };

  print_dlong((char*)&test);
  bool result = cas( & test, cmp, with );
  print_dlong((char*)&test);

  return result;
}

输出

FFFFFFADFFFFFFFBFFFFFFCAFFFFFFDE


55555555

不确定输出对我来说是否有意义。我期望之前的值是这样的
00000000decafbad00000feedbeef 根据结构定义。但字节似乎是分散在单词中的。这是由于一致的指令吗？顺便说一句，CAS 操作似乎返回了正确的返回值。有什么帮助破译这个吗？

更新：我刚刚使用 gdb 进行了一些内存检查调试。那里显示了正确的值。所以我想这一定是我的 print_dlong 程序的问题。请随意纠正。我留下这个回复，因为它需要更正，因为这个更正后的版本将对带有打印结果的 cas 操作具有指导意义。

I got it compiling for g++ with a slight change (removing oword ptr in cmpxchg16b instruction). ~~But it doesn't seem to overwrite the memory as required though I may be wrong.~~ [See update] Code is given below followed by output.

#include <stdint.h>
#include <stdio.h>

namespace types
{
  struct uint128_t
  {
    uint64_t lo;
    uint64_t hi;
  }
  __attribute__ (( __aligned__( 16 ) ));
 }

 template< class T > inline bool cas( volatile T * src, T cmp, T with );

 template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp,  types::uint128_t with )
 {
   bool result;
   __asm__ __volatile__
   (
    "lock cmpxchg16b %1\n\t"
    "setz %0"
    : "=q" ( result )
    , "+m" ( *src )
    , "+d" ( cmp.hi )
    , "+a" ( cmp.lo )
    : "c" ( with.hi )
    , "b" ( with.lo )
    : "cc"
   );
   return result;
}

void print_dlong(char* address) {

  char* byte_array = address;
  int i = 0;
  while (i < 4) {
     printf("%02X",(int)byte_array[i]);
     i++;
  }

  printf("\n");
  printf("\n");

}

int main()
{
  using namespace types;
  uint128_t test = { 0xdecafbad, 0xfeedbeef };
  uint128_t cmp = test;
  uint128_t with = { 0x55555555, 0xaaaaaaaa };

  print_dlong((char*)&test);
  bool result = cas( & test, cmp, with );
  print_dlong((char*)&test);

  return result;
}

Output

FFFFFFADFFFFFFFBFFFFFFCAFFFFFFDE


55555555

Not sure the output makes sense to me. I was expecting the before value to be something like
00000000decafbad00000feedbeef according to struct definition. But bytes seem to be spread out within words. Is that due to aligned directive? Btw the CAS operation seem to return the correct return value though. Any help in deciphering this?

Update : I just did some debugging with memory inspection with gdb. There the correct values are shown there. So I guess this must be a problem with my print_dlong procedure. Feel free to correct it. I am leaving this reply as it is to be corrected, since a corrected version of this would be instructive of the cas operation with printed results.

回复收藏 0 原文

~没有更多了~