Does it make sense to use a relaxed load followed by a conditional fence if I don't always need acquire semantics?

Posted on 2025-02-06 20:35:44

Consider the following toy example, especially the result function:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

class Worker
{
    std::atomic_bool done = false;
    int value = 0;

    // Declared last so that done and value are initialized before the thread starts.
    std::thread th;

  public:
    Worker()
        : th([&]
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        value = 42;
        done.store(true, std::memory_order_release);
    }) {}

    int result() const
    {
        return done.load(std::memory_order_acquire) ? value : -1;
    }

    Worker(const Worker &) = delete;
    Worker &operator=(const Worker &) = delete;

    ~Worker()
    {
        th.join();
    }
};

int main()
{
    Worker w;
    while (true)
    {
        int r = w.result();
        if (r != -1)
        {
            std::cout << r << '\n';
            break;
        }
    }
}

I reckon that I need acquire semantics only if done.load() returns true, so I could rewrite it like this:

int result() const
{
    if (done.load(std::memory_order_relaxed))
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        return value;
    }
    else
    {
        return -1;
    }
}

It seems to be a legal thing to do, but I lack the experience to tell whether this change makes sense (that is, whether it is actually more optimized).

Which of the two forms should I prefer?

哥，最终变帅啦 2025-02-13 20:35:44

If most checks of done find it not-done, and this happens in a throughput-sensitive part of your program, then yes, this could make sense, even on ISAs where a separate barrier costs more. One possible use-case is an exit-now flag that also signals some data or a pointer a thread will want: you check often, but the great majority of the time you don't exit and don't need later operations to wait for this load to complete.
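As a rough illustration of that exit-now use-case, here is a minimal sketch. All names in it (ExitSignal, request_exit, run_until_exit, exit_code) are hypothetical and not from the question; the loop body is only a stand-in for real work. The flag is polled with a relaxed load on every iteration, and the acquire fence is paid for once, on the iteration that actually sees the flag set.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical example type: a flag plus plain data published with it.
struct ExitSignal
{
    std::atomic<bool> exit_now{false};
    int exit_code = 0; // non-atomic payload, written before the flag is set

    // Called once by the controlling thread.
    void request_exit(int code)
    {
        // Precondition for this sketch: the flag is set at most once.
        assert(!exit_now.load(std::memory_order_relaxed));
        exit_code = code;                                // write the payload first
        exit_now.store(true, std::memory_order_release); // then publish it
    }
};

// Hot loop: relaxed polling, with acquire ordering only on the exit path.
inline int run_until_exit(const ExitSignal &sig, std::uint64_t &iterations)
{
    iterations = 0;
    while (!sig.exit_now.load(std::memory_order_relaxed))
    {
        ++iterations; // stand-in for throughput-sensitive work
    }
    // The relaxed load above read the value written by the release store;
    // this fence, sequenced after that load, synchronizes with that store.
    std::atomic_thread_fence(std::memory_order_acquire);
    return sig.exit_code; // ordered after the fence, so the payload is visible
}
```

A caller would run run_until_exit on one thread and call request_exit from another; whether this beats a plain acquire load in the loop depends on the target ISA, as discussed in the rest of this answer.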

This is a win on some ISAs (where a load(acquire) is already a load+barrier), but on others it's usually worse, especially if the case we care about most (the "fast path") is the one that loads value. (That applies to ISAs where a fence(acquire) is more expensive than a load(acquire), notably 32-bit ARM with the new ARMv8 instructions: lda is just an acquire load, but a fence is still a dmb ish full barrier.)

If the !done case is common and there's other work to do, then it's maybe worth considering the tradeoff, since std::memory_order_consume is not currently usable for its intended purpose. (See below re: memory dependency ordering solving this specific case without any barrier.)

For other common ISAs, no, it wouldn't make sense because it would make the "success" case slower, maybe much slower if it ended up with a full barrier. If that's the normal fast-path through the function, that would obviously be terrible.

On x86 there's no difference: fence(acquire) is a no-op, and load(acquire) uses the same asm as load(relaxed). That's why we say x86's hardware memory model is "strongly ordered". Most other mainstream ISAs aren't like this.

For some ISAs this is a pure win in this case: namely those that implement done.load(acquire) as a plain load followed by the same barrier instruction that fence(acquire) would use (like RISC-V, or 32-bit ARM without ARMv8 instructions). They have to branch anyway, so it's just a matter of where we place the barrier relative to the branch. (Unless they choose to unconditionally load value and branchlessly select, like MIPS movn, which is allowed because they already load another member of that Worker object, so this is known to be a valid pointer to a full object.)

AArch64 can do acquire loads quite cheaply, but an acquire barrier would be more expensive. (And it would happen on what would normally be the fast path; speeding up the "failure" path is normally not important.)

Instead of a barrier, a 2nd load, this time with acquire, could possibly be better. If the flag can only change from 0 to 1, you don't even need to re-check its value; accesses to the same atomic object are ordered within the same thread.
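A minimal sketch of that second-load variant, reusing the Worker class from the question. It assumes the flag only ever changes from false to true. Two things differ from the question's code and are my own choices: the sleep is shortened, and the thread member is declared last so that done and value are initialized before the thread starts.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

class Worker
{
    // Declared before th so both are initialized before the thread starts.
    std::atomic_bool done{false};
    int value = 0;
    std::thread th;

  public:
    Worker()
        : th([this]
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        value = 42;
        done.store(true, std::memory_order_release);
    }) {}

    int result() const
    {
        // Fast "not done yet" path: a relaxed load with no ordering cost.
        if (!done.load(std::memory_order_relaxed))
            return -1;

        // Reload with acquire instead of issuing a fence. Because done only
        // goes false -> true, coherence guarantees this load also observes
        // true, so it synchronizes with the release store in the worker.
        [[maybe_unused]] const bool seen = done.load(std::memory_order_acquire);
        assert(seen);
        return value;
    }

    Worker(const Worker &) = delete;
    Worker &operator=(const Worker &) = delete;

    ~Worker() { th.join(); }
};
```

On ISAs with a cheap acquire-load instruction, the second load should hit in cache, so this form may avoid the cost of a stand-alone barrier; that is an expectation to verify on the target, not a measured result.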

(I had a Godbolt link with some examples for many ISAs, but a browser restart ate it.)

Memory dependency order could solve this problem with no barriers

Unfortunately std::memory_order_consume is temporarily deprecated, otherwise you could have the best of both worlds for this case, by creating an &value pointer with a data-dependency on done.load(consume). So the load of value (if done at all) would be dependency-ordered after the load from done, but other independent later loads wouldn't have to wait.

e.g. if ( (tmp = done.load(consume)) ) and return (&value)[tmp-1]. This is easy in asm, but without fully working consume support, compilers would optimize out the use of tmp in the side of the branch that can only be reached with tmp = true.

So the only ISA that actually needs to make this barrier tradeoff in asm is Alpha, but due to C++ limitations we can't easily take advantage of the hardware support that other ISAs offer.

If you're willing to use something that will work in practice despite not having guarantees, use std::atomic<int *> done = nullptr; and do a release-store of &value instead of =true. Then in the reader, do a relaxed load, and if(tmp) { return *tmp; } else { return -1; }. If the compiler can't prove that the only non-null pointer value is &value, it will need to keep the data dependency on the pointer load. (To stop it from proving that, perhaps include a set member function that stores an arbitrary pointer in done, which you never call.)
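Spelled out, the pointer-publishing idea might look like the sketch below. The class name PtrWorker is hypothetical, and the shortened sleep and the destructor sanity check are my additions. Note loudly: the C++ standard does not guarantee that a relaxed load orders the later dereference (formally this is a data race), so this relies on the compiler preserving the data dependency and on hardware dependency ordering, exactly as caveated above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

class PtrWorker // hypothetical name for the pointer-publishing variant
{
    std::atomic<int *> done{nullptr};
    int value = 0;
    std::thread th; // declared last so the members above exist before it runs

  public:
    PtrWorker()
        : th([this]
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        value = 42;
        // Release-store the address of the data instead of a bool.
        done.store(&value, std::memory_order_release);
    }) {}

    int result() const
    {
        // Relaxed load, no barrier. The dereference depends on the loaded
        // pointer; NOT guaranteed by the standard, works in practice only
        // if the compiler keeps that data dependency in the generated asm.
        int *tmp = done.load(std::memory_order_relaxed);
        return tmp ? *tmp : -1;
    }

    // Never called. Exists only so the compiler cannot prove that the only
    // non-null value of done is &value and then drop the dependency.
    void set(int *p) { done.store(p, std::memory_order_release); }

    PtrWorker(const PtrWorker &) = delete;
    PtrWorker &operator=(const PtrWorker &) = delete;

    ~PtrWorker()
    {
        th.join();
        // Sanity check: the worker has published by the time it is joined.
        assert(done.load(std::memory_order_relaxed) != nullptr);
    }
};
```

If formal correctness matters more than the last bit of performance, replacing the relaxed load with an acquire load keeps the same structure and removes the reliance on dependency ordering.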

See "C++11: the difference between memory_order_relaxed and memory_order_consume" for details, and for a link to Paul E. McKenney's CppCon 2016 talk, where he explains what consume was supposed to be for and how Linux RCU does use the kind of thing I suggested: effectively relaxed loads, depending on the compiler to make asm with data dependencies. (This requires being careful not to write code where the compiler can optimize away the data dependency.)
