Data race with 64-bit loads/stores on ARM64

Posted 2025-02-10 12:32:54 · 721 characters · 1 view · 2 answers


According to this, a 64-bit load/store is considered an atomic access on arm64. Given this, is the following program still considered to have a data race (and thus able to exhibit UB) when compiled for arm64? (Ignore ordering with respect to other memory accesses.)

uint64_t x;

// Thread 1
void f()
{
  uint64_t a = x;
}

// Thread 2
void g()
{
  x = 1;
}

If instead I switch this to using

std::atomic<uint64_t> x{};

// Thread 1
void f()
{
  uint64_t a = x.load(std::memory_order_relaxed);
}

// Thread 2
void g()
{
  x.store(1, std::memory_order_relaxed);
}

Is the second program considered data race free?

On arm64, it looks like the compiler ends up generating the same instruction for a normal 64-bit load/store and for a load/store of an atomic with memory_order_relaxed, so what's the difference?


幸福不弃 · answered 2025-02-17 12:32:54


std::atomic solves 4 problems.

One is that a load/store is atomic, meaning you don't get loads and stores torn so that, for example, you load 32 bits from before a store and the other 32 bits from after it. Normally, everything up to register size is naturally atomic in that sense on the CPU itself. Things might break with unaligned access, potentially only when the access crosses a cache line. In std::atomic<T> implementations you will see the use of locks when the size of T exceeds the size the CPU can read/write atomically on its own.

The second thing std::atomic does is synchronize access between threads. Just because one thread writes data to a variable doesn't mean another thread sees that data appear instantly. The writing CPU puts the data into its store buffer, hoping it just gets overwritten again or that adjacent memory gets written and the two writes can be combined. After a while the data goes to the L1 cache, where it can stay even longer, then L2 and L3. Depending on the architecture, caches may or may not be shared between CPU cores, and they might not synchronize automatically. So when you want to access the same memory address from multiple cores, you have to tell the CPU to synchronize the access with the other cores.

The third thing has to do with modern CPUs doing out-of-order execution and speculative execution. That means even if the code checks one variable and then reads a second variable, the CPU might read the second variable first. If the first variable acts as a semaphore signaling that the second variable is ready to be read, this can fail because the read happens before the data is ready. std::atomic adds barriers preventing the CPU from doing these reorderings, so reads and writes happen in a specific order in the hardware.

The fourth thing is much the same, but for the compiler. std::atomic prevents the compiler from reordering instructions across it, and from optimizing multiple reads or writes into just one.

All of this std::atomic does automatically for you if you just use it without specifying any memory order. The default memory order (memory_order_seq_cst) is the strongest one.

But when you use

uint64_t a = x.load(std::memory_order_relaxed);

you tell the compiler to ignore most of the things:

Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed

So you have instructed the compiler not to care about synchronizing with other threads or caches, or about preserving the order in which the instructions are written. All you care about is that reads and writes are not broken up into two or more parts where you could get mixed data. The load will get either the whole value from before the store or the whole value from after the store in the other thread, but it's completely unspecified which of the two values you get. That is exactly what arm64 already gives you for free for aligned 64-bit loads/stores, so the generated code is identical.

Note: if you have multiple atomics, then accessing one of them with a stronger memory order will synchronize the others as well. So you can see code that does one load with a strong order together with other loads with a weak order, and the same for groups of writes. This can speed up access, but it's hard to get right.

全部不再 · answered 2025-02-17 12:32:54


Whether or not an access is a data race in the sense of the C++ language standard is independent of the underlying hardware. The language has its own memory model, and even if a straightforward compilation to the target architecture would be free of problems, the compiler may still optimize based on the assumption that the program is free of data races in the sense of the C++ memory model.

Accessing a non-atomic in two threads without synchronization with one of them being a write is always a data race in the C++ model. So yes, the first program has a data race and therefore undefined behavior.

In the second program the object is an atomic, so there cannot be a data race.
