What happens to outstanding stores after an object is deleted?
Consider the following simple function (assume most compiler optimizations turned off) being executed by two threads on different cores on an X86 CPU with a store buffer:
#include <iostream>

struct ABC
{
    int x;
    // other members.
};

void dummy(int index)
{
    while (true)
    {
        auto abc = new ABC;
        abc->x = index;
        std::cout << abc->x;
        // do some other things.
        delete abc;
    }
}
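For concreteness, a minimal harness (assumed here; the original snippet doesn't show how the threads are started) that launches the two threads as described. Note both loops run forever:

#include <thread>

int main()
{
    std::thread t1(dummy, 1); // thread1 always stores and prints 1
    std::thread t2(dummy, 2); // thread2 always stores and prints 2
    t1.join();                // never returns; dummy() loops forever
    t2.join();
}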
Here, index is the index of the thread; 1 passed by thread1 and 2 passed by thread2.
So, thread1 should always print 1 and thread2 should always print 2.
Can there be a situation where the store to x is put in the store buffer and is committed after delete is executed? Or is there an implicit memory barrier that ensures the store is committed before delete? Or is it that any outstanding stores are just discarded once delete is encountered?
Situation where this becomes important:
Since delete returns the memory of the object to the free list (with libc), it is possible that a piece of memory that was just free'd in thread1 is returned by the new operator in thread2 (not only the virtual address, even the underlying physical address returned can be the same).
If outstanding stores can execute after delete, it is possible that after thread2 sets abc->x to 2, some older outstanding store from thread1 overwrites it to 1.
This means that in the above program, thread2 can print 1, which is absolutely wrong. Thread1 and thread2 are completely independent, there is no data sharing between the threads from the programmer's point of view, and they should not have to worry about any synchronization.
What am I missing here?
2 Answers
Within a single thread
The CPU has to preserve the illusion of instructions executing one at a time, in program order, for a single thread. This is the cardinal rule of OoO exec. This means tracking what program order was, and making sure loads always see values consistent with that, and that values eventually written to cache are also consistent.
This is very much like C++'s "as-if" rule, just with different observables that need to be preserved. (C++ is very restrictive in what other threads are legally allowed to observe, unlike CPU ISAs, but neither compile-time nor run-time memory-reordering can be explained by reordering source lines; see footnote 1.)
Loads by this core snoop the store buffer, forwarding data from it if the load is reloading a store that hasn't committed yet.
And for any individual memory location, making sure its modification order matches program order, i.e. not reordering stores to the same location. So the final value after the dust settles is the last one in program order. And even observation by other threads will see a consistent modification order for that location; that's why std::atomic is able to provide the guarantee that a modification order exists for every object separately, not having extra changes to A then B then back to A if program order stored B then A. ISO C++ can guarantee this because all real-world CPUs also guarantee it.
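As a rough illustration of that per-object guarantee (a sketch with relaxed atomics; the names are made up):

#include <atomic>

std::atomic<int> a{0};

// Two writers, possibly on different threads/cores:
void w1() { a.store(1, std::memory_order_relaxed); }
void w2() { a.store(2, std::memory_order_relaxed); }

// Any thread repeatedly polling `a` sees values consistent with one single
// modification order for this object: e.g. 0, then 1, then 2 (or 0, 2, 1),
// but never a value "going back" to one that reader has already moved past.
int poll() { return a.load(std::memory_order_relaxed); }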
A system call like munmap is a special case, but otherwise new/delete (and malloc/free) aren't special as far as the CPU is concerned: putting a block on the free list and having other code allocate it is just another case of messing around with pointer-based data structures. As always, the CPU tracks any reordering it's doing to make sure loads see correct values.

Reuse by another thread
You're not wrong to worry about this; correctness doesn't happen for free here based on CPU architecture alone; a buggy libc could get this wrong and allow exactly the problems you describe. @ixSci's answer quotes the relevant part of the C++ standard. (Compile-time ordering of memory access wrt. calls to new/delete is also necessary, but that always has to happen for any non-inline function call that the compiler doesn't know is "pure"; any function might read or write memory so it has to be in sync.)
If the memory is placed on a global free-list that could be reused by another thread, a thread-safe allocator will have used sufficient synchronization to create a C++ happens-before relationship between the code that previously used then deleted the memory, and the code in another thread that just allocated it.
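A toy sketch of where that happens-before edge comes from (this is not how any real libc allocator is implemented; it only shows that free and the next allocation of the same block synchronize through the allocator's own lock or atomics):

#include <mutex>
#include <vector>

struct FreeList
{
    std::mutex m;
    std::vector<void*> blocks;

    void deallocate(void* p)
    {
        std::lock_guard<std::mutex> g(m); // unlock acts as a release
        blocks.push_back(p);
    }

    void* allocate()
    {
        std::lock_guard<std::mutex> g(m); // lock acts as an acquire, pairing with the unlock above
        if (blocks.empty())
            return nullptr;               // a real allocator would fall back to the OS here
        void* p = blocks.back();
        blocks.pop_back();
        return p; // everything the freeing thread wrote to *p happens-before this point
    }
};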
So any old-thread stores into this memory block will already be visible to the thread that just allocated the memory. So they won't step on its stores. If the new thread passes a pointer to this memory to a 3rd thread, it had better use acq/rel or consume/release synchronization itself to make sure that 3rd thread sees its stores, not still stores from the first thread.
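For instance, a hedged sketch of that handoff (the names shared and publish are illustrative):

#include <atomic>

struct ABC { int x; };

std::atomic<ABC*> shared{nullptr};

// The thread that just allocated and initialized the block:
void publish(ABC* p)
{
    p->x = 2;
    shared.store(p, std::memory_order_release); // release: publishes p->x = 2 along with the pointer
}

// The third thread:
void use_it()
{
    if (ABC* p = shared.load(std::memory_order_acquire)) // acquire pairs with the release above
    {
        int v = p->x; // sees 2, not a stale value left over from the block's previous owner
        (void)v;
    }
}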
Unmapping entirely so access to that virtual address faults
If the free involves a munmap that uses a syscall instruction to run kernel code that changes page tables (to invalidate a mapping so loads/stores to it would fault), that itself will provide sufficient serialization. Existing CPUs don't rename the privilege level, so they don't do out-of-order exec into the kernel through a syscall instruction.

It's up to the OS to do sufficient memory-barriering around modifying page tables, although on x86-64 invlpg is already a serializing instruction. (In x86 terminology, that means draining the ROB and store buffer, so all previous instructions are fully done executing with their results written back to L1d cache (for stores).) So there's no possibility of it reordering with earlier loads / stores that depend on that TLB entry, even apart from the switch to kernel mode.

(Switching into kernel mode doesn't necessarily drain the store buffer, though; the physical addresses of those stores are known. The TLB checks were done as the store-address uops were executed. So changes to the page tables don't affect the process of committing them to memory.)
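A Linux/POSIX-only sketch of the "access faults after unmapping" point (just to illustrate the behaviour, not how free() decides when to call munmap):

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t len = 4096;
    void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 1;

    int* p = static_cast<int*>(mem);
    *p = 42;          // fine: the page is mapped and writable
    munmap(mem, len); // the kernel invalidates the mapping (and the TLB entries for it)
    // *p = 43;       // uncommenting this would fault (SIGSEGV): the virtual address is gone
    std::printf("unmapped\n");
    return 0;
}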
Footnote 1: memory reordering isn't source reordering
BTW, memory reordering doesn't work like reordering statements in the C++ source or instructions in the asm machine code; memory reordering is about what other threads can observe as loads read from cache and stores eventually commit to cache at the far end of the store buffer. Reordering the source to try to explain this would break the code, violating the as-if rule, but memory-reordering can produce such effects while still having the thread's operations see correct values for its own stores, e.g. by store-forwarding. That's because real-world ISAs don't have sequentially consistent memory models; you need extra ordering to recover SC. Even an in-order CPU pipeline can reorder loads with a cache that can hit-under-miss, for example, and even strongly-ordered x86 allows StoreLoad reordering: its memory model is basically program-order plus a store buffer with store-forwarding.
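The classic litmus test for that StoreLoad case, as a hedged sketch (relaxed atomics so the compiler emits plain stores/loads; on x86, observing r1 == 0 and r2 == 0 in the same run is allowed and does occur in practice):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1()
{
    x.store(1, std::memory_order_relaxed);  // store sits in this core's store buffer...
    r1 = y.load(std::memory_order_relaxed); // ...while this later load already reads from cache
}

void t2()
{
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main()
{
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    std::printf("r1=%d r2=%d\n", r1, r2); // r1 == 0 && r2 == 0 is possible: StoreLoad reordering
    return 0;
}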
(There was discussion in comments about compile-time reordering and source ordering; the question didn't have this misconception.)
The C++ as-if rule is the same idea that CPUs follow as they execute, just that the ISA's rules are what govern the requirements on external observables. No ISA has memory-ordering rules as weak as ISO C++, e.g. they all guarantee a coherent shared cache, and many CPU ISAs don't have UB. (Although some do, e.g. calling it "unpredictable" behaviour. Much more often just an unpredictable or undefined result in some register; user/supervisor privilege separation requires there be limits on what behaviour is possible so user-space can't run some unsupported instruction sequence and maybe take over or crash the whole machine.)
Fun fact: on strongly-ordered x86 specifically, store and load ordering need to be more closely tied together than on most ISAs; Intel calls the combination of store buffer + load buffer the Memory Order Buffer, because it also has to detect cases where a load took a value early, before it was architecturally allowed to (LoadLoad ordering), but then it turns out this core lost access to the cache line. Or in case of mis-speculation about store-forwarding, e.g. dynamically predicting that a load would be reloading a store from an unknown address, but then it turns out the store was non-overlapping. In either case, the CPU rewinds the out-of-order back-end back to a consistent retirement state. (This is called a pipeline nuke; this specific cause is counted by the machine_clears.memory_ordering perf event.)
According to C++20 (new.delete.dataraces/p1), every deallocation of a particular unit of storage happens before the next allocation (if any) of that same storage.

Since every delete happens before any new of the same memory, then what is sequenced before these operators also happens before these other invocations. And to your example: abc->x = index; is sequenced before delete abc;, which happens before auto abc = new ABC;, and transitively abc->x = index; happens before auto abc = new ABC;. That guarantees that abc->x = index; is complete before auto abc = new ABC;.