Are these memory barriers necessary?

Published 2025-01-17 11:23:31


I encountered the following implementation of Singleton's get_instance function:

template<typename T>
T* Singleton<T>::get_instance()
{
    static std::unique_ptr<T> destroyer;

    T* temp = s_instance.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);

    if (temp == nullptr) 
    {
        std::lock_guard<std::mutex> lock(s_mutex);
        temp = s_instance.load(std::memory_order_relaxed); /* read current status of s_instance */
        if (temp == nullptr) 
        {
            temp = new T;
            
            destroyer.reset(temp);
            std::atomic_thread_fence(std::memory_order_release);
            s_instance.store(temp, std::memory_order_relaxed);
        }
    }
    
    return temp;
}

And I was wondering: is there any value in the acquire and release memory barriers there? As far as I know, memory barriers aim to prevent the reordering of memory operations on two different variables. Let's take the classic example:

(This is all in pseudo-code - don't be caught on syntax)

# Thread 1
while(f == 0);
print(x)

# Thread 2
x = 42;
f = 1;

In this case, we want to prevent the reordering of the two store operations in Thread 2, and the reordering of the two load operations in Thread 1. So we insert barriers:

# Thread 1
while(f == 0);
acquire_fence
print(x)

# Thread 2
x = 42;
release_fence
f = 1;

But in the above code, what is the benefit of the fences?

EDIT

The main difference between those cases, as I see it, is that in the classic example we use memory barriers because we deal with 2 variables - so we have the "danger" of Thread 2 storing f before storing x, or alternatively the danger of Thread 1 loading x before loading f.

But in my Singleton code, what is the possible memory reordering that the memory barriers aim to prevent?

NOTE

I know there are other (and maybe better) ways to achieve this; my question is for educational purposes - I'm learning about memory barriers and I'm curious whether they are useful in this particular case. So please ignore everything else that isn't relevant to that question.


Comments (1)

棒棒糖 2025-01-24 11:23:31


The complexity of this pattern (named double-checked-locking, or DCLP) is that data synchronization can happen in 2 different ways (depending on when a reader accesses the singleton) and they kind of overlap.
But since you're asking about fences, let's skip the mutex part.

But in my Singleton code, what is the possible memory reordering that the memory barriers aim to prevent?

This is not very different from your pseudo code where you already noticed that the acquire and release fences are necessary to guarantee the outcome of 42.
f is used as the signalling variable, and accesses to x had better not be reordered with it.

In the DCL pattern, the first thread gets to allocate memory: temp = new T;
The memory temp points at is going to be accessed by other threads, so it must be synchronized (i.e. made visible to them).
The release fence followed by the relaxed store guarantees that the new operation is ordered before the store, so that other threads will observe the same order.
Thus, once the pointer is written to the atomic s_instance and other threads read the address from s_instance, they will also have visibility of the memory it points at.

The acquire fence does the same thing, but in the opposite direction; it guarantees that everything sequenced after the relaxed load and the fence (i.e. accessing the memory) cannot be reordered before the load, so a reader cannot touch the object before it observes the published pointer.
This way, allocating the memory in one thread and using it in another do not overlap.

In another answer, I tried to visualize this with a diagram.

Note that these fences always come in pairs; a release fence without a matching acquire fence is meaningless. You can also use (and mix) standalone fences with release/acquire operations:

s_instance.store(temp, std::memory_order_release); // no standalone fence necessary

The cost of DCLP is that every use (in every thread) involves a load-acquire, which at a minimum requires an unoptimized load (i.e. a load from L1 cache).
This is why static objects in C++11 (possibly implemented with DCLP) might be slower than in C++98 (no memory model).

For more information about DCLP, see this article by Jeff Preshing.
