Thread-safe data exchange between threads / shared memory in C++ on Linux


I got a "bit" confused:
In production we have two processes communicating via shared memory; part of the data exchange is a long and a bool. Access to this data is not synchronized. It's been working fine for a long time and still is. I know that modifying a value is not atomic, but considering that these values are modified/accessed millions of times, shouldn't this have failed by now?

Here is a sample piece of code, which exchanges a number between two threads:

#include <pthread.h>
#include <xmmintrin.h> // _mm_pause

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

//volatile uint64 s1 = 0;
//volatile uint64 s2 = 0;
uint64 s1 = 0;
uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        // Wait until the main thread has incremented s1.
        while (value == s1)
        {
            _mm_pause(); // busy spin
        }
        //value = __sync_add_and_fetch(&s2, 1);
        value = ++s2;
    }
}

int main(int argc, char* argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        // Wait until the other thread has caught up by incrementing s2.
        while (s2 != value)
        {
            _mm_pause(); // busy spin
        }
        //value = __sync_add_and_fetch(&s1, 1);
        value = ++s1;
    }
}

As you can see, I have commented out a couple of things:

//volatile uint64 s1 = 0;

and

//value = __sync_add_and_fetch(&s1, 1);

__sync_add_and_fetch atomically increments a variable.
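As a minimal illustration of what the builtin does (gcc's legacy __sync builtins, which also act as full memory barriers):

#include <stdio.h>

int main()
{
    unsigned long long counter = 0;

    // Atomically performs counter += 1 and returns the NEW value.
    unsigned long long now = __sync_add_and_fetch(&counter, 1);

    printf("now = %llu\n", now); // prints: now = 1
    return 0;
}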

I know this is not very scientific, but running it a few times without the sync functions, it works totally fine. Furthermore, if I measure both versions, with sync and without, they run at the same speed. How come __sync_add_and_fetch is not adding any overhead?

My guess is that the compiler is guaranteeing atomicity for these operations, and that is why I don't see a problem in production. But that still cannot explain why __sync_add_and_fetch is not adding any overhead (even when running in debug).

Some more details about my environment:
Ubuntu 10.04, gcc 4.4.3,
Intel i5 multicore CPU.

The production environment is similar; it just runs on more powerful CPUs and on CentOS.

Thanks for your help.


Comments (3)

dawn曙光 2024-12-18 03:12:40


Basically you're asking why you see no difference in behavior/performance between

s2++;

and

__sync_add_and_fetch(&s2, 1);

Well, if you go and look at the actual code generated by the compiler in these two cases, you will see that there IS a difference: the s2++ version will have a simple INC instruction (or possibly an ADD), while the __sync version will have a LOCK prefix on that instruction.
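If you want to verify this yourself, here is a minimal sketch (assuming gcc on x86-64; build with g++ -O2 -S and compare the assembly emitted for the two functions; the instructions noted in the comments are what gcc typically emits, though the exact output varies by compiler version and target):

// increments.cpp -- build with: g++ -O2 -S increments.cpp
typedef unsigned long long uint64;

void plain_inc(uint64* p)
{
    ++*p;                        // typically: addq $1, (%rdi)  (or incq)
}

void locked_inc(uint64* p)
{
    __sync_add_and_fetch(p, 1);  // typically: lock addq $1, (%rdi)
}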

So why does it work without the LOCK prefix? Well, while in general the LOCK prefix is required for this to work on ANY x86-based system, it turns out it's not needed for yours. With Intel Core based chips, the LOCK is only needed to synchronize between different CPUs over the bus. When running on a single CPU (even with multiple cores), it does its internal synchronization without it.

So why do you see no slowdown in the __sync case? Well, a Core i7 is a 'limited' chip in that it only supports single-socket systems, so you can't have multiple CPUs. That means the LOCK is never needed, and in fact the CPU just ignores it completely. Now the code is 1 byte larger, which could have an impact if you were ifetch- or decode-limited, but you're not, so you see no difference.

If you were to run on a multi-socket Xeon system, you would see a (small) slowdown for the LOCK prefix, and could also see (rare) failures in the non-LOCK version.

彼岸花似海 2024-12-18 03:12:40


I don't think the compiler generates atomic operations unless you use some compiler-specific constructs, so that's a no-go.

If only two processes are using the shared memory, problems will usually not occur, especially if the code snippets are short enough. The operating system prefers to block one process and run another when it is best to do so (e.g. on I/O), so it will run one process to a good point of isolation, then switch to the next.

Try running a few instances of the same application and see what happens.

秋心╮凉 2024-12-18 03:12:40


I see you're using Martin Thompson's inter-thread-latency example.

My guess is that the compiler is guaranteeing atomicity for these operations, and that is why I don't see a problem in production. But that still cannot explain why __sync_add_and_fetch is not adding any overhead (even when running in debug).

The compiler doesn't guarantee anything here; the x86 platform you're running on does. This code will probably fail on funky hardware.

Not sure what you're doing, but C++11 does provide atomicity with std::atomic. You can also have a look at boost::atomic. I assume you're interested in the Disruptor pattern, so I'll shamelessly plug my port to C++, called disruptor--.
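For illustration, here is a minimal sketch of the same ping-pong rewritten with C++11 std::atomic (names follow the original example; the default sequentially consistent operations are used for simplicity):

#include <atomic>
#include <thread>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

std::atomic<uint64> s1(0);
std::atomic<uint64> s2(0);

void run()
{
    uint64 value = s2.load();
    while (true)
    {
        // Wait until the main thread has incremented s1.
        while (value == s1.load())
            ; // busy spin (an _mm_pause() would fit here, as in the original)
        value = ++s2; // atomic increment; returns the new value
    }
}

int main()
{
    std::thread t(run);

    uint64 value = s1.load();
    while (s1.load() < ITERATIONS)
    {
        // Wait until the other thread has caught up by incrementing s2.
        while (s2.load() != value)
            ; // busy spin
        value = ++s1; // atomic increment
    }

    t.detach(); // run() never returns; detach so ~thread() doesn't terminate()
    return 0;
}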
