C 中是否有单消费者单生产者无锁队列实现?

发布于 2024-07-24 10:20:26 字数 589 浏览 4 评论 0 原文

我正在编写一个带有消费者线程和生产者线程的程序,现在看来队列同步在程序中是一个很大的开销,我寻找了一些无锁队列实现,但只找到了Lamport的版本和PPoPP上的改进版本' 08:

enqueue_nonblock(data) {
    if (NULL != buffer[head]) {
        return EWOULDBLOCK;
    }
    buffer[head] = data;
    head = NEXT(head);
    return 0;
}

dequeue_nonblock(data) {
    data = buffer[tail];
    if (NULL == data) {
        return EWOULDBLOCK;
    }
    buffer[tail] = NULL;
    tail = NEXT(tail);
    return 0;
}

两个版本都需要为数据预先分配数组,我的问题是是否有任何使用 malloc() 动态分配空间的单消费者单生产者无锁队列实现?

另一个相关问题是,如何测量队列同步中的精确开销? 比如pthread_mutex_lock()需要多少时间等。

I'm writing a program with a consumer thread and a producer thread, now it seems queue synchronization is a big overhead in the program, and I looked for some lock free queue implementations, but only found Lamport's version and an improved version on PPoPP '08:

enqueue_nonblock(data) {
    if (NULL != buffer[head]) {
        return EWOULDBLOCK;
    }
    buffer[head] = data;
    head = NEXT(head);
    return 0;
}

dequeue_nonblock(data) {
    data = buffer[tail];
    if (NULL == data) {
        return EWOULDBLOCK;
    }
    buffer[tail] = NULL;
    tail = NEXT(tail);
    return 0;
}

Both versions require a pre-allocated array for the data, my question is that is there any single-consumer single-producer lock-free queue implementation which uses malloc() to allocate space dynamically?

And another related question is, how can I measure exact overhead in queue synchronization? Such as how much time it takes of pthread_mutex_lock(), etc.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

不再让梦枯萎 2024-07-31 10:20:26

如果您担心性能,那么添加 malloc() 不会有任何帮助。 如果您不担心性能,为什么不简单地通过互斥体控制对队列的访问。 您是否实际测量过此类实现的性能? 在我看来,您似乎正在走熟悉的过早优化路线。

If you are worried about performance, adding malloc() to the mix won't help things. And if you are not worried about performance, why not simply control access to the queue via a mutex. Have you actually measured the performance of such an implementation? It sounds to me as though you are going down the familar route of premature optimisation.

爱冒险 2024-07-31 10:20:26

您展示的算法能够正常工作,因为尽管两个线程共享资源(即队列),但它们以非常特殊的方式共享资源。 因为只有一个线程会更改队列的头索引(生产者),并且只有一个线程每次会更改尾索引(当然是消费者),所以您无法获得共享对象的不一致状态。 同样重要的是,生产者在更新头索引之前将实际数据放入,并且消费者在更新尾索引之前读取它想要的数据。

它工作得很好,因为数组是相当静态的; 两个线程都可以依赖那里的元素的存储。 您可能无法完全替换该数组,但您可以做的是更改该数组的用途。

即,不是将数据保存在数组中,而是使用它来保存指向数据的指针。 然后,您可以 malloc() 和 free() 数据项,同时通过数组在线程之间传递对它们的引用(指针)。

此外,posix 确实支持读取纳秒时钟,尽管实际精度取决于系统。 您可以在之前和之后读取这个高分辨率时钟,然后减去即可。

The algorithm you show manages to work because although the two threads share the resource (i.e., the queue), they share it in a very particular way. Because only one thread ever alters the head-index of the queue (the producer), and only one thread every alters the tail-index (consumer, of course), you can't get an inconsistent state of the shared object. It's also important that the producer put the actual data in before updating the head index, and that the consumer reads the data it wants before updating the tail index.

It works as well as it does b/c the array is quite static; both threads can count on the storage for the elements being there. You probably can't replace the array entirely, but what you can do is change what the array is used for.

I.e., instead of keeping the data in the array, use it to keep pointers to the data. Then you can malloc() and free() the data items, while passing references (pointers) to them between your threads via the array.

Also, posix does support reading a nanosecond clock, although the actual precision is system dependent. You can read this high resolution clock before and after and just subtract.

晨曦÷微暖 2024-07-31 10:20:26

是的。

存在许多无锁多读多写队列。

我已经实现了 Michael 和 Scott 1996 年论文中的一个。

我将(经过更多测试)发布一个小型无锁数据结构库(C 语言),其中将包含此队列。

Yes.

There exist a number of lock-free multiple-reader multiple-writer queues.

I have implemented one, by Michael and Scott, from their 1996 paper.

I will (after some more testing) be releasing a small library of lock-free data structures (in C) which will include this queue.

唐婉 2024-07-31 10:20:26

你应该看看 FastFlow 库

You should look at FastFlow library

像极了他 2024-07-31 10:20:26

我记得几年前看到过一个看起来很有趣的东西,尽管现在我似乎找不到了。 :( 提出的无锁实现确实需要使用 CAS 原语,尽管即使锁定实现(如果您不想使用 CAS 原语)也具有相当好的性能特征——锁只能阻止多个读取器或多个生产者同时命中队列,但生产者仍然永远不会 。

我确实记得队列背后的基本概念是创建一个链表,其中总是有一个额外的“空”节点,这个额外的节点意味着列表的头和尾指针永远不会 当列表为空时引用相同的数据我希望我能找到这篇论文,我的解释没有公正地解释算法......

啊哈!

我找到了转录该算法,没有本文的其余部分。 这可能是一个有用的起点。

I recall seeing one that looked interesting a few years ago, though I can't seem to find it now. :( The lock-free implementation that was proposed did require use of a CAS primitive, though even the locking implementation (if you didn't want to use the CAS primitive) had pretty good perf characteristics--- the locks only prevented multiple readers or multiple producers from hitting the queue at the same time, the producer still never raced with the consumer.

I do remember that the fundamental concept behind the queue was to create a linked list that always had one extra "empty" node in it. This extra node meant that the head and the tail pointers of the list would only ever refer to the same data when the list was empty. I wish I could find the paper, I'm not doing the algorithm justice with my explanation...

AH-ha!

I've found someone who transcribed the algorithm without the remainder of the article. This could be a useful starting point.

亽野灬性zι浪 2024-07-31 10:20:26

我已经使用了一个相当简单的队列实现,可以满足您的大部分标准。 它使用静态最大大小字节池,然后我们在其中实现消息。 有一个进程将移动的头指针,以及另一进程将移动的尾指针。

仍然需要锁,但我们使用了 Peterson 的 2 处理器算法,该算法非常轻量级因为它不涉及系统调用。 只有非常小的、边界明确的区域才需要锁:最多几个 CPU 周期,因此永远不会阻塞太久。

I've worked with a fairly simple queue implementation the meets most of your criteria. It used a static maximum size pool of bytes, and then we implemented messages within that. There was a head pointer that one process would move, and and a tail pointer that the other process would move.

Locks were still required, but we used Peterson's 2-Processor Algorithm, which is pretty lightweight since it doesn't involve system calls. The lock is only required for very small, well-bounded area: a few CPU cycles at most, so you never block for long.

懵少女 2024-07-31 10:20:26

我认为分配器可能是一个性能问题。 您可以尝试使用自定义多线程内存分配器,它使用链接列表来维护已释放的块。 如果您的块大小不(几乎)相同,您可以实现“Buddy 系统内存分配器”,速度非常快。 您必须将队列(环形缓冲区)与互斥体同步。

为了避免过多的同步,您可以尝试在每次访问时向队列写入多个值或从队列读取多个值。

如果您仍然想使用无锁算法,则必须使用预分配数据或使用无锁分配器。
有一篇关于无锁分配器“可扩展无锁动态内存分配”的论文,以及一个实现 Streamflow

在开始使用无锁内容之前,请查看:循环无锁缓冲区

I think the allocator can be a performance problem. You can try to use a custom multithreaded memory allocator, that use a linked-list for maintaing freed blocks. If your blocks are not (nearly) the same size, you can implement a "Buddy system memory allocator", witch is very fast. You have to synchronise your queue (ring buffer) with a mutex.

To avoid too much synchronisation, you can try write/read multiple values to/from the queue at each access.

If you still want to use, lock-free algorithms, then you must use pre-allocated data or use a lock-free allocator.
There is a paper about a lock-free allocator "Scalable Lock-Free Dynamic Memory Allocation", and an implementation Streamflow

Before starting with Lock-free stuff, look at:Circular lock-free buffer

泪冰清 2024-07-31 10:20:26

添加 malloc 会消除您可能获得的任何性能提升,并且基于锁的结构同样有效。 之所以如此,是因为 malloc 需要某种堆上的 CAS 锁,因此某些形式的 malloc 有自己的锁,因此您可能会在内存管理器中锁定。

要使用 malloc,您需要预先分配所有节点并使用另一个队列管理它们...

请注意,您可以制作某种形式的可扩展数组,如果它被扩展,则需要锁定。

此外,虽然互锁在 CPU 上是无锁的,但它们确实会在指令执行期间放置内存锁和块内存,并且经常会导致管道停顿。

Adding malloc would kill any performance gain you may make and a lock based structure would be just as effective. This is so because malloc requires some sort of CAS lock over the heap and hence some forms of malloc have their own lock so you may be locking in the Memory Manager.

To use malloc you would need to pre allocate all the nodes and manage them with another queue...

Note you can make some form of expandable array which would need to lock if it was expanded.

Also while interlocked are lock free on the CPU they do placea memory lock and block memory for the duration of the instruction and often stall the pipeline.

相权↑美人 2024-07-31 10:20:26

此实现使用 C++ 的 new 和 delete ,可以使用 malloc 和 free 将其轻松移植到 C 标准库:

http://www.drdobbs.com/parallel/writing-lock-free-code-a- Corrected-queue/210604448?pgno=2

This implementation uses C++'s new and delete which can trivially be ported to the C standard library using malloc and free:

http://www.drdobbs.com/parallel/writing-lock-free-code-a-corrected-queue/210604448?pgno=2

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文