Lock-free deque in Win32 C++
I'm pretty new to lockless data structures, so as an exercise I wrote (what I hope functions as) a bounded lockless deque (no resizing yet; I just want to get the base cases working). I'd just like some confirmation from people who know what they're doing as to whether I've got the right idea and/or how I might improve this.
class LocklessDeque
{
public:
    LocklessDeque() : m_empty(false),
                      m_bottom(0),
                      m_top(0) {}

    ~LocklessDeque()
    {
        // Delete remaining tasks
        for( unsigned i = m_top; i < m_bottom; ++i )
            delete m_tasks[i];
    }

    void PushBottom(ITask* task)
    {
        m_tasks[m_bottom] = task;
        InterlockedIncrement(&m_bottom);
    }

    ITask* PopBottom()
    {
        if( m_bottom - m_top > 0 )
        {
            m_empty = false;
            InterlockedDecrement(&m_bottom);
            return m_tasks[m_bottom];
        }
        m_empty = true;
        return NULL;
    }

    ITask* PopTop()
    {
        if( m_bottom - m_top > 0 )
        {
            m_empty = false;
            InterlockedIncrement(&m_top);
            return m_tasks[m_top];
        }
        m_empty = true;
        return NULL;
    }

    bool IsEmpty() const
    {
        return m_empty;
    }

private:
    ITask* m_tasks[16];
    bool m_empty;
    volatile unsigned m_bottom;
    volatile unsigned m_top;
};
4 Answers
Looking at this, I would think this would be a problem: if this is used in an actual multithreaded environment, I would think you'd collide when setting m_tasks[m_bottom]. Think of what would happen if you have two threads trying to do this at the same time - you couldn't be sure which one actually set m_tasks[m_bottom]. Check out this article, which is a reasonable discussion of a lock-free queue.
Your use of the m_bottom and m_top members to index the array is not okay. You can use the return value of InterlockedXxxx() to get a safe index. You'll need to lose IsEmpty(); it can never be accurate in a multithreading scenario. Same problem with the empty check in PopXxx. I don't see how you could make that work without a mutex.
The key to doing almost impossible stuff like this is to use InterlockedCompareExchange. (This is the name Win32 uses but any multithreaded-capable platform will have an InterlockedCompareExchange equivalent).
The idea is, you make a copy of the structure, which must be small enough to perform an atomic read (64 bits, or, if you can handle some unportability, 128 bits on x86).
You make another copy with your proposed update, do your logic and update the copy, then you update the "real" structure using InterlockedCompareExchange. What InterlockedCompareExchange does is atomically make sure the value is still the value you started with before your state update, and, if it is still that value, atomically update it with the new state. Generally this is wrapped in an infinite loop that keeps trying until no one else has changed the value in the middle. Here is roughly the pattern:
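The code block that originally followed here did not survive the page scrape. A minimal sketch of the compare-exchange loop just described might look like this, using C++11 std::atomic, whose compare_exchange_weak is the portable equivalent of Win32's InterlockedCompareExchange. The State struct, pack/unpack helpers, and TryPopTop are my illustrative names, not from the original answer:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative state: both indices packed into one 64-bit word so a
// single atomic compare-exchange can update them together.
struct State {
    std::uint32_t top;
    std::uint32_t bottom;
};

std::atomic<std::uint64_t> g_state{0};

std::uint64_t pack(State s)   { return (std::uint64_t(s.bottom) << 32) | s.top; }
State unpack(std::uint64_t v) { return State{ std::uint32_t(v), std::uint32_t(v >> 32) }; }

// The pattern: snapshot the state, compute the proposed update on the
// copy, then publish it only if nobody changed the state in the meantime.
bool TryPopTop(std::uint32_t* outIndex) {
    std::uint64_t old = g_state.load();
    for (;;) {
        State s = unpack(old);
        if (s.bottom - s.top == 0)
            return false;                   // empty, nothing to claim
        State updated = s;
        std::uint32_t idx = updated.top++;  // claim the top element
        // On failure, compare_exchange_weak reloads the current value
        // into 'old' and we retry; on success the update was atomic.
        if (g_state.compare_exchange_weak(old, pack(updated))) {
            *outIndex = idx;
            return true;
        }
    }
}
```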
(Please forgive if the above code doesn't actually compile - I wrote it off the top of my head)
Great. Now you can make lockless algorithms easy. WRONG! The trouble is that you are severely limited on the amount of data that you can update atomically.
Some lockless algorithms use a technique where they "help" concurrent threads. For example, say you have a linked list that you want to be able to update from multiple threads, other threads can "help" by performing updates to the "first" and "last" pointers if they are racing through and see that they are at the node pointed to by "last" but the "next" pointer in the node is not null. In this example, upon noticing that the "last" pointer is wrong, they update the last pointer, only if it still points to the current node, using an interlocked compare exchange.
Don't fall into a trap where you "spin" or loop (like a spinlock). While there is value in spinning briefly because you expect the "other" thread to finish something - they may not. The "other" thread may have been context switched and may not be running anymore. You are just eating CPU time, burning electricity (killing a laptop battery perhaps) by spinning until a condition is true. The moment you begin to spin, you might as well chuck your lockless code and write it with locks. Locks are better than unbounded spinning.
Just to go from hard to ridiculous, consider the mess you can get yourself into with other architectures - things are generally pretty forgiving on x86/x64, but when you get into other "weakly ordered" architectures, you get into territory where things happen that make no sense - memory updates won't happen in program order, so all your mental reasoning about what the other thread is doing goes out the window. (Even x86/x64 have a memory type called "write combining" which is often used when updating video memory but can be used for any memory buffer hardware, where you need fences) Those architectures require you to use "memory fence" operations to guarantee that all reads/writes/both before the fence will be globally visible (by other cores). A write fence guarantees that any writes before the fence will be globally visible before any writes after the fence. A read fence will guarantee that no reads after the fence will be speculatively executed before the fence. A read/write fence (aka full fence or memory fence) will make both guarantees. Fences are very expensive. (Some use the term "barrier" instead of "fence")
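As a sketch of the write-fence / read-fence pairing described above (my example, not from the answer), here is a data handoff using the portable C++11 fences; the busy-wait is for demonstration only:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int g_payload = 0;                 // plain data being handed off
std::atomic<bool> g_ready{false};  // flag the fences are ordered around

void Producer() {
    g_payload = 42;                                      // write the data
    // Write fence: the payload write above becomes globally visible
    // before the flag write below.
    std::atomic_thread_fence(std::memory_order_release);
    g_ready.store(true, std::memory_order_relaxed);
}

void Consumer() {
    while (!g_ready.load(std::memory_order_relaxed)) {}  // spin, demo only
    // Read fence: the payload read below cannot be speculated ahead
    // of observing the flag.
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(g_payload == 42);  // guaranteed to see the producer's write
}
```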
My suggestion is to implement it first with locks/condition variables. If you have any trouble getting that working perfectly, it's hopeless to attempt a lockless implementation. And always measure, measure, measure. You'll probably find the performance of the implementation using locks is perfectly fine - without the uncertainty of some flaky lockless implementation with a nasty hang bug that will only show up when you're doing a demo for an important customer. Perhaps you can fix the problem by redefining the original problem into something more easily solved, perhaps by restructuring the work so bigger items (or batches of items) go into the collection, which reduces the pressure on the whole thing.
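As a concrete starting point for the lock-based version suggested above (my sketch, not the answer's code, using std::mutex in place of a Win32 CRITICAL_SECTION):

```cpp
#include <deque>
#include <mutex>

// One mutex guarding a std::deque. Get this correct and measured
// before attempting a lock-free replacement.
template <typename Task>
class LockedDeque {
public:
    void PushBottom(Task* task) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tasks.push_back(task);
    }
    Task* PopBottom() {  // LIFO end, used by the owning thread
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_tasks.empty()) return nullptr;
        Task* t = m_tasks.back();
        m_tasks.pop_back();
        return t;
    }
    Task* PopTop() {     // FIFO end, used by stealing threads
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_tasks.empty()) return nullptr;
        Task* t = m_tasks.front();
        m_tasks.pop_front();
        return t;
    }
private:
    std::mutex m_mutex;
    std::deque<Task*> m_tasks;
};
```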
Writing lockless concurrent algorithms is very difficult (as you've seen written 1000x elsewhere I'm sure). It is often not worth the effort either.
Addressing the problem pointed out by Aaron, I'd do something like:
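The push code this answer originally showed did not survive the page scrape. A reconstruction of the idea (my sketch, using std::atomic's fetch_add as the portable equivalent of InterlockedIncrement) claims the slot via the increment's return value before writing:

```cpp
#include <atomic>

struct ITask;  // tasks stay opaque, as in the question

const unsigned kCapacity = 16;
ITask* m_tasks[kCapacity];
std::atomic<unsigned> m_bottom{0};

// Claim a unique slot first, then write into it. fetch_add returns the
// value *before* the increment, so no two threads get the same 'pos';
// on Win32, InterlockedIncrement(&m_bottom) - 1 yields the same index.
// Note this fixes only the slot-claiming collision Aaron pointed out,
// not the deque as a whole.
void PushBottom(ITask* task) {
    unsigned pos = m_bottom.fetch_add(1);
    m_tasks[pos % kCapacity] = task;  // wrap to stay inside the array
}
```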
Likewise, to pop:
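The pop code is likewise missing from the scrape; a reconstruction of the same idea (my sketch, again with std::atomic standing in for the Win32 interlocked calls) claims the index first, then reads the slot:

```cpp
#include <atomic>

struct ITask;  // tasks stay opaque, as in the question

const unsigned kCapacity = 16;
ITask* m_tasks[kCapacity];
std::atomic<unsigned> m_top{0};
std::atomic<unsigned> m_bottom{0};

// Claim the index via the atomic increment, then read that slot.
// fetch_add returns the pre-increment value, i.e. the element we now
// own; on Win32 this is InterlockedIncrement(&m_top) - 1. The emptiness
// check is still racy - this addresses only the index-claiming collision.
ITask* PopTop() {
    if (m_bottom.load() - m_top.load() == 0)
        return nullptr;
    unsigned pos = m_top.fetch_add(1);
    return m_tasks[pos % kCapacity];
}
```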
I'd eliminate both m_empty and IsEmpty() from the design completely. The result returned by IsEmpty is subject to a race condition, meaning that by the time you look at that result, it may well be stale (i.e., what it tells you about the queue may be wrong by the time you look at what it returned). Likewise, m_empty provides nothing but a record of information that's already available without it, a recipe for producing stale data. Using m_empty doesn't guarantee it can't work right, but it does increase the chances of bugs considerably, IMO.

I'm guessing it's due to the interim nature of the code, but right now you also have some serious problems with the array bounds. You aren't doing anything to force your array indexes to wrap around, so as soon as you try to push the 17th task onto the queue, you're going to have a major problem.
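One common sketch of forcing that wrap-around (assuming the capacity stays a power of two, like the original 16) is to mask every index on access:

```cpp
// Hypothetical helper: map an ever-growing counter onto the 16-element
// ring. Because 16 is a power of two, (counter & 15) == (counter % 16).
const unsigned kCapacity = 16;

unsigned SlotFor(unsigned counter) {
    return counter & (kCapacity - 1);
}
```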
Edit: I should point out that the race condition noted in the comment is quite serious -- and it's not the only one either. Although somewhat better than the original, this should not be mistaken for code that will work correctly.
I'd say that writing correct lock-free code is considerably more difficult than writing correct code that uses locks. I don't know of anybody who has done so without a solid understanding of code that does use locking. Based on the original code, I'd say it would be much better to start by writing and understanding code for a queue that does use locks, and only once you've used that to gain a much better understanding of the issues involved should you really attempt lock-free code.