Can a correct fail-safe process-shared barrier be implemented on Linux?
In a past question, I asked about implementing pthread barriers without destruction races:
How can barriers be destroyable as soon as pthread_barrier_wait returns?
and received a perfect solution from Michael Burr for process-local barriers, but one which fails for process-shared barriers. We later worked through some ideas, but never reached a satisfactory conclusion, and didn't even begin to get into resource failure cases.
Is it possible on Linux to make a barrier that meets these conditions:
- Process-shared (can be created in any shared memory).
- Safe to unmap or destroy the barrier from any thread immediately after the barrier wait function returns.
- Cannot fail due to resource allocation failure.
Michael's attempt at solving the process-shared case (see the linked question) has the unfortunate property that some kind of system resource must be allocated at wait time, meaning the wait can fail. And it's unclear what a caller could reasonably do when a barrier wait fails, since the whole point of the barrier is that it's unsafe to proceed until the remaining N-1
threads have reached it...
A kernel-space solution might be the only way, but even that's difficult due to the possibility of a signal interrupting the wait with no reliable way to resume it...
3 Answers
This is not possible with the Linux futex API, and I think this can be proven as well.
We have here essentially a scenario in which N processes must be reliably awoken by one final process, and further no process may touch any shared memory after the final awakening (as it may be destroyed or reused asynchronously). While we can awaken all processes easily enough, the fundamental race condition is between the wakeup and the wait; if we issue the wakeup before the wait, the straggler never wakes up.
The usual solution to something like this is to have the straggler check a status variable atomically with the wait; this allows it to avoid sleeping at all if the wakeup has already occurred. However, we cannot do this here - as soon as the wakeup becomes possible, it is unsafe to touch shared memory!
One other approach is to actually check if all processes have gone to sleep yet. However, this is not possible with the Linux futex API; the only indication of the number of waiters is the return value from FUTEX_WAKE; if it returns fewer than the number of waiters you expected, you know some weren't asleep yet. However, even if we find out we haven't woken enough waiters, it's too late to do anything - one of the processes that did wake up may have destroyed the barrier already!
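To make the race concrete, here is a deliberately naive futex-based barrier wait in C; the struct layout and names are mine, not something proposed in this answer. The comments mark the accesses that necessarily happen after the moment another participant may already have returned and destroyed the barrier.

    #include <limits.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical layout of a naive process-shared barrier. */
    struct naive_barrier {
        atomic_int count; /* arrivals so far                        */
        atomic_int seq;   /* futex word, bumped by the last arrival */
        int n;            /* number of participants                 */
    };

    static long sys_futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    void naive_barrier_wait(struct naive_barrier *b)
    {
        int seq = atomic_load(&b->seq);

        if (atomic_fetch_add(&b->count, 1) + 1 == b->n) {
            /* Last arrival: release everyone.  FUTEX_WAKE does report how
             * many waiters were actually asleep, but that answer only
             * arrives after the woken threads may already have returned and
             * destroyed the barrier, so it is too late to act on it. */
            atomic_fetch_add(&b->seq, 1);
            sys_futex(&b->seq, FUTEX_WAKE, INT_MAX);
            return;
        }

        /* FUTEX_WAIT checks seq atomically with going to sleep, so a
         * straggler that races with the wake does not sleep forever.  But
         * that check, and the load in the loop condition, both touch the
         * barrier after the point at which other participants may already
         * have returned, which is exactly what must not happen here. */
        while (atomic_load(&b->seq) == seq)
            sys_futex(&b->seq, FUTEX_WAIT, seq);
    }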
So, unfortunately, this kind of immediately-destroyable primitive cannot be constructed with the Linux futex API.
Note that in the specific case of one waiter, one waker, it may be possible to work around the problem; if FUTEX_WAKE returns zero, we know nobody has actually been awoken yet, so you have a chance to recover. Making this into an efficient algorithm, however, is quite tricky.
It's tricky to add a robust extension to the futex model that would fix this. The basic problem is, we need to know when N threads have successfully entered their wait, and atomically awaken them all. However, any of those threads may leave the wait to run a signal handler at any time - indeed, the waker thread may leave its wait to run signal handlers as well.
One possible way that may work, however, is an extension to the keyed event model in the NT API. With keyed events, threads are released from the lock in pairs; if you have a 'release' without a 'wait', the 'release' call blocks for the 'wait'.
This in itself isn't enough due to the issues with signal handlers; however, if we allow for the 'release' call to specify a number of threads to be awoken atomically, this works. You simply have each thread in the barrier decrement a count, then 'wait' on a keyed event on that address. The last thread 'releases' N - 1 threads. The kernel doesn't allow any wake event to be processed until all N-1 threads have entered this keyed event state; if any thread leaves the futex call due to signals (including the releasing thread), this prevents any wakeups at all until all threads are back.
After a long discussion with bdonlan on SO chat, I think I have a solution. Basically, we break the problem down into the two self-synchronized deallocation issues: the destroy operation and unmapping.

Handling destruction is easy: Simply make the pthread_barrier_destroy function wait for all waiters to stop inspecting the barrier. This can be done by having a usage count in the barrier, atomically incremented/decremented on entry/exit to the wait function, and having the destroy function spin waiting for the count to reach zero. (It's also possible to use a futex here, rather than just spinning, if you stick a waiter flag in the high bit of the usage count or similar.)

Handling unmapping is also easy, but non-local: ensure that munmap or mmap with the MAP_FIXED flag cannot occur while barrier waiters are in the process of exiting, by adding locking to the syscall wrappers. This requires a specialized sort of reader-writer lock. The last waiter to reach the barrier should grab a read lock on the munmap rw-lock, which will be released when the final waiter exits (when decrementing the user count results in a count of 0). munmap and mmap can be made reentrant (as some programs might expect, even though POSIX doesn't require it) by making the writer lock recursive. Actually, a sort of lock where readers and writers are entirely symmetric, and each type of lock excludes the opposite type of lock but not the same type, should work best.
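For concreteness, here is a minimal sketch of the usage-count half of this scheme, assuming C11 atomics; the type and function names, and the sched_yield-based spin, are my own choices rather than an actual implementation.

    #include <sched.h>
    #include <stdatomic.h>

    /* Hypothetical layout; only the destroy/usage-count part is sketched. */
    typedef struct {
        atomic_uint users;  /* threads currently inside the wait function */
        /* ... the actual wait/wake state (arrival count, sequence number,
         *     futex word, ...) would live here ... */
    } shared_barrier_t;

    /* Called on entry to the wait function. */
    static void barrier_enter(shared_barrier_t *b)
    {
        atomic_fetch_add_explicit(&b->users, 1, memory_order_acquire);
    }

    /* Called as the very last access the wait function makes to the barrier. */
    static void barrier_exit(shared_barrier_t *b)
    {
        atomic_fetch_sub_explicit(&b->users, 1, memory_order_release);
    }

    void barrier_destroy(shared_barrier_t *b)
    {
        /* Spin until no waiter is still inspecting the barrier.  As noted
         * above, a futex wait could replace the spin by reserving a high
         * bit of `users` as a "destroyer is waiting" flag. */
        while (atomic_load_explicit(&b->users, memory_order_acquire) != 0)
            sched_yield();
    }

The mmap/munmap half is not sketched here; it would wrap those syscalls in the symmetric lock described above, with the last thread to arrive taking the lock before waking the others and the last thread to exit releasing it.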
Well, I think I can do it with a clumsy approach...
Have the "barrier" be its own process listening on a socket. Implement barrier_wait as:
Once N threads are waiting, the barrier process tells all of them to proceed. Each waiter then closes its connection to the barrier process and continues.
Implement barrier_destroy as:
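And a corresponding sketch of the destroy side, using the same headers as above; the one-byte "go away" message is again my assumption:

    /* Hypothetical destroy side: tell the barrier process to go away; it
     * exits once every waiter connection has also been closed. */
    int barrier_destroy_via_socket(const char *path)
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        strncpy(sa.sun_path, path, sizeof sa.sun_path - 1);
        if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
            close(fd);
            return -1;
        }

        char die = 'D';  /* message format is an assumption */
        ssize_t w = write(fd, &die, 1);
        close(fd);
        return (w == 1) ? 0 : -1;
    }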
Once all connections are closed and the barrier process has been told to go away, it exits.
[Edit: Granted, this allocates and destroys a socket as part of the wait and release operations. But I think you can implement the same protocol without doing so; see below.]
First question: Does this protocol actually work? I think it does, but maybe I do not understand the requirements.
Second question: If it does work, can it be simulated without the overhead of an extra process?
I believe the answer is "yes". You can have each thread "take the role of" the barrier process at the appropriate time. You just need a master mutex, held by whichever thread is currently "taking the role" of the barrier process. Details, details... OK, so the barrier_wait might look like:
Here master_mutex (a mutex), master_condition_variable (a condition variable), waiter_count (an unsigned integer), N (another unsigned integer), and time_to_die (a Boolean) are all shared state allocated and initialized by barrier_init. waiter_count is initialized to zero, time_to_die to false, and N to the number of threads the barrier is waiting for.

Then barrier_destroy would be:
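And, under the same assumptions, a sketch of barrier_destroy:

    void barrier_destroy(void)
    {
        pthread_mutex_lock(&master_mutex);
        time_to_die = true;
        /* Wait until the last waiter has left barrier_wait. */
        while (waiter_count != 0)
            pthread_cond_wait(&master_condition_variable, &master_mutex);
        pthread_mutex_unlock(&master_mutex);

        /* No thread touches the shared state past this point, so it can now
         * be torn down and its memory freed or unmapped. */
        pthread_cond_destroy(&master_condition_variable);
        pthread_mutex_destroy(&master_mutex);
    }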
Not sure about all the details concerning signal handling etc... But the basic idea of "last one out turns off the lights" is workable, I think.