How do I recover a semaphore when the process that decremented it to zero crashes?
I have multiple apps compiled with g++, running in Ubuntu. I'm using named semaphores to co-ordinate between different processes.
All works fine except in the following situation: if one of the processes calls sem_wait() or sem_timedwait() to decrement the semaphore and then crashes or is killed -9 before it gets a chance to call sem_post(), then from that moment on, the named semaphore is "unusable".
By "unusable", what I mean is the semaphore count is now zero, and the process that should have incremented it back to 1 has died or been killed.
I cannot find a sem_*() API that might tell me the process that last decremented it has crashed.
Am I missing an API somewhere?
Here is how I open the named semaphore:
sem_t *sem = sem_open( "/testing",
    O_CREAT |   // create the semaphore if it does not already exist
    O_CLOEXEC,  // close on execute
    S_IRWXU |   // permissions: user
    S_IRWXG |   // permissions: group
    S_IRWXO,    // permissions: other
    1 );        // initial value of the semaphore
Here is how I decrement it:
struct timespec timeout = { 0, 0 };
clock_gettime( CLOCK_REALTIME, &timeout );
timeout.tv_sec += 5;
if ( sem_timedwait( sem, &timeout ) )
{
    throw "timeout while waiting for semaphore";
}
Turns out there isn't a way to reliably recover the semaphore. Sure, anyone can sem_post() to the named semaphore to get the count to increase past zero again, but how to tell when such a recovery is needed? The API provided is too limited and doesn't indicate in any way when this has happened.
Beware of the ipc tools also available: the common tools ipcmk, ipcrm, and ipcs are only for the outdated SysV semaphores. They specifically do not work with the new POSIX semaphores.
But it looks like there are other things that can be used to lock things, which the operating system does automatically release when an application dies in a way that cannot be caught in a signal handler. Two examples: a listening socket bound to a particular port, or a lock on a specific file.
I decided the lock on a file is the solution I needed. So instead of a sem_wait() and sem_post() call, I'm using a file lock call and a matching unlock call.
When the application exits in any way, the file is automatically closed, which also releases the file lock. Other client apps waiting for the "semaphore" are then free to proceed as expected.
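The answer's own lock and unlock calls are not shown here, so the following is only a minimal sketch of the idea, assuming flock() as the locking call. The helper names acquire_file_lock() and release_file_lock() are hypothetical, as is any lock file path you pass in.

```cpp
#include <fcntl.h>      // open
#include <sys/file.h>   // flock
#include <unistd.h>     // close

// Hypothetical helper: open (or create) the lock file and block until an
// exclusive advisory lock is held. Returns the descriptor, or -1 on error.
int acquire_file_lock(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR | O_CLOEXEC, 0666);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {   // blocks while another holder exists
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: the "sem_post" side. If the process dies before
// calling this, the kernel closes the descriptor and the lock goes with it.
void release_file_lock(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

To approximate the 5-second timeout from the question, one option is to call flock() with LOCK_EX | LOCK_NB in a retry loop and give up once the deadline passes.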
Thanks for the help, guys.
UPDATE:
12 years later, thought I should point out that POSIX mutexes do have a "robust" attribute. That way, if the owner of the mutex gets killed or exits, the next user to lock the mutex will get the return value EOWNERDEAD (the lock is still acquired), allowing the mutex to be recovered. This will make it similar to the file and socket locking solution. Look up pthread_mutexattr_setrobust() and pthread_mutex_consistent() for details. Thanks, Reinier Torenbeek, for this hint.
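A minimal sketch of the robust mutex pattern follows. The helper names init_robust_mutex() and lock_robust() are hypothetical. It also assumes that, for use between processes, the pthread_mutex_t is placed in memory all of them share (for example a shm_open() plus mmap() region), which is left out here.

```cpp
#include <cerrno>
#include <pthread.h>

// Initialise a mutex as robust and process-shared. For cross-process use
// the pthread_mutex_t itself must live in shared memory (not shown).
int init_robust_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Lock the mutex. Returns 0 when the lock is held and usable; *recovered is
// set to true when the previous owner died while holding it.
int lock_robust(pthread_mutex_t *m, bool *recovered)
{
    *recovered = false;
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        // The lock is held, but the protected state may be inconsistent.
        // Repair that state here, then mark the mutex usable again.
        rc = pthread_mutex_consistent(m);
        *recovered = (rc == 0);
    }
    return rc;
}
```

Unlike a semaphore, a mutex has an owner, and the kernel needs that ownership to notice that the holder has died.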
Use a lock file instead of a semaphore, much like @Stéphane's solution but without the flock() calls. You can simply open the file using an exclusive lock:
This is a typical problem when managing semaphores. Some programs use a single process to manage the initialization/deletion of the semaphore. Usually this process does just this and nothing else. Your other applications can wait until the semaphore is available. I've seen this done with the SYSV type API, but not with POSIX. Similar to what 'Duck' mentioned, you can use the SEM_UNDO flag in your semop() call.
But, with the information that you've provided, I would suggest that you do not use semaphores, especially if your process is in danger of being killed or crashing. Try to use something that the OS will clean up automagically for you.
You'll need to double check, but I believe sem_post() can be called from a signal handler. If you are able to catch some of the situations that are bringing down the process, this might help.
Unlike a mutex, any process or thread (with permissions) can post to the semaphore. You can write a simple utility to reset it. Presumably you know when your system has deadlocked. You can bring it down and run the utility program.
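Such a reset utility could be as small as the sketch below. The function name reset_named_semaphore() and its target parameter are hypothetical. It assumes the other processes have been stopped first, since reading the value and then posting is not atomic.

```cpp
#include <fcntl.h>
#include <semaphore.h>

// Hypothetical reset helper: open an existing named semaphore and post to it
// until its value reaches `target`. Returns 0 on success, -1 on failure.
int reset_named_semaphore(const char *name, int target)
{
    sem_t *sem = sem_open(name, 0);   // 0: do not create, must already exist
    if (sem == SEM_FAILED)
        return -1;

    int value = 0;
    int rc = sem_getvalue(sem, &value);
    while (rc == 0 && value < target) {
        rc = sem_post(sem);
        ++value;
    }
    sem_close(sem);
    return rc;
}
```

For the semaphore in the question, the call would be reset_named_semaphore("/testing", 1).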
Also, the semaphore is usually listed under /dev/shm and you can remove it.
SysV semaphores are more accommodating for this scenario. You can specify SEM_UNDO, in which case the system will back out changes to the semaphore made by a process if it dies. They also have the ability to tell you the last process ID to alter the semaphore.
You should be able to find it from the shell using lsof. Then possibly you can delete it?
Update
Ah yes... man -k semaphore to the rescue.
It seems you can use ipcrm to get rid of a semaphore. Seems you aren't the first with this problem.
If the process was KILLed then there won't be any direct way to determine that it has gone away.
You could operate some kind of periodic integrity check across all the semaphores you have - use semctl (cmd=GETPID) to find the PID for the last process that touched each semaphore in the state you describe, then check whether that process is still around. If not, perform clean up.
If you use a named semaphore, then you can use an algorithm like the one used in lsof or fuser.
Take these into consideration:
1. Each named POSIX semaphore creates a file in a tmpfs file system, usually under /dev/shm.
2. Each process has a map_files directory in Linux, under /proc/[PID]/map_files.
These map files show which part of a process's memory maps to what!
So using these steps, you can find whether the named semaphore is still opened by another process or not:
1- (Optional) Find the exact path of the named semaphore (in case it's not under /dev/shm). Find the address the semaphore pointer refers to (usually by casting the pointer to an integer type) and convert it to a hexadecimal number (e.g. 0xffff1234), then use this path: /proc/self/map_files/ffff1234-*. There should be only one file that fulfills this criterion. Get the symbolic link target of that file. It is the full path of the named semaphore.
2- Iterate over all processes to find a map file whose symbolic link target matches the full path of the named semaphore. If there is one, then the semaphore is in real use, but if there is none, then you can safely unlink the named semaphore and reopen it for your usage.
UPDATE
In step 2, when iterating over all processes, instead of iterating over all files in the map_files folder, it is better to use the file /proc/[PID]/maps and search for the full path of the named semaphore file (e.g. /dev/shm/sem_xyz) inside it. With this approach, even if some other program has unlinked the named semaphore while it is still in use by other processes, it can still be found, but with a "(deleted)" flag appended to the end of its file path.
Simply do a sem_unlink() immediately after the sem_open(). Linux will remove the semaphore after all processes have closed the resource, which includes internal closes.