File handle leaks in a C library (maybe) giving trouble to NFS (+python, but only incidentally)
Here is a quite cool problem.
I have a python script (main) that calls a python module (foo.py), which in turn calls another python module (barwrapper.py) that uses LoadLibrary to dynamically open and access a libbar.so library.
libbar and the rest of the chain open and create files to perform their task. The problem arises when we issue an rmtree in the main python script to get rid of the temporary directory created by the imported modules. rmtree is invoked at the end of the script, just before exiting. The call fails because the directory contains hidden .nfs-whatever files, which I guess are the removed files. These files are apparently kept open somewhere in the code, forcing nfs to rename them to these .nfs-whatever files until the file descriptor is released. This situation does not arise on other filesystems, because a file associated with a held descriptor is effectively removed from the directory but kept accessible by the kernel until the descriptor is closed.
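That local-filesystem behaviour (unlink the name, keep reading through the held descriptor) can be seen in a minimal sketch, assuming a non-NFS temp directory:

```python
import os
import tempfile

# Sketch of the non-NFS semantics: unlink a file while a descriptor is
# still open; the name disappears but the data stays readable until the
# descriptor is closed. (On NFS the client would instead rename the file
# to a .nfsXXXX sibling, which is what trips up rmtree.)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "scratch.dat")

f = open(path, "w+")
f.write("still here")
f.flush()

os.unlink(path)                   # name removed from the directory
assert not os.path.exists(path)   # no visible file any more

f.seek(0)
recovered = f.read()              # the held descriptor still reads the data
f.close()
os.rmdir(tmpdir)                  # directory is now truly empty
```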
We strongly suspect that the .so library is leaking file descriptors, and that these non-closed files ruin the rmtree party at cleanup time. I thought about unloading the .so file in barwrapper, but apparently there's no way to do that, and I am not sure whether the dynloader would actually remove the lib from the process space and close its descriptors, or whether it would just mark it unloaded and leave it at that, waiting to be replaced by something else, with the descriptors still leaked.
I can't really think of other workarounds to the problem (apart from fixing the leaks, something we would rather not do, as it's a 3rd party library). Clearly, it happens only on nfs. Do you have any idea we could try out to fix it?
评论(2)
The kernel keeps track of file descriptors, so even if you got python to unload the .so and release the memory, it would not know to close the leaked file descriptors. The only thing that comes to mind is importing the .so after forking, and only cleaning up after the forked child process has exited (and the file handles implicitly closed on exit by the kernel).
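The fork approach could look like the sketch below. The `barwrapper` import is only referenced in a comment; the child here deliberately "leaks" a descriptor of its own to stand in for libbar, so the sketch is runnable as-is:

```python
import os
import shutil
import tempfile

def run_leaky_work(workdir):
    """Sketch of the fork-based workaround: do the library calls in a
    child process, so any descriptors it leaks are closed by the kernel
    when the child exits."""
    pid = os.fork()
    if pid == 0:  # child
        # Stand-in for the real work, e.g.:
        #   import barwrapper; barwrapper.do_work(workdir)
        # Here we deliberately leak a descriptor to mimic libbar.
        leaked = open(os.path.join(workdir, "scratch.dat"), "w")
        leaked.write("data")
        os._exit(0)  # exit without closing; the kernel reclaims the fd
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

workdir = tempfile.mkdtemp()
rc = run_leaky_work(workdir)
# By now no process holds files under workdir open, so rmtree should
# not hit .nfsXXXX leftovers even on NFS.
shutil.rmtree(workdir)
```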
The good solution is to fix the handle leak, but if you're not sure who is leaking, maybe a strace call would help you locate the leak and submit the bug to the maintainers of the 3rd party library (or better, if it is an open source library, try to submit a patch ;) ).
On the other hand, maybe a umount/mount on the nfs partition could help force the handles closed.
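As a companion to strace, you can dump the process's currently open descriptors right before the rmtree call to see which files are still held. This is a Linux-only sketch (it walks `/proc/self/fd`), and `/tmp/fd_demo.txt` is just a hypothetical file for the demo:

```python
import os

def open_fds():
    """Map this process's open file descriptors to what they point at
    (Linux-only: reads the /proc/self/fd symlinks). Entries still
    pointing into the temporary directory at cleanup time are the
    leaked descriptors that cause the .nfsXXXX files."""
    fd_dir = "/proc/self/fd"
    table = {}
    for entry in os.listdir(fd_dir):
        try:
            table[int(entry)] = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            continue  # the fd used by listdir itself may already be gone
    return table

f = open("/tmp/fd_demo.txt", "w")  # hypothetical file, just for the demo
fd = f.fileno()
table = open_fds()
f.close()
os.remove("/tmp/fd_demo.txt")
```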