Are there special considerations when performing file I/O on an NFS share from a Python-based daemon?
I have a Python-based daemon that provides a REST-like interface over HTTP to some command-line tools. The general nature of the tool is to take in a request, perform some command-line action, store a pickled data structure to disk, and return some data to the caller. There's a secondary thread spawned on daemon startup that periodically looks at that pickled data on disk and does some cleanup based on what's in the data.
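To make the shape of the daemon concrete, here is a minimal sketch of that write-a-pickle / cleanup-thread pattern. This is not the actual CondorAgent code (that is linked at the bottom of this question); the path, `run_command_line_action()` and `maybe_clean_up()` are illustrative stand-ins.

```python
import os
import pickle
import tempfile
import threading
import time

DATA_DIR = "/tech/condor_logs/submit"   # illustrative; the real path comes from configuration

def handle_request(request_data):
    """Handle one HTTP request: run a command, pickle the result to disk, return it."""
    result = run_command_line_action(request_data)          # illustrative helper
    fd, path = tempfile.mkstemp(suffix=".pkl", dir=DATA_DIR)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(result, f)
    return result

def cleanup_loop(interval=60):
    """Secondary thread: periodically read the pickles and tidy up based on their contents."""
    while True:
        for name in os.listdir(DATA_DIR):
            if name.endswith(".pkl"):
                with open(os.path.join(DATA_DIR, name), "rb") as f:
                    data = pickle.load(f)
                maybe_clean_up(data)                         # illustrative helper
        time.sleep(interval)

threading.Thread(target=cleanup_loop).start()
```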
This works just fine if the disk where the pickled data resides happens to be local disk on a Linux machine. If you switch to an NFS-mounted disk, the daemon starts life just fine, but over time the NFS-mounted share "disappears" and the daemon can no longer tell where it is on disk with calls like os.getcwd(). You'll start to see it log errors like:
2011-07-13 09:19:36,238 INFO Retrieved submit directory '/tech/condor_logs/submit'
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): handler.path: /condor/submit?queue=Q2%40scheduler
2011-07-13 09:19:36,239 DEBUG CondorAgent.post_submit.do_submit(): submitting from temporary submission directory '/tech/condor_logs/submit/tmpoF8YXk'
2011-07-13 09:19:36,240 ERROR Caught un-handled exception: [Errno 2] No such file or directory
2011-07-13 09:19:36,241 INFO submitter - - [13/Jul/2011 09:19:36] "POST /condor/submit?queue=Q2%40scheduler HTTP/1.1" 500 -
The un-handled exception resolves to the daemon being unable to see the disk any more. Any attempts to figure out the daemon's current working directory with os.getcwd() at this point will fail. Even trying to change to the root of the NFS mount, /tech, will fail. All the while, the logger.logging.* methods are happily writing out log and debug messages to a log file located on the NFS-mounted share at /tech/condor_logs/logs/CondorAgentLog.
The disk is most definitely still available. There are other, C++-based daemons reading and writing on this share at a much higher frequency than the Python-based daemon at the time.

I've come to an impasse debugging this problem. Since it works on local disk, the general structure of the code must be good, right? There's something about NFS-mounted shares and my code that is incompatible, but I can't tell what it might be.
Are there special considerations one must take into account when dealing with a long-running Python daemon that will be reading and writing frequently to an NFS-mounted file share?
If anyone wants to see the code, the portion that handles the HTTP request and writes the pickled object to disk is on GitHub here. And the portion that the sub-thread uses to do periodic cleanup of stuff from disk by reading the pickled objects is here.
2 Answers
I have the answer to my problem, and it had nothing to do with the fact that I was doing file I/O on an NFS share. It turns out the problem just showed up faster if the I/O was over an NFS mount versus local disk.
A key piece of information is that the code was running threaded via the SocketServer.ThreadingMixIn and HTTPServer classes. My handler code was doing something close to the following:
That's the flow, more or less.
The problem wasn't that the I/O was being done on NFS. The problem was that os.getcwd() isn't thread-local; it's a process global. So as one thread issued a chdir() to move to the temporary space it had just created under base_dir, the next thread calling os.getcwd() would get the other thread's temporary_dir instead of the static base directory where the HTTP server was started. There are some other people reporting similar issues here and here.
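A minimal, self-contained demonstration of that process-global behaviour (the paths and timings are hypothetical, not the CondorAgent code): one thread calls os.chdir() and a second thread observes the changed directory, because the working directory belongs to the process, not to any single thread.

```python
import os
import threading
import time

start_dir = os.getcwd()

def wanderer():
    os.chdir("/tmp")                                # stands in for chdir(temporary_dir)
    time.sleep(1.0)                                 # stands in for a slow doSomething()
    os.chdir(start_dir)

def observer():
    time.sleep(0.5)                                 # runs while wanderer() is "in" /tmp
    print("observer sees cwd = " + os.getcwd())     # prints /tmp, not start_dir

t1 = threading.Thread(target=wanderer)
t2 = threading.Thread(target=observer)
t1.start(); t2.start()
t1.join(); t2.join()
```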
The solution was to get rid of the chdir() and getcwd() calls: start up and stay in one directory, and access everything else through absolute paths (a rough sketch of the fixed flow is at the end of this answer).

The NFS vs. local file stuff threw me for a loop. It turns out that block, the chdir() / doSomething() / chdir() sequence sketched above, was running much slower when the filesystem was NFS versus local. That made the problem occur much sooner, because it increased the chances that one thread was still in doSomething() while another thread was running the current_dir = os.getcwd() part of the code block. On local disk the threads moved through the entire code block so quickly that they rarely intersected like that. But give it enough time (about a week) and the problem would crop up when using local disk as well.

So, a big lesson learned on thread-safe operations in Python!
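A minimal sketch of the fixed flow, using the same hypothetical getBaseDirFromConfigFile() and doSomething() stand-ins as above; every path is absolute and the process-wide working directory is never touched:

```python
import os
import tempfile

base_dir = getBaseDirFromConfigFile()           # hypothetical stand-in, as above
temporary_dir = tempfile.mkdtemp(dir=base_dir)  # absolute path to per-request scratch space
doSomething(temporary_dir)                      # pass the directory in and build paths with
                                                # os.path.join(temporary_dir, ...) instead of
                                                # relying on the process-wide cwd
```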
To answer the question literally: yes, there are some gotchas with NFS. E.g.:
NFS is not cache coherent, so if several clients are accessing a file they might get stale data.
In particular, you cannot rely on O_APPEND to atomically append to files.
Depending on the NFS server, O_CREAT|O_EXCL might not work properly (it does work properly on modern Linux, at least); see the lock-file sketch after this list.
Especially older NFS servers have deficient or completely non-working locking support. Even on more modern servers, lock recovery can be a problem after server and/or client reboot. NFSv4, a stateful protocol, ought to be more robust here than older protocol versions.
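As one illustration of working around the locking caveats, a common idiom is to take a lock by creating a lock file with O_CREAT|O_EXCL rather than relying on flock()/lockf() over NFS. A minimal sketch follows; the lock path is hypothetical, and this assumes a server/client combination where O_EXCL is honoured atomically:

```python
import errno
import os

def try_acquire_lock(lock_path):
    """Return True if we created lock_path atomically, False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False                     # another process or host holds the lock
        raise
    os.write(fd, ("%d\n" % os.getpid()).encode("ascii"))
    os.close(fd)
    return True

def release_lock(lock_path):
    os.unlink(lock_path)

# Usage (hypothetical path on the NFS share):
if try_acquire_lock("/tech/condor_logs/submit/.cleanup.lock"):
    try:
        pass                                 # do the exclusive work here
    finally:
        release_lock("/tech/condor_logs/submit/.cleanup.lock")
```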
All this being said, it sounds like your problem isn't really related to any of the above. In my experience, the Condor daemons will at some point, depending on the configuration, clean up files left over from jobs that have finished. My guess would be to look for the suspect here.