Program stalls during long runs
I am running a simulation on a computer running Ubuntu Server 10.04.3. Short runs (<24 hours) run fine, but long runs eventually stall. By stall, I mean that the program no longer gets any CPU time, but it still holds all its data in memory. To run these simulations, I SSH in, start the program with nohup, and send all output to a file.
Miscellaneous information:
The system is definitely not running out of RAM. The program does not need to read or write to the hard drive until completion; the computation is done completely in memory. The program has not been killed, as it still has a PID after it stalls. I am using OpenMP, but I have increased the max number of processes, and the max time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK Fortran library.
Any thoughts on what is causing this behavior, or on how to resume my currently stalled program?
Comments (3)
I assume from your tags that this is an OpenMP program, though you never actually state this. Is ARPACK thread-safe?
It sounds like you are hitting a deadlock (more common in MPI programs than OpenMP, but it's definitely possible). The first thing to do is to compile with debugging flags on, then the next time you find this problem, attach with a debugger and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
Next time your program "stalls", attach GDB to it and run
thread apply all where
and check whether the threads are deadlocked.
Generally on UNIX you don't need to rebuild with debug flags on to get a meaningful stack trace. You wouldn't get file/line numbers, but they may not be necessary to diagnose the problem.
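The attach-and-dump step can also be done non-interactively, which is handy over SSH. The following is only a sketch: the function name `dump_stacks` and the default log file name are hypothetical, and it assumes gdb is installed and that you are allowed to attach to the process (you may need to run it as the owner of the process or as root).

```shell
# dump_stacks PID [LOGFILE]
# Attach gdb to a running process, print a backtrace of every thread,
# then exit gdb, which detaches and leaves the process running.
# The function name and default log name are hypothetical.
dump_stacks() {
    pid=$1
    # Reject a missing or non-numeric PID before touching gdb.
    case $pid in
        ''|*[!0-9]*)
            echo "usage: dump_stacks PID [LOGFILE]" >&2
            return 2
            ;;
    esac
    log=${2:-stacks-$pid.log}
    # -batch makes gdb exit after running the -ex commands.
    gdb -p "$pid" -batch -ex "thread apply all where" > "$log" 2>&1 || return 1
    echo "wrote $log"
}
```

For example, `dump_stacks 12345` would write the traces to `stacks-12345.log`. If every thread is parked in a lock or barrier wait (frames such as `pthread_mutex_lock`), a deadlock is the likely explanation.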
A possible way of understanding what a running program (that is, a process) is doing is to attach a debugger to it with
gdb program *pid*
(which works well only when the program has been compiled with debugging enabled with -g), or to use strace on it with
strace -p *pid*
The strace command is a utility (technically, a specialized debugger built on top of the ptrace system call interface) that shows you all the system calls made by a program or process. There is also a variant, called ltrace, that intercepts calls to functions in dynamic libraries. To get a feel for it, try for instance
strace ls
Of course, strace won't help you much if the running program is not making any system calls.
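Because the program in question is multithreaded, it may help to look at each thread's state before attaching strace. The helpers below are a sketch with hypothetical names, assuming the procps version of ps found on Linux; `strace -f -p` attaches to all threads of the process rather than just the main one.

```shell
# describe_state STAT: explain the first letter of the STAT column from ps.
describe_state() {
    case $1 in
        R*) echo "running" ;;
        S*) echo "sleeping" ;;               # waiting for an event, e.g. a lock
        D*) echo "uninterruptible sleep" ;;  # usually waiting on I/O
        T*) echo "stopped" ;;                # stopped by a signal
        Z*) echo "zombie" ;;
        *)  echo "unknown" ;;
    esac
}

# show_threads PID: one line per thread with its state, CPU usage,
# and the kernel function it is waiting in (wchan).
show_threads() {
    ps -L -o pid,lwp,stat,pcpu,wchan:32,comm -p "$1"
}

# trace_threads PID: attach strace to every thread of PID.
# Press Ctrl-C to detach; the process keeps running.
trace_threads() {
    strace -f -p "$1"
}
```

If `show_threads` reports a state starting with T, the process has been stopped by a signal, and `kill -CONT` followed by the PID should resume it. If all threads are sleeping with 0% CPU, a stack trace from a debugger is likely to say more than strace will.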