Methods/tools for solving a mysterious segfault when running on Condor
I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.
Clues:
- On Average when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
- When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.
Attempts:
- I ran the code in valGrind for four days locally with no memory errors.
- I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
- Now when a segfault happens I can print out the current stack using backtrace.
- I can print out variable values.
- I created a variable which is set to the current line number.
- Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.
Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
- I suspect that the check pointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don't see it locally.
- That indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC tutorial on catching segfaults and get the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.
UPDATE 3
The segfault was being caused by differences between hosts, by setting the 'requiremets' field in the condor submit file to problem completely disappeared.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See requirements example here
Comments (5)
If you can, compile with debugging symbols and run under gdb.
Alternatively, get a core dump and load that into the debugger.
MPICH has a built-in debugger, or you can buy a commercial parallel debugger.
Then you can step through the code in the debugger to see what is happening.
http://nmi.cs.wisc.edu/node/1610
http://nmi.cs.wisc.edu/node/1611
Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.
Look at what instruction caused the fault. Was it even a valid instruction, or are you trying to execute data? If it was valid, what memory is it trying to access? Where did this pointer come from? You need to narrow down the location of your fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built-in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common case for these is using memory that has been free()'d, so logging your malloc() / free() pairs might help.
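As a rough illustration of logging malloc()/free() pairs, one low-tech approach is a pair of wrapper macros; the names here are invented for the example:

#include <stdio.h>
#include <stdlib.h>

/* Wrappers that record every allocation and free with its source location.
   Pairing up the logged pointers afterwards exposes double frees and
   use-after-free candidates. */
static void *log_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    fprintf(stderr, "MALLOC %p %zu %s:%d\n", p, size, file, line);
    fflush(stderr);
    return p;
}

static void log_free(void *p, const char *file, int line)
{
    fprintf(stderr, "FREE   %p %s:%d\n", p, file, line);
    fflush(stderr);
    free(p);
}

#define MALLOC(size) log_malloc((size), __FILE__, __LINE__)
#define FREE(p)      log_free((p), __FILE__, __LINE__)

Setting MALLOC_CHECK_=3 in the job's environment makes glibc print a diagnostic and abort when it detects certain kinds of heap corruption, which can move the crash closer to the real bug.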
If you have used the condor_compile tool to relink your code with the condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code, and uses its own malloc. Another big difference is that condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.
The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?
If not, can you correlate the segfaults with when condor restarts the executable from a checkpoint (the user log will tell you when this happens)?
You've tried most of what I'd think of. The only other thing I'd suggest is to start adding a lot of logging code and hope you can narrow down where the error is happening.
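A sketch of what that logging might look like (the macro name is arbitrary); flushing on every message matters, since a segfault would otherwise lose the buffered tail of the log:

#include <stdio.h>

/* Trace macro: prints file, line and a message, then flushes immediately
   so the last lines survive a crash. */
#define TRACE(msg)                                                 \
    do {                                                           \
        fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg)); \
        fflush(stderr);                                            \
    } while (0)

/* Usage: sprinkle TRACE("entering solver loop"); through suspect code. */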
The one thing you do not say is how much flexibility you have to solve the problem.
Can you, for example, have the system come to a halt and just run your application?
Also, how important are these crashes to solve? I am assuming that, for the most part, they are. This may require a lot of resources.
The short-term step is to put tons of "asserts" (semi hand-written) on each variable
to make sure it hasn't changed when you don't want it to. This can continue to work as you go through the long-term process.
Long term-- try running it on a cluster of two ( maybe your home computer and a VM ).
Do you still see the segfaults. If not increase the cluster size until you start seeing segfaults.
Run it on a minimum configuration ( to get segfaults ) and record all your inputs till a crash. Automate running the system with the inputs that you recorded, tweaking them until you can consistent get a crash with minimal input.
At that point look around. If you still can't find the bug, then you will have to ask again with some extra data you gathered with those runs.