Methods/tools for solving a mysterious segfault when running on Condor
I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.
Clues:
- On Average when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
- When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.
Attempts:
- I ran the code in valGrind for four days locally with no memory errors.
- I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
- Now when a segfault happens I can print out the current stack using backtrace.
- I can print out variable values.
- I created a variable which is set to the current line number.
- Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.
Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
- I suspect that the check pointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don't see it locally.
- That indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC tutorial on catching segfaults and get the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.
UPDATE 3
The segfault was being caused by differences between hosts, by setting the 'requiremets' field in the condor submit file to problem completely disappeared.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See requirements example here
Comments (5)
If you can, compile with debugging symbols and run under gdb.
Alternatively, get a core dump and load that into the debugger.
MPICH has a built-in debugger, or you can buy a commercial parallel debugger.
Then you can step through the code in the debugger to see what is happening.
http://nmi.cs.wisc.edu/node/1610
http://nmi.cs.wisc.edu/node/1611
Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.
Look at what instruction caused the fault. Was it even a valid instruction, or are you trying to execute data? If it was valid, what memory is it trying to access? Where did this pointer come from? You need to narrow down the location of your fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built-in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common case for these is using memory that has been free()'d, so logging your malloc() / free() pairs might help.
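As a rough illustration of logging malloc()/free() pairs, one low-tech approach is a pair of wrapper macros; the names here are invented for the example:

#include <stdio.h>
#include <stdlib.h>

/* Wrappers that record every allocation and free with its source location.
   Pairing up the logged pointers afterwards exposes double frees and
   use-after-free candidates. */
static void *log_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    fprintf(stderr, "MALLOC %p %zu %s:%d\n", p, size, file, line);
    fflush(stderr);
    return p;
}

static void log_free(void *p, const char *file, int line)
{
    fprintf(stderr, "FREE   %p %s:%d\n", p, file, line);
    fflush(stderr);
    free(p);
}

#define MALLOC(size) log_malloc((size), __FILE__, __LINE__)
#define FREE(p)      log_free((p), __FILE__, __LINE__)

Setting MALLOC_CHECK_=3 in the job's environment makes glibc print a diagnostic and abort when it detects certain kinds of heap corruption, which can move the crash closer to the real bug.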
If you have used the condor_compile tool to relink your code with the condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code, and uses its own malloc. Another big difference is that condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.
The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?
If not, can you correlate the segfaults with when condor restarts the executable from a checkpoint (the user log will tell you when this happens)?
You've tried most of what I'd think of. The only other thing I'd suggest is to start adding a lot of logging code and hope you can narrow down where the error is happening.
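A sketch of what that logging might look like (the macro name is arbitrary); flushing on every message matters, since a segfault would otherwise lose the buffered tail of the log:

#include <stdio.h>

/* Trace macro: prints file, line and a message, then flushes immediately
   so the last lines survive a crash. */
#define TRACE(msg)                                                 \
    do {                                                           \
        fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg)); \
        fflush(stderr);                                            \
    } while (0)

/* Usage: sprinkle TRACE("entering solver loop"); through suspect code. */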
The one thing you do not say is how much flexibility you have to solve the problem.
Can you, for example, have the system come to a halt and just run your application?
Also, how important are these crashes to solve? I am assuming that, for the most part, they are. This may require a lot of resources.
The short-term step is to put tons of "asserts" (semi hand-written) on each variable
to make sure it hasn't changed when you don't want it to. This can continue to work as you go through the long-term process.
Long term-- try running it on a cluster of two ( maybe your home computer and a VM ).
Do you still see the segfaults. If not increase the cluster size until you start seeing segfaults.
Run it on a minimum configuration ( to get segfaults ) and record all your inputs till a crash. Automate running the system with the inputs that you recorded, tweaking them until you can consistent get a crash with minimal input.
At that point look around. If you still can't find the bug, then you will have to ask again with some extra data you gathered with those runs.