Debugging core files generated on a Customer's box
We get core files from running our software on a Customer's box. Unfortunately, because we've always compiled with -O2 and without debugging symbols, this has led to situations where we could not figure out why it was crashing. We've modified the builds so they now compile with -g and -O2 together. We then advise the Customer to run a -g binary so it becomes easier to debug.
I have a few questions:
- What happens when a core file is generated on a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
- Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.
Here's an example of a crash that requires us to tell the Customer to get a -g version of the binary:
Program terminated with signal 11, Segmentation fault.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>
Ideally I'd like to find out exactly why the app crashed. I suspect it's memory corruption, but I am not 100% sure.
Remote debugging is strictly not allowed.
Thanks
Comments (5)
If the executable is dynamically linked, as yours is, the stack GDB produces will (most likely) not be meaningful.
The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6, discovers that this address is in select, and prints that.
But the chances that 0x00454ff1 is also in select in your customer's copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.
You can use disas select and observe that 0x00454ff1 is either in the middle of an instruction, or that the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.
You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up and send them to you. Then, on your system, unpack them and have GDB load those copies instead of your own.
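As a rough sketch of the tar-and-unpack steps described above (not the answerer's exact commands): the helper names pack_libs, unpack_libs and write_gdb_cmds are made up for illustration, and GNU tar-style options are assumed. The prefix is set before the core is loaded, which is the safe order.

```shell
#!/bin/sh
# Customer side: archive the libraries the crashed process had loaded.
# Usage: pack_libs OUT.tar.gz LIB...   (paths as listed by "info shared")
pack_libs() {
    out=$1; shift
    # -h follows symlinks (libc.so.6 is often one), so the real files are stored.
    tar -czhf "$out" "$@"
}

# Developer side: unpack into a private directory that acts as the library root.
# Usage: unpack_libs IN.tar.gz DEST_DIR
unpack_libs() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Developer side: write a GDB command file that sets the library prefix
# first, then loads the core and prints the stack.
# Usage: write_gdb_cmds DEST_DIR CORE OUT_FILE
write_gdb_cmds() {
    cat > "$3" <<EOF
set solib-absolute-prefix $1
core-file $2
where
EOF
}
```

With hypothetical paths, the customer might run pack_libs to-you.tar.gz /lib/libc.so.6 /lib/ld-linux.so.2 (using the real list from info shared). On your side, unpack_libs to-you.tar.gz /tmp/from-customer and write_gdb_cmds /tmp/from-customer core cmds.gdb, followed by gdb -x cmds.gdb /path/to/myexe.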
A much better approach is:
- build with -g -O2 -o myexe.dbg
- strip -g myexe.dbg -o myexe
- distribute myexe to customers
- when a customer gets a core, use myexe.dbg to debug it
You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
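The build-then-strip steps above could be wrapped roughly like this. It is only a sketch: build_pair is a made-up helper name, a single C source file is assumed, and the compiler is taken from CC (falling back to cc), with GNU strip for the second step.

```shell
#!/bin/sh
# Build one optimized binary with debug info, then derive the shipped copy from it.
# Usage: build_pair SOURCE.c NAME   -> NAME.dbg (keep in-house) and NAME (ship)
build_pair() {
    src=$1; name=$2
    "${CC:-cc}" -g -O2 -o "$name.dbg" "$src" || return 1
    # strip -g removes only the debugging sections; the code is unchanged,
    # so addresses in a customer's core still line up with NAME.dbg.
    strip -g "$name.dbg" -o "$name"
}
```

When a core comes back from the shipped myexe, it would then be opened against the unstripped file, for example gdb myexe.dbg core.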
You can indeed get useful information from a crash dump, even one from an optimized compile (although it's what is called, technically, "a major pain in the ass"). A -g compile is indeed better, and yes, you can do so even when the machine on which the dump happened is another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.
When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens: can they reproduce it reliably? If so, can you reproduce it?
Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see what library .so it happens in.
There are a number of books about debugging, but I can't think of one I'd recommend.
Copying the resolution from my question, which was considered a duplicate of this one.
set solib-absolute-prefix from the accepted solution did not help for me. set sysroot was absolutely necessary to make gdb load locally provided libs.
Here is the list of commands I used to open the core dump:
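A session built around set sysroot might look roughly like the following. This is a sketch under assumptions, not the poster's actual commands: core_bt is a made-up helper name, the GDB environment variable override is a hypothetical convenience, and LIBROOT is assumed to hold the customer's libraries laid out as on their system.

```shell
#!/bin/sh
# Print a backtrace from a core, resolving shared libraries under LIBROOT.
# Usage: core_bt LIBROOT EXE CORE
# The debugger can be overridden through the (hypothetical) GDB variable.
core_bt() {
    # sysroot is set first so the libraries are looked up under LIBROOT
    # when the core is loaded.
    "${GDB:-gdb}" -batch \
        -ex "set sysroot $1" \
        -ex "file $2" \
        -ex "core-file $3" \
        -ex "bt"
}
```

For example, core_bt /tmp/from-customer ./myexe.dbg core, where all three paths are placeholders.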
As far as I remember, you don't need to ask your customer to run the binary built with the -g option. What is needed is that you have a build made with -g. With that you can load the core file and it will show the whole stack trace. I remember that a few weeks ago I created core files from builds with and without -g, and the size of the core was the same.
Inspect the values of the local variables you see when you walk the stack, especially around the select() call. Do this on the customer's box: just load the dump and walk the stack.
Also, check the value of FD_SETSIZE on both your DEV and PROD platforms!
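One way to compare the value on the two platforms is to compile a tiny program against each platform's headers. The sketch below assumes a C compiler is reachable as cc (or through CC), and print_fd_setsize is a made-up helper name. One plausible reason the value matters: passing select() a descriptor numbered FD_SETSIZE or higher is undefined behaviour and can write past the end of the fd_set, which would fit a memory-corruption suspicion, though nothing in the trace confirms that.

```shell
#!/bin/sh
# Print the FD_SETSIZE that the local headers compile in.
print_fd_setsize() {
    tmp=$(mktemp -d) || return 1
    cat > "$tmp/fdsz.c" <<'EOF'
#include <stdio.h>
#include <sys/select.h>
int main(void) { printf("%d\n", (int)FD_SETSIZE); return 0; }
EOF
    # Compile and run; keep the exit status so failures are reported.
    "${CC:-cc}" -o "$tmp/fdsz" "$tmp/fdsz.c" && "$tmp/fdsz"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

Running print_fd_setsize on both platforms, and comparing the result with the per-process descriptor limit from ulimit -n, shows whether the process could ever hold a descriptor too large for select().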