Debugging core files generated on a customer's box

Posted on 2025-01-11 11:06:51

We get core files from running our software on a customer's box. Unfortunately, because we have always compiled with -O2 and without debugging symbols, there have been situations where we could not figure out why it was crashing. We have modified the builds so that they now use -g and -O2 together, and we then advise the customer to run a -g binary so that it becomes easier to debug.

I have a few questions:

  1. What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
  2. Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.

Here's an example of a crash that requires us to tell the customer to get a -g version of the binary:

Program terminated with signal 11, Segmentation fault.
#0  0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>

Ideally I'd like to find out exactly why the app crashed. I suspect it's memory corruption, but I am not 100% sure.

Remote debugging is strictly not allowed.

Thanks

Comments (5)

夜清冷一曲。 2025-01-18 11:06:51

What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?

If the executable is dynamically linked, as yours is, the stack GDB produces will (most likely) not be meaningful.

The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6 and discovers that this is in select, so it prints that.

But the chances that 0x00454ff1 is also in select in your customer's copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.

You can use disas select, and observe that 0x00454ff1 is either in the middle of an instruction, or that the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.

You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up with e.g.

cd /
tar cvzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...
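
The file list for that tar command can be derived from the `info shared` output rather than typed by hand. The following is a minimal sketch, not part of the original answer: the helper names and the sample addresses are made up for illustration, and it assumes library paths contain no spaces. It also adds tar's -h option so that the files behind symlinks such as libc.so.6 are archived, not the symlinks themselves.

```python
import shlex


def shared_library_paths(info_shared_output):
    """Return absolute library paths found in `info sharedlibrary` text."""
    paths = []
    for line in info_shared_output.splitlines():
        fields = line.split()
        # The library path is the last column; this skips the header row and
        # entries without an absolute path (for example linux-vdso.so.1).
        if fields and fields[-1].startswith("/") and fields[-1] not in paths:
            paths.append(fields[-1])
    return paths


def tar_command(paths, archive="to-you.tar.gz"):
    """Build the shell command the customer would run to collect the files."""
    # Paths are made relative to / so they unpack under any prefix.
    # -h archives the files that symlinks point to, not the symlinks.
    members = [p.lstrip("/") for p in paths]
    words = [archive] + members
    return "cd / && tar czhf " + " ".join(shlex.quote(w) for w in words)


# Hypothetical output; the addresses are invented for the example.
sample = """\
From        To          Syms Read   Shared Object Library
0x00b1d830  0x00b2ff0f  Yes         /lib/ld-linux.so.2
0x00441b20  0x0055a8b4  Yes         /lib/libc.so.6
"""

print(tar_command(shared_library_paths(sample)))
```

Paste the text printed by (gdb) info shared in place of the sample and send the resulting command line to the customer.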

Then, on your system:

mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core  # Note: very important to set solib-... before loading core
(gdb) where      # Get meaningful stack trace!
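
If you handle many cores, the same session can be run non-interactively. Below is a small sketch, assuming gdb's standard -batch and -ex options; the function name is hypothetical and the paths are the placeholders used above. It uses set sysroot, which the GDB manual documents as the same setting as solib-absolute-prefix, and keeps it ahead of core-file for the reason given in the note above. Paths with spaces are not handled.

```python
import shlex


def gdb_core_command(binary, core, sysroot):
    """Build a batch-mode gdb invocation that prints a backtrace."""
    return [
        "gdb", "-batch",
        # Must come before core-file so the customer's libraries are used.
        "-ex", "set sysroot " + sysroot,
        "-ex", "core-file " + core,
        "-ex", "bt",
        binary,
    ]


cmd = gdb_core_command("/path/to/binary", "core", "/tmp/from-customer")
# Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
print(" ".join(shlex.quote(part) for part in cmd))
```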

We then advise the Customer to run a -g binary so it becomes easier to debug.

A much better approach is:

  • build with -g -O2 -o myexe.dbg
  • strip -g myexe.dbg -o myexe
  • distribute myexe to customers
  • when a customer gets a core, use myexe.dbg to debug it

You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
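
As a rough illustration of how the first two steps might sit in a build script, here is a sketch under some assumptions: gcc is the compiler, main.c stands in for your real sources, and the function names are made up. The commands themselves are the ones listed above.

```python
import subprocess


def split_debug_commands(sources, exe="myexe", cc="gcc"):
    """Return the build and strip commands for a debug/release pair."""
    dbg = exe + ".dbg"
    return [
        # Full build with debug info; keep this file in-house.
        [cc, "-g", "-O2", "-o", dbg] + list(sources),
        # Copy without debug sections; this is what ships to customers.
        ["strip", "-g", dbg, "-o", exe],
    ]


def build(sources, exe="myexe"):
    """Run both commands, stopping at the first failure."""
    for cmd in split_debug_commands(sources, exe):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here; call build(["main.c"]) to execute them.
    for cmd in split_debug_commands(["main.c"]):
        print(" ".join(cmd))
```

Archive myexe.dbg alongside every release you ship, since a core from myexe has to be debugged with the myexe.dbg from that same build.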

各空 2025-01-18 11:06:51

You can indeed get useful information from a crash dump, even one from an optimized compile (although it's what is called, technically, "a major pain in the ass"). A -g compile is indeed better, and yes, you can do so even when the machine on which the dump happened runs another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.

When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens -- can they reproduce it reliably? If so, can you reproduce it?

Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see what library .so it happens in.

There are a number of books about debugging, but I can't think of one I'd recommend.

一刻暧昧 2025-01-18 11:06:51

Copying the resolution from my question, which was considered a duplicate of this one.

set solib-absolute-prefix from the accepted solution did not help me. set sysroot was absolutely necessary to make gdb load locally provided libs.
Here is the list of commands I used to open the core dump:

# note: all the .so files obtained from user machine must be put into local directory.
#
# most importantly, the following files are necessary:
#   1. libthread_db.so.1 and libpthread.so.0: required for thread debugging.
#   2. other .so files are required if they occur in call stack.
#
# these files must also be renamed exactly as the symlinks
# i.e. libpthread-2.28.so should be renamed to libpthread.so.0

# load executable file
file ./thedarkmod.x64

# force gdb to forget about local system!
# load all .so files using local directory as root
set sysroot .

# drop dump-recorded paths to .so files
# i.e. load ./libpthread.so.0 instead of ./lib/x86_64-linux-gnu/libpthread.so.0
set solib-search-path .
# disable damn security protection
set auto-load safe-path /

# load core dump file
core core.6487

# print stacktrace
bt

喜你已久 2025-01-18 11:06:51

As far as I remember, you don't need to ask your customer to run the binary built with the -g option. What you need is to have a build with the -g option yourself. With that you can load the core file and it will show the whole stack trace. I remember that a few weeks ago I created core files with a -g build and with a build without -g, and the size of the core was the same.

囚你心 2025-01-18 11:06:51

Inspect the values of the local variables you see when you walk the stack, especially around the select() call. Do this on the customer's box: just load the dump and walk the stack...

Also, check the value of FD_SETSIZE on both your DEV and PROD platforms!
