How to debug a segmentation fault when the gdb stack trace is full of '??'?
My executable contains a symbol table, but it seems that the stack trace has been overwritten.
How can I get more information out of that core, please? For instance, is there a way to inspect the heap and see the object instances populating it, to get some clues? Any idea is appreciated.
Comments (8)
I am a C++ programmer for a living and I have encountered this issue more times than I like to admit. Your application is smashing a HUGE part of the stack. Chances are the function that is corrupting the stack is also crashing on return. This is because the return address has been overwritten, and this is why GDB's stack trace is messed up.
This is how I debug this issue:
1) Step through the application until it crashes. (Look for a function that is crashing on return.)
2) Once you have identified the function, declare a variable at the VERY FIRST LINE of the function, for example: int canary = 0;
(The reason why it must be the first line is that this value must be at the very top of the stack. This "canary" will be overwritten before the function's return address.)
3) Put a variable watch on canary, step through the function, and when canary != 0 you have found your buffer overflow! Another possibility is to put a variable breakpoint for when canary != 0 and just run the program normally. This is a little easier, but not all IDEs support variable breakpoints.
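To make the canary steps concrete, here is a minimal sketch. The function name copy_name and the 16-byte buffer are made up for illustration and are not from the original program. Note that compilers are free to reorder locals, so declaring the canary first does not guarantee where it lands on the stack; treat this as a debugging aid, not a guarantee. The copy is bounded so the sketch itself never overflows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical function under suspicion; the name and buffer size are
// illustrative assumptions, not taken from the original program.
bool copy_name(const char* src, std::size_t len) {
    volatile int canary = 0;  // declared first, as the answer suggests
    char buf[16];

    // Bounded copy, so this sketch itself never overflows buf.
    std::size_t n = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;
    std::memcpy(buf, src, n);
    buf[n] = '\0';

    // In gdb: "break copy_name", "run", then "watch canary" and "continue".
    // The watchpoint fires when something changes the value of canary.
    assert(canary == 0 && "stack canary was overwritten");
    return std::strlen(buf) == n;
}
```

With the watchpoint set, gdb reports the old and new value of canary at the point where it changes, which is usually close to the overflowing write.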
EDIT: I have talked to a senior programmer at my office, and in order to understand the core dump you need to resolve the memory addresses it contains. One way to figure out these addresses is to look at the MAP file for the binary, which is human readable. With gcc, a MAP file can be requested from the linker, for example: gcc -o foo -Wl,-Map,foo.map foo.c
This is a piece of the puzzle, but it will still be very difficult to obtain the address of the function that is crashing. If you are running this application on a modern platform, then ASLR will probably make the addresses in the core dump hard to use. Some implementations of ASLR randomize the function addresses of your binary, which makes the raw addresses in the core dump much less useful.
ex: gcc -Wall -g -c -o oke.o oke.c
3. Make sure you also have the -g option to produce debugging information. You can also print debugging information using some macros. The following macros are very useful for me:
__LINE__ : tells you the line
__FILE__ : tells you the source file
__func__ : tells you the function
Hope this helps.
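As a rough illustration of how these macros can be combined, the sketch below wraps them in a small trace macro. The names TRACE, format_location and parse_header are invented for this example and do not come from the original answer.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds "file:line (function)" from the three values.
inline std::string format_location(const char* file, int line, const char* func) {
    assert(file != nullptr && func != nullptr);
    return std::string(file) + ":" + std::to_string(line) + " (" + func + ")";
}

// Hypothetical macro: prints the call site and a message to stderr.
#define TRACE(msg) \
    std::fprintf(stderr, "%s: %s\n", \
                 format_location(__FILE__, __LINE__, __func__).c_str(), (msg))

// Example function that logs where it was entered.
int parse_header(int value) {
    TRACE("entering");
    return value + 1;
}
```

Sprinkling TRACE calls along the suspected path shows the last location reached before the crash, which helps when the backtrace itself is unusable.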
TL;DR: extremely large local variable declarations in functions are allocated on the stack, which, on certain platform and compiler combinations, can overrun and corrupt the stack.
Just to add another potential cause to this issue. I was recently debugging a very similar issue. Running gdb with the application and core file would produce results such as:
That was extremely unhelpful and disappointing. After hours of scouring the internet, I found a forum that talked about how the particular compiler we were using (Intel compiler) had a smaller default stack size than other compilers, and that large local variables could overrun and corrupt the stack. Looking at our code, I found the culprit:
}
Bingo! I found MAX_BUFFER_SIZE was set to 10000000, thus a 10MB local variable was being allocated on the stack! After changing the implementation to use a shared_ptr and create the buffer dynamically, suddenly the program started working perfectly.
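The original code was not preserved here, so the following is only a sketch of the kind of change described, under assumptions: the function name fill_buffer is made up, and std::vector held in a shared_ptr stands in for whatever buffer type was actually used. Only the MAX_BUFFER_SIZE value comes from the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t MAX_BUFFER_SIZE = 10000000;  // value quoted in the answer

// Hypothetical function; illustrates moving a large buffer off the stack.
std::size_t fill_buffer(char value) {
    // char buffer[MAX_BUFFER_SIZE];  // risky: roughly 10 MB of stack space

    // Heap allocation: only the smart pointer itself lives on the stack.
    auto buffer = std::make_shared<std::vector<char>>(MAX_BUFFER_SIZE, value);
    assert(buffer->size() == MAX_BUFFER_SIZE);
    return buffer->size();
}
```

A plain std::vector<char> local would work equally well if the buffer does not need shared ownership.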
Try running with Valgrind memory debugger.
To confirm, was your executable compiled in release mode, i.e. without debug symbols? That could explain why there are '??'. Try recompiling with the -g switch, which 'includes debugging information and embedding it into the executable'. Other than that, I am out of ideas as to why you have '??'.
Not really. Sure you can dig around in memory and look at things. But without a stack trace you don't know how you got to where you are or what the parameter values were.
However, the very fact that your stack is corrupt tells you that you need to look for code that writes into the stack.
If you have a Unix system, "valgrind" is a good tool for finding some of these problems.
I assume that since you say "My executable contains symbol table" that you compiled and linked with -g, and that your binary wasn't stripped.
We can confirm this with:
strings -a your_binary | grep function_name_you_know_should_exist
Also try using pstack on the core and see if it does a better job of picking up the call stack. If it does, it sounds like your gdb is out of date compared to your gcc/g++ version.
Sounds like the glibc version on your machine is not identical to the one in use when the core file was produced on production. Get the files listed by "ldd ./appname", copy them onto your machine, then tell gdb where to look (gdb's "set sysroot" and "set solib-search-path" commands exist for this purpose).