"Unexplainable" core dump
I've seen many core dumps in my life, but this one has me stumped.
Context:
- multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
- the code that crashes is executed a lot
- running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
- the crashes happen on different machines (but the machines themselves are pretty identical)
- the crashes all look the same (same exact address, same call stack)
Here are the details of the crash:
Program terminated with signal 11, Segmentation fault.
#0 0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)
(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>: mov (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>: mov %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>: callq *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>: cmp %eax,%r12d
0x17bd9fc <_Z3Foov+348>: mov %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>: jge 0x17bd97e <_Z3Foov+222>
You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, which comes after the return from the call at 0x17bd9f6 to a virtual function.
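The mid-instruction claim can be checked from the listing alone. The sketch below is an illustration, not part of the original dump: the instruction start addresses are copied from the GDB output, while the byte encodings are an assumption reconstructed from the standard x86-64 encodings of `mov %eax,-0x80(%rbp)` and `jge rel32`. Under that assumption, decoding from 0x17bd9fd starts with the bytes 45 80 0f 8d, which is a REX prefix with the R and B bits set, a group-1 opcode with an 8-bit immediate, a ModRM byte selecting OR on (%r15), and the immediate 0x8d. That is consistent with the `rex.RB orb $0x8d,(%r15)` GDB printed.

```python
import struct

# Instruction start addresses copied from the GDB listing in the question.
STARTS = [0x17bd9f1, 0x17bd9f3, 0x17bd9f6, 0x17bd9f9, 0x17bd9fc, 0x17bd9ff]
CRASH_PC = 0x17bd9fd

def containing_instruction(pc, starts):
    """Return (start, offset) of the listed instruction whose bytes cover pc."""
    start = max(s for s in starts if s <= pc)
    return start, pc - start

# Assumed encodings (the raw bytes are not shown in the dump):
#   mov %eax,-0x80(%rbp) -> 89 45 80       (opcode 89 /r, ModRM 0x45, disp8 0x80)
#   jge rel32            -> 0f 8d <rel32>  (rel32 is relative to the next instruction)
MOV_ADDR, JGE_ADDR, JGE_TARGET = 0x17bd9fc, 0x17bd9ff, 0x17bd97e
mov_bytes = bytes([0x89, 0x45, 0x80])
jge_bytes = bytes([0x0F, 0x8D]) + struct.pack("<i", JGE_TARGET - (JGE_ADDR + 6))
code = mov_bytes + jge_bytes  # reconstructed bytes starting at MOV_ADDR

def bytes_at(pc, count):
    """Return `count` reconstructed bytes starting at address pc."""
    off = pc - MOV_ADDR
    return code[off:off + count]

start, offset = containing_instruction(CRASH_PC, STARTS)
print(hex(start), offset)                                    # 0x17bd9fc 1
print(" ".join(f"{b:02x}" for b in bytes_at(CRASH_PC, 4)))   # 45 80 0f 8d
```

The faulting address is one byte into the three-byte `mov`, so the bogus instruction is assembled from the tail of the `mov` and the head of the `jge`.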
When I examine the virtual table, I see that it is not corrupted in any way:
(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>: 0x2d3d7b0 <_ZN4Foo13GetEv>
and that it points to this trivial function (as expected by looking at the source):
(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
0x0000000002d3d7b0 <+0>: push %rbp
0x0000000002d3d7b1 <+1>: mov 0x70(%rdi),%eax
0x0000000002d3d7b4 <+4>: mov %rsp,%rbp
0x0000000002d3d7b7 <+7>: leaveq
0x0000000002d3d7b8 <+8>: retq
End of assembler dump.
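The vtable check above amounts to replaying the two loads that the call sequence performs. Here is a minimal sketch of that lookup, using only the words printed by the `x/a` commands; the `MEMORY` dictionary and the `resolve_virtual_call` helper are hypothetical names introduced for illustration.

```python
# Pointer-sized words observed in the core, keyed by address (from the x/a output).
MEMORY = {
    0x2ab094951f80: 0x3f8c550,  # first word of the object: the vtable pointer
    0x3f8c5c0: 0x2d3d7b0,       # vtable slot at offset 0x70
}
SYMBOLS = {0x2d3d7b0: "_ZN4Foo13GetEv"}  # symbol name as printed by GDB

def resolve_virtual_call(obj, slot_offset, memory):
    """Follow object -> vtable pointer -> slot, as the call sequence does."""
    # mov (%rbx),%eax : a 32-bit load, zero-extended into %rax; this only
    # works because the vtable address shown fits in 32 bits.
    vptr = memory[obj] & 0xFFFFFFFF
    # callq *0x70(%rax) : indirect call through the slot at vptr + 0x70
    return memory[vptr + slot_offset]

target = resolve_virtual_call(0x2ab094951f80, 0x70, MEMORY)
print(hex(target), SYMBOLS[target])
```

With 8-byte pointers, offset 0x70 is slot 14 counting from the address the object points at, and the lookup lands on the function disassembled above.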
Further, when I look at the return address that Foo1::Get() should have returned to:
(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>
I see that it points to the right instruction, so it's as if, during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
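To spell out the arithmetic: a near return in 64-bit mode pops 8 bytes into %rip and adds 8 to %rsp, which is why the consumed return address is still visible at `$rsp-8` in the core. The sketch below models only that step; the `retq` helper is a hypothetical illustration, and the values are the ones printed above.

```python
RETURN_SLOT = 0x2afa55602048      # address printed by `x/a $rsp-8`
STACK = {RETURN_SLOT: 0x17bd9f9}  # return address stored there, per the dump
OBSERVED_PC = 0x17bd9fd           # crash address from frame #0

def retq(rsp, stack):
    """Model a near return in 64-bit mode: pop 8 bytes into rip."""
    rip = stack[rsp]
    return rip, rsp + 8

# Before the return, rsp pointed at the slot; afterwards the slot is at rsp-8.
expected_rip, rsp_after = retq(RETURN_SLOT, STACK)
skew = OBSERVED_PC - expected_rip
print(hex(expected_rip), hex(rsp_after - 8), skew)
```

The stack pointer is consistent with a normal return, yet the observed program counter is 4 bytes past the address that was popped.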
Plausible explanations?
Comments (2)
So, unlikely as it may seem, we appear to have hit an actual bona-fide CPU bug.
AMD's revision guide for Family 10h processors, https://web.archive.org/web/20130228081435/http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf, lists it as erratum #721.
I've once seen an "illegal opcode" crash right in the middle of an instruction. I was working on a Linux port. Long story short, Linux subtracts from the instruction pointer in order to restart a syscall, and in my case this was happening twice (if two signals arrived at the same time).
So that's one possible culprit: the kernel fiddling with your instruction pointer. There may be some other cause in your case.
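As a rough illustration of that failure mode, here is a sketch of the restart arithmetic. It assumes x86-64, where the `syscall` instruction is two bytes long (0f 05); the port described above may have used a different architecture, and the address below is hypothetical.

```python
SYSCALL_LEN = 2  # assumption: x86-64 `syscall` encodes as the two bytes 0f 05

def restart_syscall(ip, times=1):
    """Rewind ip so the syscall instruction runs again after a signal."""
    return ip - SYSCALL_LEN * times

syscall_addr = 0x400100                  # hypothetical address of a syscall instruction
ip_after = syscall_addr + SYSCALL_LEN    # ip saved when the syscall was interrupted

correct = restart_syscall(ip_after)         # back at the syscall instruction
buggy = restart_syscall(ip_after, times=2)  # rewound twice: two bytes too far

print(hex(correct), hex(buggy))
```

A single rewind re-executes the syscall, while a double rewind resumes execution at whatever bytes precede it, which need not be an instruction boundary.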
Bear in mind that sometimes the processor will understand the data it's processing as an instruction, even when it's not supposed to be. So the processor may have executed the "instruction" at 0x17bd9fa and then moved on to 0x17bd9fd and then generated an illegal opcode exception. (I just made that number up, but experimenting with a disassembler can show you where the processor might have "entered" the instruction stream.)
Happy debugging!