"Inexplicable" core dump

Posted on 2024-10-11 22:39:55

I've seen many core dumps in my life, but this one has me stumped.

Context:

multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
the code that crashes is executed a lot
running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
the crashes happen on different machines (but the machines themselves are pretty identical)
the crashes all look the same (same exact address, same call stack)

Here are the details of the crash:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)

(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>:    mov    (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>:    mov    %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>:    callq  *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>:    cmp    %eax,%r12d
0x17bd9fc <_Z3Foov+348>:    mov    %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>:    jge    0x17bd97e <_Z3Foov+222>

You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, which comes right after the return from the call at 0x17bd9f6 to a virtual function.
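To see why gdb prints such an odd instruction at the faulting address, here is a minimal Python sketch that decodes the bytes by hand starting one byte into the `mov`. The byte values are an assumption: they are inferred from the disassembly above (the usual encodings of `mov %eax,-0x80(%rbp)` and the start of a near `jge`), not read from the core file. The helper names (`rex_flags`, `split_modrm`, `describe_misaligned`) are made up for illustration, and the decoder handles only this one case.

```python
# Bytes assumed to live at 0x17bd9fc..0x17bda00, inferred from the listing:
#   89 45 80   mov %eax,-0x80(%rbp)
#   0f 8d ...  jge rel32 (only the two opcode bytes matter here)
CODE_BASE = 0x17BD9FC
CODE = bytes([0x89, 0x45, 0x80, 0x0F, 0x8D])

# Opcode 0x80 selects its operation through the ModRM reg field (/0../7).
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def rex_flags(byte):
    """Return the set REX bits as a string such as 'RB', or None if not a REX prefix."""
    if byte & 0xF0 != 0x40:
        return None
    names = (("W", 8), ("R", 4), ("X", 2), ("B", 1))
    return "".join(name for name, bit in names if byte & bit)


def split_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return byte >> 6, (byte >> 3) & 7, byte & 7


def describe_misaligned(pc):
    """Decode the 4 bytes at pc as: REX prefix, opcode 0x80, ModRM, imm8."""
    off = pc - CODE_BASE
    rex_byte, opcode, modrm, imm = CODE[off:off + 4]
    flags = rex_flags(rex_byte)
    mod, reg, rm = split_modrm(modrm)
    # This sketch only covers the register-indirect form seen in the crash.
    assert flags is not None and opcode == 0x80 and mod == 0
    base = rm + (8 if "B" in flags else 0)
    assert base >= 8  # only the %r8..%r15 naming is handled here
    return f"rex.{flags} {GROUP1[reg]}b $0x{imm:x},(%r{base})"


print(describe_misaligned(0x17BD9FD))
```

Under these assumptions, the displacement bytes of the `mov` and the opcode bytes of the `jge` are reinterpreted as a single byte-sized `or` through `%r15`, which is consistent with what `x/i $pc` reports.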

When I examine the virtual table, I see that it is not corrupted in any way:

(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>:  0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected by looking at the source):

(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
   0x0000000002d3d7b0 <+0>: push   %rbp
   0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
   0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
   0x0000000002d3d7b7 <+7>: leaveq 
   0x0000000002d3d7b8 <+8>: retq   
End of assembler dump.

Further, when I look at the return address that Foo1::Get() should have returned to:

(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.

Plausible explanations?


一刻暧昧 2024-10-18 22:39:55

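So, unlikely as it may seem, we appear to have hit a genuine CPU bug.

https://web.archive.org/web/20130228081435/http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf has erratum #721:

721 Processor May Incorrectly Update Stack Pointer
Description
Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the stack pointer after a long series of push and/or near-call instructions, or a long series of pop and/or near-return instructions. The processor must be in 64-bit mode for this erratum to occur.
Potential Effect on System
The stack pointer value jumps by approximately 1024, in either the positive or negative direction. This incorrect stack pointer causes unpredictable program or system behavior, usually observed as a program exception or crash (for example, #GP or #UD).
Suggested Workaround
System software may set MSRC001_1029[0] = 1b.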
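
As a rough illustration of what "set MSRC001_1029[0] = 1b" means in practice, here is a sketch in Python that does a read-modify-write of that MSR through the Linux `msr` driver (`/dev/cpu/N/msr`). This path is an assumption, not something the erratum or the post prescribes: it requires root and the `msr` kernel module, it has to be repeated for every core, and in a real deployment the bit would more likely be set by the BIOS or the kernel. The function names are made up for illustration.

```python
import os
import struct

MSR_ADDR = 0xC0011029      # MSRC001_1029, as named in the erratum text
WORKAROUND_BIT = 1 << 0    # bit 0, per "MSRC001_1029[0] = 1b"


def with_workaround_bit(value):
    """Return the 64-bit MSR value with bit 0 set and all other bits unchanged."""
    return value | WORKAROUND_BIT


def apply_workaround(cpu):
    """Read-modify-write the MSR on one CPU via the Linux msr driver.

    Assumes /dev/cpu/<cpu>/msr exists (msr module loaded) and that the
    caller is root. The file offset selects the MSR number.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDWR)
    try:
        old = struct.unpack("<Q", os.pread(fd, 8, MSR_ADDR))[0]
        new = with_workaround_bit(old)
        if new != old:
            os.pwrite(fd, struct.pack("<Q", new), MSR_ADDR)
        return new
    finally:
        os.close(fd)
```

Calling `apply_workaround(n)` for each core index `n` would be the usage; keeping the bit manipulation in its own pure function makes it easy to check without touching hardware.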