"Unexplainable" core dump

Posted on 2024-10-11 22:39:55

I've seen many core dumps in my life, but this one has me stumped.

Context:

  • multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
  • the code that crashes is executed a lot
  • running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
  • the crashes happen on different machines (but the machines themselves are essentially identical)
  • the crashes all look the same (same exact address, same call stack)

Here are the details of the crash:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)

(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>:    mov    (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>:    mov    %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>:    callq  *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>:    cmp    %eax,%r12d
0x17bd9fc <_Z3Foov+348>:    mov    %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>:    jge    0x17bd97e <_Z3Foov+222>

You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, immediately after the return from the call at 0x17bd9f6 to a virtual function.
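
To make the "middle of an instruction" observation concrete, here is a minimal Python sketch (not part of the original gdb session) that takes the instruction start addresses from the listing above and finds which one covers the faulting address. The helper name `containing_instruction` is made up for illustration.

```python
# Instruction start addresses copied from the `x/6i $pc-12` listing.
starts = [0x17bd9f1, 0x17bd9f3, 0x17bd9f6, 0x17bd9f9, 0x17bd9fc, 0x17bd9ff]
pc = 0x17bd9fd  # faulting address reported in frame #0


def containing_instruction(addr, starts):
    """Return (start, offset) for the listed instruction that covers addr."""
    start = max(s for s in starts if s <= addr)
    return start, addr - start


start, offset = containing_instruction(pc, starts)
# A non-zero offset means the PC is not on an instruction boundary.
print(hex(start), offset)  # 0x17bd9fc 1
```

An offset of 1 says the faulting PC sits one byte into the `mov %eax,-0x80(%rbp)` at 0x17bd9fc.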

When I examine the virtual table, I see that it is not corrupted in any way:

(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>:  0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected by looking at the source):

(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
   0x0000000002d3d7b0 <+0>: push   %rbp
   0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
   0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
   0x0000000002d3d7b7 <+7>: leaveq 
   0x0000000002d3d7b8 <+8>: retq   
End of assembler dump.
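
The vtable check above amounts to two dependent loads. As a rough sketch, the following Python model replays them using only the words shown in the `x/a` output; the `memory` dictionary and the helper `resolve_virtual_call` are illustrative names, not anything from the real program.

```python
# Memory words copied from the gdb `x/a` output (address -> 8-byte value).
memory = {
    0x2AB094951F80: 0x3F8C550,  # object at %rbx: first word is the vtable pointer
    0x3F8C5C0: 0x2D3D7B0,       # vtable slot at offset 0x70
}


def resolve_virtual_call(obj_addr, slot_offset, memory):
    """Model the load through %rbx followed by `callq *0x70(%rax)`."""
    vptr = memory[obj_addr]             # load the vtable pointer from the object
    return memory[vptr + slot_offset]   # load the function pointer from the slot


target = resolve_virtual_call(0x2AB094951F80, 0x70, memory)
print(hex(target))  # 0x2d3d7b0
```

The result matches the address gdb symbolizes as `_ZN4Foo13GetEv`, which is consistent with the claim that the dispatch itself went to the intended function.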

Further, when I look at the return address that Foo1::Get() should have returned to:

(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
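
For readers wondering why the return address is at `$rsp-8`: `retq` loads the return address from the top of the stack and then adds 8 to `%rsp`, so after the return the consumed slot sits just below the stack pointer. A small Python model of that, using the values from the dump (the `retq` function here is only a toy model):

```python
def retq(rsp, stack):
    """Toy model of x86-64 `retq`: load return address at rsp, then rsp += 8."""
    rip = stack[rsp]
    return rip, rsp + 8


# The slot gdb printed for `x/a $rsp-8`.
stack = {0x2AFA55602048: 0x17BD9F9}

expected_rip, rsp_after = retq(0x2AFA55602048, stack)
observed_pc = 0x17BD9FD  # faulting address from frame #0

print(hex(rsp_after))              # 0x2afa55602050
print(observed_pc - expected_rip)  # 4
```

The 4-byte difference between the expected and observed PC is the skew described above.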

Plausible explanations?

Comments (2)

一刻暧昧 2024-10-18 22:39:55

So, unlikely as it may seem, we appear to have hit an actual bona-fide CPU bug.

https://web.archive.org/web/20130228081435/http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf has erratum #721:

721 Processor May Incorrectly Update Stack Pointer

Description

Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the stack pointer after a long series of push and/or near-call instructions, or a long series of pop and/or near-return instructions. The processor must be in 64-bit mode for this erratum to occur.

Potential Effect on System

The stack pointer value jumps by a value of approximately 1024, either in the positive or negative direction. This incorrect stack pointer causes unpredictable program or system behavior, usually observed as a program exception or crash (for example, a #GP or #UD).

Suggested Workaround

System software may set MSRC001_1029[0] = 1b.
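
As a rough illustration of what "set MSRC001_1029[0] = 1b" means in practice, here is a Python sketch that does a read-modify-write of that MSR through the Linux `msr` driver (`/dev/cpu/N/msr`). This is an assumption-heavy sketch rather than a recommended procedure: it assumes the `msr` kernel module is loaded and the script runs as root, the function names are made up, and in a real deployment this bit would normally be set by the BIOS or the kernel. Nothing is written unless `apply_workaround` is called explicitly.

```python
import os
import struct

MSR_C001_1029 = 0xC0011029  # the register named in the erratum workaround
BIT0 = 1 << 0


def with_workaround_bit(value):
    """Return the MSR value with bit 0 set and all other bits preserved."""
    return value | BIT0


def read_msr(cpu, reg):
    """Read a 64-bit MSR; the msr driver uses the file offset as the MSR number."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu, reg, value):
    """Write a 64-bit MSR through the same interface."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


def apply_workaround(cpus):
    """Set bit 0 of MSR C001_1029 on each listed CPU (requires root)."""
    for cpu in cpus:
        old = read_msr(cpu, MSR_C001_1029)
        write_msr(cpu, MSR_C001_1029, with_workaround_bit(old))
```

MSRs are per-core, so the bit would have to be set on every core, for example with `apply_workaround(range(os.cpu_count()))`.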

拥有 2024-10-18 22:39:55

I once saw an "illegal opcode" crash right in the middle of an instruction. I was working on a Linux port. Long story short, Linux subtracts from the instruction pointer in order to restart a syscall, and in my case this was happening twice (if two signals arrived at the same time).

So that's one possible culprit: the kernel fiddling with your instruction pointer. There may be some other cause in your case.
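
To illustrate the double-rewind failure mode, here is a toy Python model. It assumes x86-64, where the `syscall` instruction is 2 bytes long (`0f 05`) and a restart rewinds the instruction pointer by that length; the port I was working on is not named here, and the addresses and instruction layout below are entirely hypothetical.

```python
SYSCALL_LEN = 2  # x86-64 `syscall` is encoded as 0f 05


def rewind_for_restart(ip, times=1):
    """Rewind the saved instruction pointer by one syscall length per restart."""
    return ip - SYSCALL_LEN * times


# Hypothetical layout: a 5-byte instruction at 0x1000, then `syscall` at 0x1005.
instruction_starts = {0x1000, 0x1005}
ip_after_syscall = 0x1007

once = rewind_for_restart(ip_after_syscall)
twice = rewind_for_restart(ip_after_syscall, times=2)

print(hex(once), once in instruction_starts)    # 0x1005 True
print(hex(twice), twice in instruction_starts)  # 0x1003 False
```

A single rewind lands back on the `syscall`; a second one lands inside the preceding instruction, which is how execution can resume mid-instruction.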

Bear in mind that sometimes the processor will understand the data it's processing as an instruction, even when it's not supposed to be. So the processor may have executed the "instruction" at 0x17bd9fa and then moved on to 0x17bd9fd and then generated an illegal opcode exception. (I just made that number up, but experimenting with a disassembler can show you where the processor might have "entered" the instruction stream.)
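
As an example of that kind of experiment, the sketch below hand-decodes the bytes at the faulting address from the question. The byte values are an assumption: they are the usual encodings of the mnemonics in the listing, not bytes read from the core. The decoder is deliberately tiny and only recognizes a REX prefix followed by opcode `0x80` (the "group 1" byte operations with an 8-bit immediate).

```python
# Assumed encodings, inferred from the mnemonics in the listing.
code_base = 0x17BD9F9
code = bytes([
    0x41, 0x39, 0xC4,                    # 0x17bd9f9: cmp %eax,%r12d
    0x89, 0x45, 0x80,                    # 0x17bd9fc: mov %eax,-0x80(%rbp)
    0x0F, 0x8D, 0x79, 0xFF, 0xFF, 0xFF,  # 0x17bd9ff: jge 0x17bd97e
])

GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def decode_at(addr):
    """Decode REX + 0x80 /r ib; return (operation, mod, base register, immediate)."""
    b = code[addr - code_base:]
    rex, opcode, modrm, imm = b[0], b[1], b[2], b[3]
    if not (0x40 <= rex <= 0x4F and opcode == 0x80):
        raise ValueError("only REX + opcode 0x80 is handled in this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if rex & 0x1:  # REX.B extends the r/m field to r8..r15
        rm += 8
    return GROUP1[reg], mod, rm, imm


print(decode_at(0x17BD9FD))  # ('or', 0, 15, 141)
```

Starting one byte into the `mov`, the stream `45 80 0f 8d` decodes as an 8-bit `or` of `0x8d` into memory at `(%r15)`, which agrees with the `rex.RB orb $0x8d,(%r15)` that gdb printed at the faulting PC. A write through whatever happens to be in `%r15` would plausibly explain a segmentation fault rather than an illegal opcode.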

Happy debugging!
