Is this a "shouldn't happen" case? Crash bug on AMD Fusion CPUs?

Posted 2024-11-28 12:42:29


My company has started having a number of customers call in because our program is crashing with an access violation on their systems.

The crash happens in SQLite 3.6.23.1, which we ship as part of our application. (We ship a custom build, in order to use the same VC++ libraries as the rest of the app, but it's the stock SQLite code.)

The crash happens when pcache1Fetch executes call 00000000, as shown by the WinDbg callstack:

0b50e5c4 719f9fad 06fe35f0 00000000 000079ad 0x0
0b50e5d8 719f9216 058d1628 000079ad 00000001 SQLite_Interop!pcache1Fetch+0x2d [sqlite3.c @ 31530]
0b50e5f4 719fd581 000079ad 00000001 0b50e63c SQLite_Interop!sqlite3PcacheFetch+0x76 [sqlite3.c @ 30651]
0b50e61c 719fff0c 000079ad 0b50e63c 00000000 SQLite_Interop!sqlite3PagerAcquire+0x51 [sqlite3.c @ 36026]
0b50e644 71a029ba 0b50e65c 00000001 00000e00 SQLite_Interop!getAndInitPage+0x1c [sqlite3.c @ 40158]
0b50e65c 71a030f8 000079ad 0aecd680 071ce030 SQLite_Interop!moveToChild+0x2a [sqlite3.c @ 42555]
0b50e690 71a0c637 0aecd6f0 00000000 0001edbe SQLite_Interop!sqlite3BtreeMovetoUnpacked+0x378 [sqlite3.c @ 43016]
0b50e6b8 71a109ed 06fd53e0 00000000 071ce030 SQLite_Interop!sqlite3VdbeCursorMoveto+0x27 [sqlite3.c @ 50624]
0b50e824 71a0db76 071ce030 0b50e880 071ce030 SQLite_Interop!sqlite3VdbeExec+0x14fd [sqlite3.c @ 55409]
0b50e850 71a0dcb5 0b50e880 21f9b4c0 00402540 SQLite_Interop!sqlite3Step+0x116 [sqlite3.c @ 51744]
0b50e870 00629a30 071ce030 76897ff4 70f24970 SQLite_Interop!sqlite3_step+0x75 [sqlite3.c @ 51806]

The relevant line of C code is:

if( createFlag==1 ) sqlite3BeginBenignMalloc();

The compiler inlines sqlite3BeginBenignMalloc, which is defined as:

typedef struct BenignMallocHooks BenignMallocHooks;
static SQLITE_WSD struct BenignMallocHooks {
  void (*xBenignBegin)(void);
  void (*xBenignEnd)(void);
} sqlite3Hooks = { 0, 0 };

# define wsdHooksInit
# define wsdHooks sqlite3Hooks

SQLITE_PRIVATE void sqlite3BeginBenignMalloc(void){
  wsdHooksInit;
  if( wsdHooks.xBenignBegin ){
    wsdHooks.xBenignBegin();
  }
}

And the assembly for this is:

719f9f99    mov     esi,dword ptr [esp+1Ch]
719f9f9d    cmp     esi,1
719f9fa0    jne     SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fa2    mov     eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]
719f9fa7    test    eax,eax
719f9fa9    je      SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fab    call    eax ; *** CRASH HERE ***
719f9fad    mov     ebx,dword ptr [esp+14h]

The registers are:

eax=00000000 ebx=00000001 ecx=000013f0 edx=fffffffe esi=00000001 edi=00000000
eip=00000000 esp=0b50e5c8 ebp=000079ad iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202

If eax is 0 (which it is), test eax, eax should set the zero flag, but the flags show it clear (nz, efl=00010202). Because the zero flag isn't set, je doesn't jump, and the app then crashes trying to execute call eax (00000000).
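The flag state can be read straight from the efl value in the register dump. A minimal sketch, assuming only the architectural EFLAGS bit positions (ZF is bit 6); the function names are illustrative and not part of the application or of SQLite:

```c
#include <stdint.h>

/* Architectural EFLAGS bit positions on x86. */
#define EFLAGS_ZF (1u << 6)   /* zero flag */
#define EFLAGS_IF (1u << 9)   /* interrupt enable flag */
#define EFLAGS_RF (1u << 16)  /* resume flag */

/* Returns 1 if the zero flag is set in a saved EFLAGS value, else 0. */
int zero_flag_set(uint32_t eflags) {
    return (eflags & EFLAGS_ZF) != 0;
}

/* What "test r,r" architecturally leaves in ZF: set exactly when the
   register value is zero. */
int expected_zf_after_test(uint32_t reg_value) {
    return reg_value == 0;
}
```

For the dump above, zero_flag_set(0x00010202) returns 0 while expected_zf_after_test(0) returns 1, which is the contradiction the question is about.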

Update: eax should always be 0 here because sqlite3Hooks.xBenignBegin is not set in our build of the code. I could rebuild SQLite with SQLITE_OMIT_BUILTIN_TEST defined, which would turn on #define sqlite3BeginBenignMalloc() in the code and omit this code path entirely. That may solve the issue, but it doesn't feel like a "real" fix; what would stop it happening in some other code path?
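To make the trade-off concrete, here is a minimal sketch of the two build shapes. Every name in it (hooks, begin_benign_malloc, fetch_sketch, install_counting_hook) is a hypothetical stand-in, not SQLite's actual code; only the SQLITE_OMIT_BUILTIN_TEST macro name comes from the text above.

```c
#include <stddef.h>

/* Hypothetical stand-in for the hooks structure quoted above. */
struct hooks_sketch {
    void (*xBenignBegin)(void);
    void (*xBenignEnd)(void);
};
static struct hooks_sketch hooks = { NULL, NULL };

static int hook_calls = 0;                  /* counts hook invocations */
static void counting_hook(void) { hook_calls++; }

/* Illustration only: installs a hook so the guarded call has a target. */
void install_counting_hook(void) { hooks.xBenignBegin = counting_hook; }

#ifndef SQLITE_OMIT_BUILTIN_TEST
/* Default build: load the pointer, test it, call through it if non-null. */
static void begin_benign_malloc(void) {
    if (hooks.xBenignBegin) {
        hooks.xBenignBegin();
    }
}
#else
/* Macro defined: the call site expands to an empty statement. */
#define begin_benign_malloc()
#endif

/* Mirrors the shape of the call site; returns how often the hook ran. */
int fetch_sketch(int createFlag) {
    hook_calls = 0;
    if (createFlag == 1) begin_benign_malloc();
    return hook_calls;
}
```

In the default build, fetch_sketch(1) calls nothing while the pointer is null, which is the behavior the real code should have shown. Building with -DSQLITE_OMIT_BUILTIN_TEST removes the load, test, and indirect call from this one site only, which is why it feels like a workaround, not a fix.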

So far the common factor is that all customers are running "Windows 7 Home Premium 64-bit (6.1, Build 7601) Service Pack 1" and have one of the following CPUs (according to DxDiag):

AMD A6-3400M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.4GHz
AMD A8-3500M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.5GHz
AMD A8-3850 APU with Radeon(tm) HD Graphics (4 CPUs), ~2.9GHz

According to Wikipedia's AMD Fusion article, these are all "Llano" model AMD Fusion chips based on the K10 core and were released in June 2011, which is when we first started getting reports.

The most common customer system is the Toshiba Satellite L775D, but we also have crash reports from HP Pavilion dv6 & dv7 and Gateway systems.

Could this crash be caused by a CPU error (see Errata for AMD Family 12h Processors), or is there some other possible explanation that I'm overlooking? (According to Raymond, it could be overclocking, but it's odd that just this specific CPU model is affected, if so.)

Honestly, it doesn't seem possible that it's really a CPU or OS error, because the customers aren't getting bluescreens or crashes in other applications. There must be some other, more likely, explanation--but what?

Update 15 August: I've acquired a Toshiba L745D notebook with an AMD A6-3400M processor and can reproduce the crash consistently when running the program. The crash is always on the same instruction; .time reports anywhere from 1m30s to 7m of user time before the crash. One fact (that may be pertinent to the issue) that I neglected to mention in the original post is that the application is multi-threaded and has both high CPU and I/O usage. The application spawns four worker threads by default and posts 80+% CPU usage (there is some blocking for I/O as well as for mutexes in the SQLite code) until it crashes. I modified the application to only use two threads, and it still crashed (although it took longer to happen). I'm now running a test with just one thread, and it hasn't crashed yet.

Note also that it doesn't appear to be purely a CPU load problem; I can run Prime95 without errors on the system and it will boost the CPU temperature to >70°C, while my application barely gets the temperature above 50°C while it's running.

Update 16 August: Perturbing the instructions slightly makes the problem "go away". For example, replacing the memory load (mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]) with xor eax, eax prevents the crash. Modifying the original C code to add an extra check to the if( createFlag==1 ) statement changes the relative offsets of various jumps in the compiled code (as well as the location of the test eax, eax and call eax statements) and also seems to prevent the problem.

The strangest result I've found so far is that changing the jne at 719f9fa0 to two nop instructions (so that control always falls through to the test eax, eax instruction, no matter what the value of createFlag/esi is) allows the program to run without crashing.
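The jump sizes and offsets involved in these patches can be checked against the addresses in the listing. A minimal sketch, assuming the standard two-byte short forms (opcode byte plus rel8 displacement) for jne and je; the disassembly does not show the instruction bytes, but the two-byte gaps between the addresses are consistent with that assumption. The function names are illustrative.

```c
#include <stdint.h>

/* Length of a short conditional jump: one opcode byte + one rel8 byte. */
#define SHORT_JCC_LEN 2u

/* Target of a short jump: the sign-extended displacement is added to the
   address of the NEXT instruction, not to the jump's own address. */
uint32_t short_jump_target(uint32_t insn_addr, int8_t rel8) {
    return insn_addr + SHORT_JCC_LEN + (uint32_t)(int32_t)rel8;
}

/* Displacement a short jump at insn_addr needs to reach target.
   The caller must ensure the distance fits in a signed 8-bit value. */
int8_t short_jump_rel8(uint32_t insn_addr, uint32_t target) {
    return (int8_t)(target - (insn_addr + SHORT_JCC_LEN));
}
```

With these, the jne at 719f9fa0 needs a displacement of 0x0B to reach 719f9fad, and the je at 719f9fa9 needs 0x02. Because the jne occupies two bytes, two single-byte nop instructions replace it exactly without moving any later instruction, whereas the C-level change shifts the addresses of everything after it.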


幼儿园老大 2024-12-05 12:42:30

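I spoke to an AMD engineer at the Microsoft Build conference about this error and showed him my repro. He emailed me this morning:

We have investigated and found that this is due to a known erratum in the Llano APU family. It can be fixed via a BIOS update, depending on the OEM. If possible, please recommend it to your customers (even though you have a workaround).
In case you are interested, the erratum is 665 in the Family 12h Revision Guide (see page 45):
http://support.amd.com/TechDocs/44739_12h_Rev_Gd.pdf#page=45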

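Here is the description of that erratum:

665 Integer Divide Instruction May Cause Unpredictable Behavior

Description

Under a highly specific and detailed set of internal timing conditions, the processor core may abort a speculative DIV or IDIV integer divide instruction (due to the speculative execution being redirected, for example due to a mispredicted branch) but may hang or prematurely complete the first instruction of the non-speculative path.

Potential Effect on System

Unpredictable system behavior, usually resulting in a system hang.

Suggested Workaround

BIOS should set MSRC001_1029[31].

This workaround alters the DIV/IDIV instruction latency specified in the Software Optimization Guide for AMD Family 10h and 12h Processors, order# 40546. With this workaround applied, the DIV/IDIV latency for AMD Family 12h Processors is similar to the DIV/IDIV latency for AMD Family 10h Processors.

Fix Planned

No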
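The notation MSRC001_1029[31] refers to bit 31 of the model-specific register at address C001_1029h. A minimal sketch of checking that bit, assuming the 64-bit register value has already been obtained by some privileged means (rdmsr only executes in ring 0, so a user-mode program cannot read it directly); the macro and function names are illustrative:

```c
#include <stdint.h>

/* Register address and bit number named in the erratum's workaround. */
#define ERRATUM_665_MSR_ADDR 0xC0011029u
#define ERRATUM_665_BIT      31

/* Returns 1 if the workaround bit is set in a previously read MSR value.
   Reading the MSR itself is outside the scope of this sketch. */
int erratum_665_workaround_set(uint64_t msr_value) {
    return (int)((msr_value >> ERRATUM_665_BIT) & 1u);
}
```

A value with bit 31 set means the BIOS has applied the workaround; a clear bit on an affected Llano system suggests the BIOS update is still needed.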


ら栖息 2024-12-05 12:42:30

I'm a bit concerned that the code generated for if (wsdHooks.xBenignBegin) isn't very general. It assumes the only true value is 1 whereas it should really be testing for any nonzero value. Still, MSVC is sometimes baffling that way. It is probably nothing. Never mind: these instructions are for C code not presented.

Given that the eflag Z bit is clear and EAX is zero, the code did not get here by executing the instruction

719f9fa7    test    eax,eax

There must be a jump from somewhere else to the instruction following (719f9fa9 je SQLite_Interop!pcache1Fetch+0x2d) or even the call instruction itself.

Another complication is that with the x86 family, it is common for an invalid jump target (like the second byte of the JE instruction) to execute unperturbed (no faults) for quite a few instructions, often eventually getting back on the proper instruction alignment. Said another way, you may not be looking for a jump to the beginning of any of these instructions: a jump might be in the midst of their bytes, resulting in executing unremarkable operations like add [al+ebp],al which tend not to be noticed.

I predict that a breakpoint at the test instruction will not be hit on the path that leads to the exception. The only way to find such causes is either to be very lucky, or to suspect everything and prove each suspect innocent one by one.
