Is this one of those "should never happen" things? An AMD Fusion CPU crash bug?
My company has started having a number of customers call in because our program is crashing with an access violation on their systems.
The crash happens in SQLite 3.6.23.1, which we ship as part of our application. (We ship a custom build, in order to use the same VC++ libraries as the rest of the app, but it's the stock SQLite code.)
The crash happens when pcache1Fetch executes call 00000000, as shown by the WinDbg callstack:
0b50e5c4 719f9fad 06fe35f0 00000000 000079ad 0x0
0b50e5d8 719f9216 058d1628 000079ad 00000001 SQLite_Interop!pcache1Fetch+0x2d [sqlite3.c @ 31530]
0b50e5f4 719fd581 000079ad 00000001 0b50e63c SQLite_Interop!sqlite3PcacheFetch+0x76 [sqlite3.c @ 30651]
0b50e61c 719fff0c 000079ad 0b50e63c 00000000 SQLite_Interop!sqlite3PagerAcquire+0x51 [sqlite3.c @ 36026]
0b50e644 71a029ba 0b50e65c 00000001 00000e00 SQLite_Interop!getAndInitPage+0x1c [sqlite3.c @ 40158]
0b50e65c 71a030f8 000079ad 0aecd680 071ce030 SQLite_Interop!moveToChild+0x2a [sqlite3.c @ 42555]
0b50e690 71a0c637 0aecd6f0 00000000 0001edbe SQLite_Interop!sqlite3BtreeMovetoUnpacked+0x378 [sqlite3.c @ 43016]
0b50e6b8 71a109ed 06fd53e0 00000000 071ce030 SQLite_Interop!sqlite3VdbeCursorMoveto+0x27 [sqlite3.c @ 50624]
0b50e824 71a0db76 071ce030 0b50e880 071ce030 SQLite_Interop!sqlite3VdbeExec+0x14fd [sqlite3.c @ 55409]
0b50e850 71a0dcb5 0b50e880 21f9b4c0 00402540 SQLite_Interop!sqlite3Step+0x116 [sqlite3.c @ 51744]
0b50e870 00629a30 071ce030 76897ff4 70f24970 SQLite_Interop!sqlite3_step+0x75 [sqlite3.c @ 51806]
The relevant line of C code is:
if( createFlag==1 ) sqlite3BeginBenignMalloc();
The compiler inlines sqlite3BeginBenignMalloc, which is defined as:
typedef struct BenignMallocHooks BenignMallocHooks;
static SQLITE_WSD struct BenignMallocHooks {
  void (*xBenignBegin)(void);
  void (*xBenignEnd)(void);
} sqlite3Hooks = { 0, 0 };
# define wsdHooksInit
# define wsdHooks sqlite3Hooks
SQLITE_PRIVATE void sqlite3BeginBenignMalloc(void){
  wsdHooksInit;
  if( wsdHooks.xBenignBegin ){
    wsdHooks.xBenignBegin();
  }
}
And the assembly for this is:
719f9f99 mov esi,dword ptr [esp+1Ch]
719f9f9d cmp esi,1
719f9fa0 jne SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fa2 mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]
719f9fa7 test eax,eax
719f9fa9 je SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fab call eax ; *** CRASH HERE ***
719f9fad mov ebx,dword ptr [esp+14h]
The registers are:
eax=00000000 ebx=00000001 ecx=000013f0 edx=fffffffe esi=00000001 edi=00000000
eip=00000000 esp=0b50e5c8 ebp=000079ad iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
If eax is 0 (which it is), test eax, eax should set the zero flag, but the register dump shows it clear (nz). Because the zero flag isn't set, je doesn't jump, and the app then crashes trying to execute call eax (00000000).
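To make the flag reading concrete, here is a minimal sketch that decodes the efl value from the register dump. The bit positions are the architectural x86 EFLAGS layout; the function names are made up for illustration and are not part of SQLite or WinDbg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit masks in the x86 EFLAGS register. */
#define EFLAGS_CF (1u << 0)   /* carry flag            (nc/cy in WinDbg) */
#define EFLAGS_ZF (1u << 6)   /* zero flag             (nz/zr in WinDbg) */
#define EFLAGS_SF (1u << 7)   /* sign flag             (pl/ng in WinDbg) */
#define EFLAGS_IF (1u << 9)   /* interrupt enable flag (di/ei in WinDbg) */

/* Hypothetical helper: true if every bit of `mask` is set in `eflags`. */
bool flag_is_set(uint32_t eflags, uint32_t mask){
  return (eflags & mask) == mask;
}

/* What `test eax,eax` is supposed to leave in ZF: set iff eax AND eax is 0. */
bool expected_zf_after_test(uint32_t eax){
  return (eax & eax) == 0;
}

/* Checks the values from the crash dump: eax=00000000, efl=00010202. */
void crash_dump_flags_check(void){
  const uint32_t efl = 0x00010202u;
  const uint32_t eax = 0x00000000u;
  assert(expected_zf_after_test(eax));     /* ZF should be set ...          */
  assert(!flag_is_set(efl, EFLAGS_ZF));    /* ... but the dump shows "nz"   */
  assert(flag_is_set(efl, EFLAGS_IF));     /* matches "ei" in the dump      */
  assert(!flag_is_set(efl, EFLAGS_CF));    /* matches "nc" in the dump      */
}
```

Calling crash_dump_flags_check() passes, which is the contradiction described above: with eax equal to zero, a test eax, eax that actually executed would have left bit 6 set.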
Update: eax should always be 0 here because sqlite3Hooks.xBenignBegin is not set in our build of the code. I could rebuild SQLite with SQLITE_OMIT_BUILTIN_TEST defined, which would turn on #define sqlite3BeginBenignMalloc() in the code and omit this code path entirely. That may solve the issue, but it doesn't feel like a "real" fix; what would stop it happening in some other code path?
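The effect of that compile-time switch can be sketched in isolation. The names below (DemoHooks, demo_begin_benign, DEMO_OMIT_BUILTIN_TEST) are hypothetical stand-ins chosen for this example, not SQLite's own identifiers; the sketch only mirrors the pattern of a null-checked hook that a macro can compile away.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the benign-malloc hook table. */
typedef struct DemoHooks {
  void (*xBegin)(void);
  void (*xEnd)(void);
} DemoHooks;

DemoHooks demo_hooks = { NULL, NULL };

#ifdef DEMO_OMIT_BUILTIN_TEST
/* Hook support compiled out: the call site expands to nothing, so no
** load / test / indirect call sequence is generated at all. */
# define demo_begin_benign() ((void)0)
#else
/* Hook support compiled in: load the pointer, test it, call it if set. */
void demo_begin_benign(void){
  if( demo_hooks.xBegin ){
    demo_hooks.xBegin();
  }
}
#endif

/* Test hook used only by the self-check below. */
int demo_begin_calls = 0;
void demo_count_begin(void){ demo_begin_calls++; }

void demo_hooks_self_check(void){
  demo_hooks.xBegin = NULL;
  demo_begin_calls = 0;
  demo_begin_benign();              /* null hook: must be a no-op */
  assert(demo_begin_calls == 0);
#ifndef DEMO_OMIT_BUILTIN_TEST
  demo_hooks.xBegin = demo_count_begin;
  demo_begin_benign();              /* installed hook: called exactly once */
  assert(demo_begin_calls == 1);
  demo_hooks.xBegin = NULL;
#endif
}
```

With the macro defined, the null check and indirect call disappear from every call site, which is why it would sidestep this particular crash without explaining it.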
So far the common factor is that all customers are running "Windows 7 Home Premium 64-bit (6.1, Build 7601) Service Pack 1" and have one of the following CPUs (according to DxDiag):
- AMD A6-3400M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.4GHz
- AMD A8-3500M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.5GHz
- AMD A8-3850 APU with Radeon(tm) HD Graphics (4 CPUs), ~2.9GHz
According to Wikipedia's AMD Fusion article, these are all "Llano" model AMD Fusion chips based on the K10 core and were released in June 2011, which is when we first started getting reports.
The most common customer system is the Toshiba Satellite L775D, but we also have crash reports from HP Pavilion dv6 & dv7 and Gateway systems.
Could this crash be caused by a CPU error (see Errata for AMD Family 12h Processors), or is there some other possible explanation that I'm overlooking? (According to Raymond, it could be overclocking, but it's odd that just this specific CPU model is affected, if so.)
Honestly, it doesn't seem possible that it's really a CPU or OS error, because the customers aren't getting bluescreens or crashes in other applications. There must be some other, more likely, explanation--but what?
Update 15 August: I've acquired a Toshiba L745D notebook with an AMD A6-3400M processor and can reproduce the crash consistently when running the program. The crash is always on the same instruction; .time reports anywhere from 1m30s to 7m of user time before the crash. One fact (that may be pertinent to the issue) that I neglected to mention in the original post is that the application is multi-threaded and has both high CPU and I/O usage. The application spawns four worker threads by default and posts 80+% CPU usage (there is some blocking for I/O as well as for mutexes in the SQLite code) until it crashes. I modified the application to only use two threads, and it still crashed (although it took longer to happen). I'm now running a test with just one thread, and it hasn't crashed yet.
Note also that it doesn't appear to be purely a CPU load problem; I can run Prime95 without errors on the system and it will boost the CPU temperature to >70°C, while my application barely gets the temperature above 50°C while it's running.
Update 16 August: Perturbing the instructions slightly makes the problem "go away". For example, replacing the memory load (mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]) with xor eax, eax prevents the crash. Modifying the original C code to add an extra check to the if( createFlag==1 ) statement changes the relative offsets of various jumps in the compiled code (as well as the location of the test eax, eax and call eax statements) and also seems to prevent the problem.
The strangest result I've found so far is that changing the jne at 719f9fa0 to two nop instructions (so that control always falls through to the test eax, eax instruction, no matter what the value of createFlag/esi is) allows the program to run without crashing.
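For readers who want to follow the patch at the byte level, here is a rough sketch. The byte values are an assumption: they were reconstructed from the instruction addresses in the listing using the standard x86 encodings (75 rel8 for a short jne, 74 rel8 for a short je, 90 for nop), not copied from the actual binary, and the helper names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_JNE_SHORT = 0x75, OP_JE_SHORT = 0x74, OP_NOP = 0x90 };

/* Target of a 2-byte short conditional jump located at `addr`:
** address of the next instruction plus the sign-extended 8-bit offset. */
uint32_t short_jump_target(uint32_t addr, uint8_t rel8){
  return addr + 2u + (uint32_t)(int32_t)(int8_t)rel8;
}

/* Overwrite the 2-byte short jump at code[off] with two NOPs.
** Returns 0 on success, -1 if the range is invalid or the opcode at
** `off` is not the one the caller expected. */
int nop_out_short_jump(uint8_t *code, size_t len, size_t off, uint8_t expected_opcode){
  if( off > len || len - off < 2 ) return -1;
  if( code[off] != expected_opcode ) return -1;
  code[off] = OP_NOP;
  code[off+1] = OP_NOP;
  return 0;
}

void patch_self_check(void){
  /* Assumed bytes for 719f9fa0..719f9fac, derived from the listing. */
  uint8_t code[] = {
    0x75, 0x0B,                    /* 719f9fa0 jne  719f9fad          */
    0xA1, 0x3C, 0x81, 0xA7, 0x71,  /* 719f9fa2 mov  eax,[71a7813c]    */
    0x85, 0xC0,                    /* 719f9fa7 test eax,eax           */
    0x74, 0x02,                    /* 719f9fa9 je   719f9fad          */
    0xFF, 0xD0                     /* 719f9fab call eax               */
  };
  /* Both branches should land on the instruction after the call. */
  assert(short_jump_target(0x719f9fa0u, code[1])  == 0x719f9fadu);
  assert(short_jump_target(0x719f9fa9u, code[10]) == 0x719f9fadu);
  /* The experiment from the update: replace the jne with two NOPs. */
  assert(nop_out_short_jump(code, sizeof code, 0, OP_JNE_SHORT) == 0);
  assert(code[0] == OP_NOP && code[1] == OP_NOP);
  /* The je further down is left untouched. */
  assert(code[9] == OP_JE_SHORT);
}
```

The opcode check in nop_out_short_jump is a small safety choice: it refuses to patch if the bytes at the offset are not the jump that was expected, which guards against patching the wrong build.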
Comments (3)
I spoke to an AMD engineer at the Microsoft Build conference about this error, and showed him my repro. He emailed me this morning:
Here's the description of that erratum:
665 Integer Divide Instruction May Cause Unpredictable Behavior
Description
Under a highly specific and detailed set of internal timing conditions, the processor core may abort a speculative DIV or IDIV integer divide instruction (due to the speculative execution being redirected, for example due to a mispredicted branch) but may hang or prematurely complete the first instruction of the non-speculative path.
Potential Effect on System
Unpredictable system behavior, usually resulting in a system hang.
Suggested Workaround
BIOS should set MSRC001_1029[31].
This workaround alters the DIV/IDIV instruction latency specified in the Software Optimization Guide for AMD Family 10h and 12h Processors, order# 40546. With this workaround applied, the DIV/IDIV latency for AMD Family 12h Processors are similar to the DIV/IDIV latency for AMD Family 10h Processors.
Fix Planned
No
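As a small illustration of the workaround bit named in the erratum, the sketch below tests and sets bit 31 of a raw 64-bit MSR value. It assumes the value of MSR C001_1029h has already been read by some privileged means (reading an MSR requires ring 0, which is not shown here); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ERRATUM_665_MSR 0xC0011029u  /* MSRC001_1029 in AMD's notation */
#define ERRATUM_665_BIT 31           /* workaround bit from the erratum */

/* True if the erratum 665 workaround bit is set in a raw MSR value. */
bool erratum_665_workaround_applied(uint64_t msr_value){
  return ((msr_value >> ERRATUM_665_BIT) & 1u) != 0;
}

/* The value a BIOS would write back to enable the workaround. */
uint64_t erratum_665_with_workaround(uint64_t msr_value){
  return msr_value | (UINT64_C(1) << ERRATUM_665_BIT);
}

void erratum_665_self_check(void){
  uint64_t raw = 0;  /* made-up starting value, purely for illustration */
  assert(!erratum_665_workaround_applied(raw));
  raw = erratum_665_with_workaround(raw);
  assert(raw == UINT64_C(0x80000000));
  assert(erratum_665_workaround_applied(raw));
}
```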
I'm a bit concerned that the code generated for if (wsdHooks.xBenignBegin) isn't very general. It assumes the only true value is 1 whereas it should really be testing for any nonzero value. Still, MSVC is sometimes baffling that way. It is probably nothing. Never mind: these instructions are for C code not presented.
Given that the eflag Z bit is clear and EAX is zero, the code did not get here by executing the instruction test eax,eax. There must be a jump from somewhere else to the instruction following (719f9fa9 je SQLite_Interop!pcache1Fetch+0x2d) or even the call instruction itself.
Another complication is that with the x86 family, it is common for an invalid jump target (like the second byte of the JE instruction) to execute unperturbed (no faults) for quite a few instructions, often eventually getting back on the proper instruction alignment. Said another way, you may not be looking for a jump to the beginning of any of these instructions: a jump might be in the midst of their bytes, resulting in executing unremarkable operations like add [al+ebp],al which tend not to be noticed.
I predict that a breakpoint at the test instruction will not be hit for the exception. The only way to find such causes is either to be very lucky, or to suspect everything and prove each suspect innocent one by one.
Before considering the possibility of a CPU bug, try to rule out the more probable causes:
- A different code path to the call instruction. Use the uf command to disassemble the function and look for other jumps / branches to the call instruction.
- Jump / call to 0 from the hook function. Run dps SQLite_Interop!sqlite3Hooks l 2 and verify that it shows nulls.