Debugging SIGBUS on x86 Linux
What can cause SIGBUS (bus error) in a generic x86 userland application on Linux? All of the discussion I've been able to find online is about memory alignment errors, which from what I understand don't really apply to x86.
(My code is running on a Geode, in case there are any relevant processor-specific quirks there.)
Comments (9)
`SIGBUS` can happen in Linux for quite a few reasons other than memory alignment faults. For example, you get one if you attempt to access an `mmap` region beyond the end of the mapped file.
Are you using anything like `mmap`, shared memory regions, or similar?
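The `mmap` case described above can be reproduced in a few lines. This is a minimal sketch, assuming Linux and a writable `/tmp`; the function name `sigbus_past_eof` and the temp file name are made up for illustration. It maps two pages over a one-page file and touches the second page, catching the signal with `sigsetjmp`/`siglongjmp` so the process survives.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;

/* Handler: jump back to the sigsetjmp point instead of dying. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(bus_jmp, 1);
}

/* Hypothetical helper: returns 1 if touching a mapped page that lies
 * entirely past end-of-file raised SIGBUS, 0 if the access succeeded,
 * and -1 if the setup failed. */
int sigbus_past_eof(void)
{
    char path[] = "/tmp/sigbus-demo-XXXXXX";   /* illustrative name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    long page = sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, page) != 0) {            /* file is one page long */
        close(fd);
        return -1;
    }

    /* Map two pages: the second one has no file data behind it. */
    volatile char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    int result;
    if (sigsetjmp(bus_jmp, 1) == 0) {
        p[0] = 'x';        /* inside the file: fine */
        p[page] = 'y';     /* past end-of-file: expected to raise SIGBUS */
        result = 0;
    } else {
        result = 1;        /* we got here via the SIGBUS handler */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap((void *)p, 2 * page);
    return result;
}
```

Jumping out of the handler is only reasonable for a demonstration like this one; in a real program the faulting access usually means the file was truncated behind your back, and that is the thing to fix.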
You can get a SIGBUS from an unaligned access if you turn on the unaligned access trap, but normally that's off on an x86. You can also get it from accessing a memory mapped device if there's an error of some kind.
Your best bet is using a debugger to identify the faulting instruction (SIGBUS is synchronous), and trying to see what it was trying to do.
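If attaching a debugger is not convenient, a handler installed with `SA_SIGINFO` can at least report why the kernel sent the signal. The sketch below is only an illustration: `bus_code_name` and `install_sigbus_reporter` are hypothetical names, while the `BUS_*` values are the standard POSIX `si_code` values for SIGBUS.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Translate the si_code of a SIGBUS into a readable name. */
const char *bus_code_name(int si_code)
{
    switch (si_code) {
    case BUS_ADRALN: return "BUS_ADRALN (invalid address alignment)";
    case BUS_ADRERR: return "BUS_ADRERR (nonexistent physical address)";
    case BUS_OBJERR: return "BUS_OBJERR (object-specific hardware error)";
    default:         return "other";
    }
}

/* write(2) wrapper; printf is not safe inside a signal handler. */
static void emit(const char *s)
{
    if (write(STDERR_FILENO, s, strlen(s)) < 0) {
        /* nothing useful to do if stderr is gone */
    }
}

/* Handler: print the reason code and terminate. */
static void report_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    emit("SIGBUS: ");
    emit(bus_code_name(info->si_code));
    emit("\n");
    _exit(1);
}

/* Install the reporting handler; returns 0 on success, -1 on failure. */
int install_sigbus_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = report_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Call `install_sigbus_reporter()` early in `main`. The `siginfo_t` also carries the faulting address in `si_addr`, which is worth logging too. As far as I know, Linux reports a fault past the end of a mapped file as `BUS_ADRERR` and an alignment-check fault as `BUS_ADRALN`.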
SIGBUS on x86 (including x86_64) Linux is a rare beast. It may result from an attempt to access past the end of an `mmap`ed file, or from some other situations described by POSIX.
From hardware faults, however, it is not easy to get SIGBUS. Unaligned access from any instruction, be it SIMD or not, usually results in SIGSEGV. Stack overflows result in SIGSEGV. Even accesses to addresses not in canonical form result in SIGSEGV. All of this is due to #GP being raised, which almost always maps to SIGSEGV.
Now, here are some ways to get SIGBUS due to a CPU exception:
- Enable the AC bit in `EFLAGS`, then do an unaligned access with any memory read or write instruction. See this discussion for details.
- Cause a canonical violation via a stack pointer register (`rsp` or `rbp`), generating #SS. This can be demonstrated with GCC inline assembly (compiled with `gcc test.c -o test -masm=intel`).
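The #SS route can be sketched as below. This is not the answer's original listing, just an illustration under a few assumptions: x86-64 only, GCC or Clang, and written in the default AT&T syntax so that it builds without `-masm=intel`. The function name `signal_from_noncanonical_rbp` is made up. The fault is triggered in a forked child so the caller can inspect which signal killed it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of the signal that killed the
 * child, 0 if the child exited normally, or -1 if fork/waitpid failed. */
int signal_from_noncanonical_rbp(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: load a non-canonical address into rbp and read through
         * it. rbp-based addressing goes through the stack segment, so the
         * CPU raises #SS rather than #GP. The asm clobbers rbp and rax
         * without telling the compiler, which is tolerable only because
         * the child is expected to die on the second instruction. */
        __asm__ volatile(
            "movabsq $0x0400000000000000, %rbp\n\t"
            "movq (%rbp), %rax\n\t");
        _exit(0);   /* not expected to be reached */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a typical x86-64 Linux system the returned value should equal `SIGBUS`. Using `rax` as the base register for the same address would raise #GP instead, and the child would die with SIGSEGV.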
Oh yes, there's one more weird way to get SIGBUS.
If the kernel fails to page in a code page due to memory pressure (the OOM killer must be disabled) or a failed IO request, the process gets SIGBUS.
You may see SIGBUS when you're running the binary off NFS (network file system) and the file is changed. See https://rachelbythebay.com/w/2018/03/15/core/.
This was briefly mentioned above as a "failed IO request", but I'll expand on it a bit.
A frequent case is when you lazily grow a file using ftruncate, map it into memory, start writing data, and then run out of space in your filesystem. Physical space for a mapped file is allocated on page faults; if there is none left, the process receives a SIGBUS.
If you need your application to recover correctly from this error, it makes sense to explicitly reserve space with fallocate before calling mmap. Handling ENOSPC in errno after the fallocate call is much simpler than dealing with signals, especially in a multi-threaded application.
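A minimal sketch of the reserve-then-map idea follows. It uses `posix_fallocate`, the portable counterpart of Linux's `fallocate`; note that it returns the error number directly instead of setting `errno`, so the sketch copies it into `errno` itself. The helper name `reserve_and_map` is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make sure the filesystem has really allocated
 * len bytes for fd, then map them. Returns the mapping, or NULL with
 * errno set (ENOSPC if the filesystem is full). */
void *reserve_and_map(int fd, size_t len)
{
    /* Allocates blocks now and grows the file to len if it is shorter,
     * so later page faults on the mapping do not need new disk space. */
    int err = posix_fallocate(fd, 0, (off_t)len);
    if (err != 0) {
        errno = err;       /* posix_fallocate returns the error code */
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

With this in place, running out of disk space shows up as an ordinary `NULL` return that any thread can handle, instead of a SIGBUS delivered at some arbitrary store into the mapping.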
If you request a mapping backed by huge pages with `mmap` and the `MAP_HUGETLB` flag, you can get `SIGBUS` if the kernel runs out of allocated huge pages and thus cannot handle a page fault.
In this case, you'll need to raise the number of allocated huge pages via `/sys/kernel/mm/hugepages/hugepages-<size>/nr_hugepages`, or via `/sys/devices/system/node/nodeX/hugepages/hugepages-<size>/nr_hugepages` on NUMA systems.
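Before touching a `MAP_HUGETLB` mapping it can help to check how many huge pages are actually available. The sketch below just reads a single integer from a sysfs-style file. The helper name `read_sysfs_counter` is hypothetical, and the `2048kB` size in the usage note is an assumption (the common huge page size on x86); substitute whatever `<size>` your system exposes.

```c
#include <stdio.h>

/* Hypothetical helper: read one non-negative integer counter from a
 * file such as nr_hugepages. Returns the value, or -1 if the file
 * cannot be opened or does not start with a number. */
long read_sysfs_counter(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;

    fclose(f);
    return value;
}
```

For example, `read_sysfs_counter("/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages")` returning 0 would suggest that the next page fault in a huge page mapping is at risk of SIGBUS. The check is only advisory, since another process can take the pages between the read and the fault.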
A common cause of a bus error on x86 Linux is attempting to dereference something that is not really a pointer, or is a wild pointer. For example, failing to initialize a pointer, or assigning an arbitrary integer to a pointer and then attempting to dereference it will normally produce either a segmentation fault or a bus error.
Alignment does apply to x86. Even though memory on an x86 is byte-addressable (so you can have a char pointer to any address), if you have, for example, a pointer to a 4-byte integer, that pointer must be aligned.
You should run your program in gdb and determine which pointer access is generating the bus error to diagnose the issue.
It's a bit off the beaten path, but you can get SIGBUS from an unaligned SSE2 (m128) load.