Ruby/Glibc coredump（双重释放或损坏）

发布于 2024-08-20 17:19:56 字数 3082 浏览 13 评论 0原文

我正在使用我自己用 Ruby 编写的分布式持续集成工具。它使用 Mike Perham 的“政治”分支来分配任务。 “政治”模块正在使用 mDNS 部分的线程。

我时不时地会遇到一个我不明白的核心转储：

*** glibc detected *** ruby: double free or corruption (fasttop): 0x086d8600 ***
======= Backtrace: =========
/lib/libc.so.6[0xb7cef494]
/lib/libc.so.6[0xb7cf0b93]
/lib/libc.so.6(cfree+0x6d)[0xb7cf3c7d]
/usr/lib/libruby18.so.1.8[0xb7e8adf8]
/usr/lib/libruby18.so.1.8(ruby_xmalloc+0x85)[0xb7e8b395]
/usr/lib/libruby18.so.1.8[0xb7e5065e]
...
/usr/lib/libruby18.so.1.8[0xb7e717f4]
/usr/lib/libruby18.so.1.8[0xb7e74296]
/usr/lib/libruby18.so.1.8(rb_yield+0x27)[0xb7e7fb57]
======= Memory map: ========
...

我正在 Gentoo 上运行，并使用“-gdbg”重建了 Ruby 和 Glibc，并关闭了条带化以获得有意义的核心：

...
Core was generated by `ruby /home/develop/dcc/bin/dcc-worker'.
Program terminated with signal 6, Aborted.
#0  0xb7f20410 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7f20410 in __kernel_vsyscall ()
#1  0xb7cacb60 in *__GI___open_catalog (cat_name=0x6 <Address 0x6 out of bounds>, nlspath=0xbf9d6f00 " ", env_var=0x0, catalog=0x1) at open_catalog.c:237
#2  0xb7cae498 in __sigdelset (set=0x6) from /lib/libc.so.6
#3  *__GI_sigfillset (set=0x6) at ../signal/sigfillset.c:42
#4  0xb7ce952d in freopen64 (filename=0x2 <Address 0x2 out of bounds>, mode=0xb7db02c8 "\" total=\"%zu\" count=\"%zu\"/>\n", fp=0x9) at freopen64.c:47
#5  0xb7cef494 in _IO_str_init_readonly (sf=0x86d8600, ptr=0xb7eef5a9 "te\213V\b\205\322\017\204\220", size=-1210273804) at strops.c:88
#6  0xb7cf0b93 in mALLINFo (av=0xb) at malloc.c:5865
#7  0xb7cf3c7d in __libc_calloc (n=141395456, elem_size=3214793136) at malloc.c:4019
#8  0xb7e8adf8 in ?? () at gc.c:1390 from /usr/lib/libruby18.so.1.8
#9  0x086d8600 in ?? ()
#10 0xb7e89400 in rb_gc_disable () at gc.c:256
#11 0xb7e8b395 in add_freelist () at gc.c:1087
#12 gc_sweep () at gc.c:1186
#13 garbage_collect () at gc.c:1524
#14 0xb7e5065e in ?? () from /usr/lib/libruby18.so.1.8
#15 0x00000340 in ?? ()
#16 0x00000000 in ?? ()
(gdb)

嗯？？？对我来说，这看起来完全是 Ruby 实习生。关于 stackoverflow 上的其他“双重释放或损坏”问题，我发现线程可能是问题的一部分。

而且问题不会发生在完全相同的位置。我有另一个回溯，它更长，但崩溃也在 garbage_collect 中，但路径略有不同：

(gdb) bt
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xf7c8b8c0 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0xf7c8d1f5 in *__GI_abort () at abort.c:88
#3  0xf7cc7e35 in __libc_message (do_abort=2, fmt=0xf7d8daa8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
#4  0xf7ccdd24 in malloc_printerr (action=2, str=0xf7d8dbec "double free or corruption (fasttop)", ptr=0x911f5d0) at malloc.c:6197
#5  0xf7ccf403 in _int_free (av=0xf7daa380, p=0x911f5c8) at malloc.c:4750
#6  0xf7cd24ad in *__GI___libc_free (mem=0x911f5d0) at malloc.c:3716
#7  0xf7e68768 in obj_free () at gc.c:1366
#8  gc_sweep () at gc.c:1174
#9  garbage_collect () at gc.c:1524
#10 0xf7e68be5 in rb_newobj () at gc.c:436
#11 0xf7eb9840 in str_alloc (klass=0) at string.c:67
... (150 lines of rb_eval/call/yield etc.)

有没有人建议如何隔离并解决这个问题？

原文

I am using a distributed continuous integration tool which I have written by myself in Ruby. It uses a fork of Mike Perham's "politics" for distribution of the tasks. The "politics" module is using threads for the mDNS part.

Every now and then I encounter a core dump which I don't understand:

*** glibc detected *** ruby: double free or corruption (fasttop): 0x086d8600 ***
======= Backtrace: =========
/lib/libc.so.6[0xb7cef494]
/lib/libc.so.6[0xb7cf0b93]
/lib/libc.so.6(cfree+0x6d)[0xb7cf3c7d]
/usr/lib/libruby18.so.1.8[0xb7e8adf8]
/usr/lib/libruby18.so.1.8(ruby_xmalloc+0x85)[0xb7e8b395]
/usr/lib/libruby18.so.1.8[0xb7e5065e]
...
/usr/lib/libruby18.so.1.8[0xb7e717f4]
/usr/lib/libruby18.so.1.8[0xb7e74296]
/usr/lib/libruby18.so.1.8(rb_yield+0x27)[0xb7e7fb57]
======= Memory map: ========
...

I am running on Gentoo and have rebuild Ruby and Glibc with "-gdbg" and turned off the striping to get a meaningful core:

...
Core was generated by `ruby /home/develop/dcc/bin/dcc-worker'.
Program terminated with signal 6, Aborted.
#0  0xb7f20410 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7f20410 in __kernel_vsyscall ()
#1  0xb7cacb60 in *__GI___open_catalog (cat_name=0x6 <Address 0x6 out of bounds>, nlspath=0xbf9d6f00 " ", env_var=0x0, catalog=0x1) at open_catalog.c:237
#2  0xb7cae498 in __sigdelset (set=0x6) from /lib/libc.so.6
#3  *__GI_sigfillset (set=0x6) at ../signal/sigfillset.c:42
#4  0xb7ce952d in freopen64 (filename=0x2 <Address 0x2 out of bounds>, mode=0xb7db02c8 "\" total=\"%zu\" count=\"%zu\"/>\n", fp=0x9) at freopen64.c:47
#5  0xb7cef494 in _IO_str_init_readonly (sf=0x86d8600, ptr=0xb7eef5a9 "te\213V\b\205\322\017\204\220", size=-1210273804) at strops.c:88
#6  0xb7cf0b93 in mALLINFo (av=0xb) at malloc.c:5865
#7  0xb7cf3c7d in __libc_calloc (n=141395456, elem_size=3214793136) at malloc.c:4019
#8  0xb7e8adf8 in ?? () at gc.c:1390 from /usr/lib/libruby18.so.1.8
#9  0x086d8600 in ?? ()
#10 0xb7e89400 in rb_gc_disable () at gc.c:256
#11 0xb7e8b395 in add_freelist () at gc.c:1087
#12 gc_sweep () at gc.c:1186
#13 garbage_collect () at gc.c:1524
#14 0xb7e5065e in ?? () from /usr/lib/libruby18.so.1.8
#15 0x00000340 in ?? ()
#16 0x00000000 in ?? ()
(gdb)

Hmm??? For me this looks like it's totally Ruby intern. On other "double free or corruption" problems here at stackoverflow I have seen that maybe threads are part of the problem.

Also the problem does not occur at the exactly same position. I have another backtrace which is much longer but the crash is also in garbage_collect but with a slightly different path:

(gdb) bt
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xf7c8b8c0 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0xf7c8d1f5 in *__GI_abort () at abort.c:88
#3  0xf7cc7e35 in __libc_message (do_abort=2, fmt=0xf7d8daa8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
#4  0xf7ccdd24 in malloc_printerr (action=2, str=0xf7d8dbec "double free or corruption (fasttop)", ptr=0x911f5d0) at malloc.c:6197
#5  0xf7ccf403 in _int_free (av=0xf7daa380, p=0x911f5c8) at malloc.c:4750
#6  0xf7cd24ad in *__GI___libc_free (mem=0x911f5d0) at malloc.c:3716
#7  0xf7e68768 in obj_free () at gc.c:1366
#8  gc_sweep () at gc.c:1174
#9  garbage_collect () at gc.c:1524
#10 0xf7e68be5 in rb_newobj () at gc.c:436
#11 0xf7eb9840 in str_alloc (klass=0) at string.c:67
... (150 lines of rb_eval/call/yield etc.)

Has anyone a suggestion how to isolate and maybe solve this problem?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

呆头 2024-08-27 17:19:56

快速、简单，但没有那么有用：导出 MALLOC_CHECK_=2。这会导致 glibc 在 free() 期间执行一些额外级别的检查，以避免堆损坏。一旦检测到损坏，它就会abort()并提供核心转储，而不是等到出现由损坏引起的实际问题。

不是那么快速和简单，但更有帮助（如果你让它工作的话）：valgrind。

回复收藏 0 原文

幸福不弃 2024-08-27 17:19:56

Valgrind 可以轻松发现堆损坏问题。在 valgrind 下使用 Ruby 1.8 时会报告一些虚假错误，但可以使用此 ruby 补丁（并使用 --enable-valgrind 进行配置）或使用 valgrind 抑制文件。要在 valgrind 下运行 ruby 程序，只需在命令前加上 valgrind 前缀：

valgrind ruby /home/develop/dcc/bin/dcc-worker

如果崩溃的进程是您正在运行的进程的子进程，请使用 valgrind --trace-children=yes代码>.特别注意无效写入，这是堆损坏的迹象。

Valgrind makes it easy to find heap corruption issues. There are some spurious errors reported when using Ruby 1.8 under valgrind, but they can be eliminated using this ruby patch (and configuring with --enable-valgrind) or using a valgrind suppression file. To run your ruby program under valgrind, just prefix the command with valgrind:

valgrind ruby /home/develop/dcc/bin/dcc-worker

If the crashing process is a child of the process you are running, use valgrind --trace-children=yes. Look in particular for invalid writes, which are a sign of heap corruption.

回复收藏 0 原文