fopen() hangs. Sometimes.
I am running on Debian Etch: Linux nereus 2.6.18-6-686 #1 SMP Sat Dec 27 09:31:05 UTC 2008 i686 GNU/Linux
I have a multithreaded C application, and one thread is hanging. Sometimes. Through core files, I have figured out that it is hanging in an fopen() call:
#0 0xb7f4b410 in ?? ()
#1 0xb660521c in ?? ()
#2 0x000001b6 in ?? ()
#3 0x00008241 in ?? ()
#4 0xb77c45bb in open () from /lib/tls/i686/cmov/libc.so.6
#5 0xb7768142 in _IO_file_open () from /lib/tls/i686/cmov/libc.so.6
#6 0xb77682e8 in _IO_file_fopen () from /lib/tls/i686/cmov/libc.so.6
#7 0xb775d8c9 in fgets () from /lib/tls/i686/cmov/libc.so.6
#8 0xb775fe0a in fopen64 () from /lib/tls/i686/cmov/libc.so.6
#9 0x0805600f in comric_write_external_track_file (control=0xbfc9c284) at ../COMRIC/comric_thread.c:784
#10 0x08055b0e in store_tracks (control=0xbfc9c284, hdr=0xb3d1b828) at ../COMRIC/comric_thread.c:695
#11 0x080568be in comric_thread (userdata=0xbfc9c284) at ../COMRIC/comric_thread.c:997
#12 0xb789530f in g_thread_create_full () from /usr/lib/libglib-2.0.so.0
#13 0xb783f240 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#14 0xb77d349e in clone () from /lib/tls/i686/cmov/libc.so.6
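A side note on the trace: frames #0 to #3 have no symbols and may be stack residue rather than real frames, but the values 0x000001b6 and 0x00008241 happen to match the mode and flags that fopen64() with "w" would pass to open() on 32-bit x86 Linux. The sketch below decodes them under that assumption. The `X86_O_*` constants and the `decode_open_flags` helper are hypothetical names introduced for illustration, with the i386 values hard-coded so the result does not depend on the machine compiling it.

```c
#include <stdio.h>

/* open(2) flag values for 32-bit x86 Linux, hard-coded on the assumption
 * that the trace comes from an i686 build. */
enum {
    X86_O_WRONLY    = 0x0001,
    X86_O_CREAT     = 0x0040,
    X86_O_TRUNC     = 0x0200,
    X86_O_LARGEFILE = 0x8000
};

/* Hypothetical helper: write the names of the known flag bits into buf,
 * separated by '|'. Returns the bits that were not decoded (0 if every
 * set bit was recognised). buf should hold at least 64 bytes. */
unsigned decode_open_flags(unsigned flags, char *buf, size_t len)
{
    static const struct { unsigned bit; const char *name; } known[] = {
        { X86_O_WRONLY,    "O_WRONLY"    },
        { X86_O_CREAT,     "O_CREAT"     },
        { X86_O_TRUNC,     "O_TRUNC"     },
        { X86_O_LARGEFILE, "O_LARGEFILE" },
    };
    size_t used = 0;

    if (len > 0)
        buf[0] = '\0';
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if ((flags & known[i].bit) && used < len) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             used ? "|" : "", known[i].name);
            if (n > 0)
                used += (size_t)n;
            flags &= ~known[i].bit;
        }
    }
    return flags;
}
```

Calling `decode_open_flags(0x8241, buf, sizeof buf)` leaves `O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE` in `buf`, and 0x1b6 is octal 0666. If those really are the arguments, the thread is blocked inside a plain create-and-truncate open of the track file.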
This thread gets data from an external source, processes it, and writes it to a text file. The text file is rewritten from scratch every time we get new data. No one else is accessing this file. The file size is typically less than 1KB. I am checking the fclose() call to make sure it returns success, and it does.
When the main thread detects that we haven't heard from the problem thread in more than 30 seconds, it calls abort() so we can get the core dump you see above.
99% of the time, everything runs smoothly. But in the last four days, this has been happening more and more (6+ times a day). I am worried that it might be a hard drive problem, but I cannot find any errors reported in any of the logs. (Unfortunately, SMART information is not available.) This application has been running smoothly for 2 years.
Anyone have any thoughts?
Source code:
int comric_write_external_track_file( struct ComricControl *control ) {
    FILE *file;

    if( strlen( control->extern_track_file ) == 0 ) return 1;

    file = fopen( control->extern_track_file, "w" );
    if( !file ) {
        ps_slog( "ERROR opening external track file: \"%s\"", control->extern_track_file );
        return 0;
    }

    // Write the file
    G_MUTEX_LOCK( control->mutex );
    g_hash_table_foreach( control->tracks, comric_write_track, file );
    G_MUTEX_UNLOCK( control->mutex );

    fsync( fileno( file ));
    if( fclose( file ) != 0 ) {
        ps_slog( "FATAL ERROR - fclose() FAILED with error \"%s\" (%d)", strerror( errno ), errno );
        sleep( 1 ); abort(); // can we get any debug info out of this?
    }
    return 1;
}
I added the fsync() call after doing some hunting on the net. At first, I thought this might be related to the fclose() failing, but it doesn't seem to be the case.
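One detail about the fsync() call: it operates on the file descriptor, so data still sitting in the stdio buffer is not covered until fflush() or fclose() pushes it to the kernel. The following is a minimal sketch of the flush-then-sync order, not a fix for the hang in open(). The helper name `write_and_sync` and its parameters are hypothetical and stand in for the track-writing code above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write text to path, flushing stdio before fsync().
 * Returns 0 on success, or the errno value of the first step that failed. */
int write_and_sync(const char *path, const char *text)
{
    FILE *file = fopen(path, "w");
    if (!file)
        return errno;

    int err = 0;
    if (fputs(text, file) == EOF)
        err = errno;

    /* Push the stdio buffer into the kernel first; fsync() only covers
     * data that has already been written to the descriptor. */
    if (!err && fflush(file) != 0)
        err = errno;
    if (!err && fsync(fileno(file)) != 0)
        err = errno;

    /* Always close the stream, but report the earliest error. */
    if (fclose(file) != 0 && !err)
        err = errno;
    return err;
}
```

Returning the errno value also makes it possible to log why fopen() failed, which the original error message does not include.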
Comments (1)
You're hanging in open(), so this is likely a problem at the kernel or driver level. First, check dmesg for obvious error messages. If that turns up nothing, you can try using the SysRq w command to get a stack trace of the offending process.
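On a machine without convenient keyboard access, a SysRq command can usually be issued by writing its character to /proc/sysrq-trigger as root. The sketch below shows that in C. The helper name `sysrq_send` is hypothetical, the trigger path is passed in as a parameter so the function can be tried against an ordinary file, and whether `w` is supported depends on the kernel version.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: send one SysRq command character to a trigger file.
 * On Linux the real trigger is /proc/sysrq-trigger and requires root.
 * Returns 0 on success, or an errno value on failure. */
int sysrq_send(const char *trigger_path, char command)
{
    int fd = open(trigger_path, O_WRONLY);
    if (fd < 0)
        return errno;

    int err = 0;
    ssize_t n = write(fd, &command, 1);
    if (n != 1)
        err = (n < 0) ? errno : EIO;

    if (close(fd) != 0 && !err)
        err = errno;
    return err;
}
```

A call such as `sysrq_send("/proc/sysrq-trigger", 'w')` would be made while the thread is stuck, and the resulting output then read from dmesg or the kernel log.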