Linux service crashes
I have a Linux service (C++, with lots of loadable modules, basically .so files picked up at runtime) which crashes from time to time. I would like to get to the bottom of this crash and investigate it, but at the moment I have no clue how to proceed. So, I'd like to ask you the following:
- If a Linux service crashes, where is the "core" file created? I have set ulimit -c 102400, which should be enough, but I cannot find the core files anywhere :(.
- Are there any Linux logs that track services? The service's own log obviously doesn't tell me that it is about to crash...
- It might be that one of the modules is crashing, but I cannot tell which one. I cannot even tell which modules are loaded. Do you know how to show in Linux which modules a service is using?
- Any other hints you might have for debugging a Linux service?
Comments (3)
Under Linux, processes which switch user ID get their core files disabled for security reasons. This is because they often do things like reading privileged files (think /etc/shadow), and a core file could contain sensitive information.
To enable core dumping on processes which have switched user ID, you can use prctl with PR_SET_DUMPABLE.
Core files are normally dumped in the current working directory. If that directory is not writable by the current user, the dump will fail, so ensure that the process's current working directory is writable.
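A minimal sketch of the two checks above, assuming the service can call them once right after it has dropped privileges. The function names enable_core_dumps and cwd_is_writable are made up for illustration; prctl, PR_SET_DUMPABLE, PR_GET_DUMPABLE and access are the real Linux/POSIX calls.

```cpp
#include <sys/prctl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: re-enable core dumps for this process.
// Call it AFTER any setuid()/setgid() call, because changing
// credentials resets the "dumpable" flag.
bool enable_core_dumps() {
    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) != 0) {
        std::fprintf(stderr, "prctl(PR_SET_DUMPABLE) failed: %s\n",
                     std::strerror(errno));
        return false;
    }
    // Read the flag back to confirm that it is now set.
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) == 1;
}

// Hypothetical helper: true if the current working directory is writable.
// Note that access() checks against the REAL user ID of the process.
bool cwd_is_writable() {
    return access(".", W_OK) == 0;
}
```

Logging the result of both helpers at startup would show right away whether either condition is the reason no core file appears.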
0) Get a staging environment which mimics production as closely as possible. Reproduce the problem there.
1) You can attach to a running process using
gdb -p <pid>
(you need a debug build, of course).
2) Make sure the ulimit is what you think it is (output ulimit to a file from the shell script which runs your service, right before starting it). Usually ulimit is set in /etc/profile or in the script that starts the service; set it with
ulimit -c unlimited
for an unlimited core size (ulimit -c 0 disables core dumps).
3) Find the core file using
find / -name \*core\* -print
or similar.
4) I think gdb will give you the list of loaded shared objects (.so) when you attach to the process (info sharedlibrary prints the list on demand).
5) Add more logging to your service.
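Steps 2 and 3 can also be checked from inside the service itself. The sketch below is only an illustration: raise_core_limit and read_core_pattern are hypothetical names, and it assumes a Linux system where /proc is mounted. getrlimit/setrlimit with RLIMIT_CORE are the calls behind ulimit -c, and /proc/sys/kernel/core_pattern holds the kernel's template for core file names (a value starting with '|' means cores are piped to a program instead of being written as a file).

```cpp
#include <sys/resource.h>
#include <fstream>
#include <string>

// Hypothetical helper: raise the soft core-size limit to the hard limit,
// which an unprivileged process is always allowed to do.
// On success, soft_out receives the new soft limit
// (RLIM_INFINITY means "unlimited", 0 means core dumps are disabled).
bool raise_core_limit(rlim_t& soft_out) {
    struct rlimit lim {};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) return false;
    lim.rlim_cur = lim.rlim_max;
    if (setrlimit(RLIMIT_CORE, &lim) != 0) return false;
    soft_out = lim.rlim_cur;
    return true;
}

// Hypothetical helper: return the kernel's core file name pattern,
// or an empty string if the file cannot be read.
std::string read_core_pattern() {
    std::ifstream in("/proc/sys/kernel/core_pattern");
    std::string pattern;
    std::getline(in, pattern);
    return pattern;
}
```

Writing both values to the service log at startup records the limit the process really runs with, which can differ from what an interactive shell reports.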
Good luck!
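On the question of which modules are loaded: besides asking gdb, the service can log the list itself. This is a rough sketch assuming glibc on Linux; loaded_shared_objects and collect_object are made-up names, while dl_iterate_phdr is the real glibc function that walks every object mapped into the process. From outside the process, /proc/<pid>/maps shows the same mapped files.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // dl_iterate_phdr is a GNU extension
#endif
#include <link.h>
#include <cstddef>
#include <string>
#include <vector>

// Callback invoked by dl_iterate_phdr once per loaded object.
static int collect_object(struct dl_phdr_info* info, size_t /*size*/, void* data) {
    auto* names = static_cast<std::vector<std::string>*>(data);
    // The main executable is reported with an empty name, so skip it.
    if (info->dlpi_name != nullptr && info->dlpi_name[0] != '\0') {
        names->emplace_back(info->dlpi_name);
    }
    return 0;  // returning 0 keeps the iteration going
}

// Hypothetical helper: names of all shared objects currently loaded
// into this process, including ones opened later with dlopen().
std::vector<std::string> loaded_shared_objects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collect_object, &names);
    return names;
}
```

Calling this after the modules have been picked up, and again from time to time, leaves a record in the log of what was loaded shortly before a crash.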
Your first order of business should be getting a core file. See if this answer applies.
Second, you should run your server under Valgrind and fix any errors it finds.
Reproducing the crash when running under GDB (as MK suggested) is possible, but somewhat unlikely: bugs tend to hide when you are looking for them, and the debugger may affect timing (especially if your server is multi-threaded).