
Software crashes in production and you have no access to a debugger. What should you do in the short term and in the long term?

Posted 2024-12-11 20:23:51


入画浅相思 2024-12-18 20:23:51

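Short term
1. The very first thing to do is find out what triggers the problem and try to reproduce it. If you can, you can then trace it in a debugging environment.
2. If you cannot reproduce it, go through all the information gathered in step 1, including any logs, and see whether a likely cause stands out.
3. If you still have not found the problem, you will need to add logging, and lots of it. This is where a "DEBUG" logging level comes in handy. It may slow the system down and may even mask the problem (which in itself tells you something about the nature of the problem).
4. With the new log information, go back to step 1. Repeat until the problem is solved!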
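
One way to make DEBUG logging something you can switch on in production without a code change is to read the level from the environment at startup. The sketch below is only an illustration: the variable name `APP_LOG_LEVEL` and the function `configure_logging` are assumptions, not something named in this thread.

```python
import logging
import os


def configure_logging(env=os.environ):
    """Set the root log level from APP_LOG_LEVEL (a hypothetical variable name)."""
    level_name = env.get("APP_LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown or misspelled names fall back to INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return level
```

Starting the service with `APP_LOG_LEVEL=DEBUG` would then turn on verbose output for that run only, and removing the variable restores the quieter default.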

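In the long term, the most obvious thing is to make sure you have adequate logging, even if it has to be switched on and off to track down a problem. Beyond that, you will want to step up your testing effort.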

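When you do find the problem, note what type it is (race condition, scalability, database access, and so on). That tells you which areas need more automated and manual testing.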


赠佳期 2024-12-18 20:23:51

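You have some good initial ideas. Here are my comments:

- Add logging to your code. The operating system will tell you very little about what your code was doing.
- If the methods you call can throw exceptions, catch them. Don't let them bubble up to the end user!
- Run valgrind now, not later.
- Set up a test environment that mimics production. Start simple, then add complexity until you can reproduce the problem. You do have a test environment, right?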
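
As a rough illustration of the exception advice above, a top-level boundary can log the full traceback for developers while showing the user only a short message with a reference number. The names `run_safely` and the `"app"` logger are hypothetical and chosen just for this sketch.

```python
import logging
import uuid

logger = logging.getLogger("app")  # hypothetical logger name


def run_safely(handler, *args, **kwargs):
    """Call handler; on failure, log the traceback and return a user-safe message.

    Returns (True, result) on success or (False, message) on failure.
    """
    try:
        return True, handler(*args, **kwargs)
    except Exception:
        incident_id = uuid.uuid4().hex[:8]
        # logger.exception records the message together with the traceback
        logger.exception("unhandled error, incident %s", incident_id)
        return False, f"Something went wrong (incident {incident_id})."
```

The incident id gives support staff something to search for in the logs when a user reports the failure.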


一束光，穿透我孤独的魂 2024-12-18 20:23:51

The very first thing you should do is determine the severity of the problem. This will help to devise your short-term strategy. You will need to have some brief discussions with the major stakeholders in the software (such as the client), or have a project manager do this and report back to you.

In the heat of the moment this step is often overlooked, and rushing a short-term fix almost always means wasting a lot of time without really understanding what needs to be done.

After this, your actual strategy, both long term and short term, is rather dependent on the technology you are using and how it is deployed.

Short term

It is absolutely vital to grab some preliminary information about the crash before attempting to resolve the problem: grab log files, take screenshots, note down system info such as memory/CPU usage, and archive any temporary data that might be useful.
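
A small script can make this capture step repeatable under pressure. The following is a minimal sketch using only the Python standard library; `capture_snapshot` and the file names it writes are assumptions for illustration, and it expects `out_dir` to live outside `log_dir` so the archive does not include itself. It records load average and free disk space rather than memory usage, since the standard library has no portable call for the latter.

```python
import json
import os
import platform
import shutil
import tarfile
import time
from pathlib import Path


def capture_snapshot(log_dir, out_dir):
    """Archive log_dir plus basic system info into a timestamped .tar.gz."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    info = {
        "captured_at": stamp,
        "platform": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "disk_free_bytes": shutil.disk_usage(out_dir).free,
    }
    try:
        info["load_average"] = os.getloadavg()  # Unix only
    except (AttributeError, OSError):
        pass
    info_path = out_dir / f"sysinfo-{stamp}.json"
    info_path.write_text(json.dumps(info, indent=2))
    archive_path = out_dir / f"crash-{stamp}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(log_dir), arcname="logs")  # adds the directory recursively
        tar.add(str(info_path), arcname="sysinfo.json")
    return archive_path
```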

The short-term action should be to get the system up-and-running again, quickly. Some common approaches to short-term solutions:

- Try turning it off and on again... Seriously, 90% of the time this will get production running again in the short term, at least until the bug manifests itself again.
- Revert to a previous production release, preferably the latest version that was known to work fairly reliably.
- Run a second instance on another machine and fail-over if the problem occurs again. This has the added bonus that logs and system state are preserved after the last crash occurred.
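
The "off and on again" approach can be automated with a small supervisor that restarts the process after a crash. This is only a sketch under simple assumptions: `supervise` is a hypothetical helper, a non-zero exit code is treated as a crash, and a real deployment would more likely rely on an existing process manager.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, base_delay=1.0, sleep=time.sleep):
    """Run cmd, restarting it after each non-zero exit; return all exit codes."""
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit, nothing to restart
        if attempt < max_restarts:
            sleep(base_delay * 2 ** attempt)  # exponential backoff between restarts
    return exit_codes
```

Capping the number of restarts matters: a process that crashes immediately on every start should be escalated to a human rather than restarted forever.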

Long term

In the long term, you will want to properly analyse the information you gathered at the time of failure. Where possible, try to reproduce the problem as closely as you can. Revert your code to the version that was deployed (you do use version control tools, right?) and check high-level factors as well as low-level configuration ones. For example, who was using the system when it crashed? Can they show you what they did?

Debugging and logging may be useful at this stage, as may all the usual developer tools such as functional tests and memory profiling tools. A crash could come from a number of sources, from memory protection faults to an unexpected state of a resource. You should compile a list of candidate problems, and cross them off as you gain confidence that they aren't the cause of the crash.


帅冕 2024-12-18 20:23:51

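Besides logging, you can also create mdmp files (Windows) or core dumps (Linux) and examine them later; one drawback of this approach is that core dumps can be very large. Both mdmp files and core dumps contain the context of the application at the moment of the crash.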
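
For a Python service, a lightweight relative of this idea is the standard `faulthandler` module, which writes the Python-level stack to a file when the process receives a fatal signal. The sketch below also raises the core-dump size limit where the Unix-only `resource` module is available. The helper names are hypothetical, and whether a core file is actually written still depends on operating system settings outside this code.

```python
import faulthandler

try:
    import resource  # Unix only
except ImportError:
    resource = None


def enable_crash_traces(path):
    """Write Python tracebacks for fatal signals (e.g. SIGSEGV) to path."""
    crash_file = open(path, "a")  # must stay open for the life of the process
    faulthandler.enable(file=crash_file, all_threads=True)
    return crash_file


def allow_core_dumps():
    """Raise the soft core-size limit to the hard limit; True if dumps are allowed."""
    if resource is None:
        return False
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return hard != 0
```

Calling both helpers once at startup costs very little and leaves something to examine after the next crash.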
