Short term
1. The first thing to do is work out what was done to trigger the problem and try to reproduce it. If you can reproduce it, you can track it down in a debugging environment.
2. If it is not reproducible, look through all the information you collected in step 1 (including any logging) and see whether you can spot a possible cause.
3. If the problem still has not been found, you will need to add logging, and lots of it. This is where a "DEBUG" logging setting comes in handy. It will probably slow down the system, and may even mask the problem (which itself tells you something about the nature of the problem).
4. With the new logging information you can go back to step 1. Repeat this until the problem is solved!
In the long term, the most obvious thing to do is make sure you have sufficient logging in place to catch problems, even if it has to be switched on and off. Beyond that, you need to beef up the testing effort.
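One way to keep detailed logging in place but switched off by default is to read the log level from the environment. The following is a minimal sketch in Python (chosen only for illustration, since the question does not fix a language); the variable name `APP_LOG_LEVEL` and the logger name `app` are assumptions, not something from the original system.

```python
import logging
import os

# Level names accepted in the (hypothetical) APP_LOG_LEVEL variable.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(default_level: str = "INFO") -> logging.Logger:
    """Configure the root logger from APP_LOG_LEVEL and return the app logger."""
    level_name = os.environ.get("APP_LOG_LEVEL", default_level).upper()
    # Unknown names fall back to INFO rather than failing at startup.
    level = _LEVELS.get(level_name, logging.INFO)
    logging.basicConfig(
        level=level,
        # Thread name and timestamp help when chasing race conditions.
        format="%(asctime)s %(levelname)s %(name)s [%(threadName)s] %(message)s",
        force=True,  # replace any handlers installed earlier (Python 3.8+)
    )
    return logging.getLogger("app")
```

Starting the process with `APP_LOG_LEVEL=DEBUG` then turns on the verbose output without a code change; leaving the variable unset keeps the quieter default.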
When you have tracked down a problem, it is worth noting the type of problem (race condition, scalability, database access, etc.). This gives you an area to apply more automated and manual tests.
You have some good initial ideas. Here are my comments:
Add logging to your code - you will get very little information from the operating system about your code.
If exceptions can be thrown by methods that you call, you should catch them. Don't let them bubble up to the end user!
Run Valgrind now, not later.
Set up a test environment that simulates your production environment. Start simple, and increase the complexity until you are able to reproduce your issue. You do have a test environment, right?
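The logging and exception points above can be combined at the outermost boundary of a request or job. This is only a sketch in Python for illustration: `handle_request`, `process`, the payload shape, and the error message are all hypothetical names, not part of the system being discussed.

```python
import logging

logger = logging.getLogger("app")


def process(payload: dict) -> str:
    """Hypothetical business logic; raises KeyError if order_id is missing."""
    return "order %d accepted" % payload["order_id"]


def handle_request(payload: dict) -> dict:
    """Outermost boundary: log the full traceback, return a safe message."""
    try:
        return {"ok": True, "result": process(payload)}
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("request failed, payload keys=%s", sorted(payload))
        return {"ok": False, "error": "Internal error; the problem has been logged."}
```

The end user sees a neutral message, while the log keeps the traceback you will need later. Catching broadly like this is reasonable only at the top-level boundary; deeper in the code, catch the specific exceptions you can actually handle.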
The very first thing you should do is determine the severity of the problem. This will help to devise your short-term strategy. You will need to have some brief discussions with the major stakeholders in the software (such as the client), or have a project manager do this and report back to you.
In the heat of the moment, this is the step most often overlooked, and rushing a short-term fix almost always means wasting a lot of time without really understanding what needs to be done.
After this, your actual strategy, both long term and short term, is rather dependent on the technology you are using and how it is deployed.
Short term
It is absolutely vital to grab some preliminary information about the crash before attempting to resolve the problem: grab log files, take screenshots, note down system info such as memory/CPU usage, and archive any temporary data that might be useful.
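Collecting this information is easier under pressure if it is scripted in advance. Below is a rough sketch using only the Python standard library; the function name `snapshot_incident`, the `incidents` directory, and the JSON field names are assumptions made for illustration. Memory usage is left out because the standard library has no portable call for it.

```python
import json
import os
import platform
import shutil
import time
from pathlib import Path


def snapshot_incident(log_paths, dest_root="incidents"):
    """Copy logs and record basic system info into a timestamped directory."""
    dest = Path(dest_root) / time.strftime("%Y%m%dT%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in map(Path, log_paths):
        if path.is_file():  # silently skip logs that do not exist
            shutil.copy2(path, dest / path.name)  # copy2 keeps timestamps
            copied.append(path.name)

    try:
        load_avg = os.getloadavg()  # Unix only
    except (AttributeError, OSError):
        load_avg = None

    info = {
        "taken_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "host": platform.node(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "load_avg": load_avg,
        "disk_free_bytes": shutil.disk_usage(dest).free,
        "copied_logs": copied,
    }
    (dest / "system_info.json").write_text(json.dumps(info, indent=2))
    return dest
```

Running this before any restart preserves the evidence even if the restart rotates or truncates the live log files.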
The short-term action should be to get the system up-and-running again, quickly. Some common approaches to short-term solutions:
Try turning it off and on again... Seriously, 90% of the time this will get production running again in the short term, at least until the bug manifests itself again.
Revert to a previous production release, preferably the latest version that was known to work fairly reliably.
Run a second instance on another machine and fail-over if the problem occurs again. This has the added bonus that logs and system state are preserved after the last crash occurred.
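The "turn it off and on again" approach can be automated crudely while the real fix is being investigated. The sketch below is an illustration only: `run_with_restarts` and its parameters are hypothetical, and in a real deployment a process supervisor (for example systemd) would normally take this role.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, delay_seconds=1.0):
    """Run cmd, restarting it after a non-zero exit, up to max_restarts times.

    Returns the list of exit codes observed, oldest first.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.call(cmd)
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown, nothing to restart
        print(f"exit code {code} on attempt {attempt + 1}", file=sys.stderr)
        if attempt < max_restarts:
            time.sleep(delay_seconds)  # brief pause before restarting
    return exit_codes
```

Capping the number of restarts matters: a process that crashes immediately on every start should be escalated to a human rather than restarted forever.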
Long term
In the long term, you will want to properly analyse the information you gathered at the time of failure. Where possible, try to reproduce the problem as closely as you can. Check out the version of the code that was deployed (you do use version control tools, right?), and check high-level factors as well as low-level configuration ones. For example, who was using the system when it crashed? Can they show you what they did?
Debugging and logging may be useful at this stage, along with all the usual developer tools such as functional tests and memory profilers. A crash could come from a number of sources, from memory protection faults to an unexpected state of a resource. You should compile a list of candidate problems, and cross them off as you gain confidence that they aren't the cause of the crash.
Apart from logging, you can enable the creation of minidump (.mdmp) files on Windows or core dumps on Linux and examine them later. Both contain the context of the application at the moment the crash occurred. One downside of this approach is that core dumps can be pretty big.
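On Linux, core dumps are often disabled because the core-file size limit is 0. A process can raise its own soft limit up to the hard limit at startup, which is the in-process equivalent of running `ulimit -c` in the launching shell. A minimal sketch, assuming a Unix system and using Python only for illustration; `enable_core_dumps` is a hypothetical helper name.

```python
import resource  # Unix only


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

If the hard limit is itself 0, this changes nothing and the limit has to be raised by whatever launches the process. Where the core file ends up is decided by the kernel's `core_pattern` setting, so check that and the available disk space before relying on it.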