What's the procedure for debugging a production-only error?

Posted on 2024-09-05 07:04:05

Let me say upfront that I'm so ignorant on this topic that I don't even know whether this question has objective answers or not. If it ends up being "not," I'll delete or vote to close the post.

Here's the scenario: I just wrote a little web service. It works on my machine. It works on my team lead's machine. It works, as far as I can tell, on every machine except for the production server. The exception that the production server spits out upon failure originates from a third-party JAR file and is skimpy on information. I searched the web for hours but didn't come up with anything useful.

So what's the procedure for tracking down an issue that occurs only on production machines? Is there a standard methodology, or perhaps a category/family of tools, for this?

The error that inspired this question has already been fixed, but that was due more to good fortune than a solid approach to debugging. I'm asking this question for future reference.

EDIT:
The answer to this so far seems to be summed up by one word: logging. The one issue with logging is that it requires forethought. What if a situation comes up in an existing system with poor logging, or the client is worried about sensitive data and does not want extensive logging in the system in the first place?

Some related questions:
Test accounts and products in a production system
Running test on Production Code/Server

Comments (7)

揪着可爱 2024-09-12 07:04:05

In addition to logging, which is invaluable, here are some other techniques my co-workers and I have used over the years... going back to 16-bit Windows on client machines we had no access to. (Did I date myself?) Granted, not everything can/will work.

  • Analyze any and all behavior you see.
  • Reproduce it, if at all possible.
  • Desk check: walk through the code you suspect.
  • Rubber duck it with team members AND people who have little or no familiarity with the code. The more you have to explain something to someone, the better chance you have of uncovering something.
  • Don't get frustrated. Take a 5-10 minute break. Take a quick walk across the building/street/whatever. Don't think about the problem for that time.
  • Listen to your instincts.
面犯桃花 2024-09-12 07:04:05

This is one of the most difficult debugging scenarios. The answer will depend on the details of the production system. Is it a system you have full control over? Or is it installed on a client's machine, so that you need to get through numerous phone calls just to get access to a log file or modify a configuration parameter?

I believe that most people will agree that the most effective way of debugging this is to use logging. You need to act proactively and add as much logging information as possible. However, you must be able to enable and disable logging on demand. Extensive debug logs in a production system could kill performance. For the same reason, you need to be able to enable only specific parts of the logging. Create logical groups of log output and enable only the ones you think will give you the most relevant information.
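
A minimal sketch of the on-demand, per-group toggle described above. It uses the JDK's built-in `java.util.logging` rather than any particular third-party library, and the class name `LogGroups` and the group names `app.db` and `app.http` are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical logical groups; pick names that match your own modules.
final class LogGroups {
    static final Logger DB = Logger.getLogger("app.db");
    static final Logger HTTP = Logger.getLogger("app.http");

    private LogGroups() {}

    // Turn detailed logging for one group on or off without a restart.
    static void setVerbose(Logger group, boolean verbose) {
        group.setLevel(verbose ? Level.FINE : Level.WARNING);
    }

    public static void main(String[] args) {
        setVerbose(DB, false);
        setVerbose(HTTP, true);
        // Supplier form: the message string is only built if FINE is enabled.
        DB.fine(() -> "expensive dump of connection pool state");
        HTTP.fine(() -> "request headers: ...");
        System.out.println(DB.isLoggable(Level.FINE));   // false
        System.out.println(HTTP.isLoggable(Level.FINE)); // true
        // Note: the default console handler only prints INFO and above,
        // so handler levels must be lowered separately to see FINE output.
    }
}
```

In a real service the call to `setVerbose` would be wired to whatever runtime switch you have available (an admin endpoint, JMX, a watched config file), which is outside the scope of this sketch.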

随风而去 2024-09-12 07:04:05

I would start with the small, easy-to-check differences between production and test. Eliminate obvious stuff like permissions, firewalls, different versions, etc., through actual testing. The one time I cut corners and say "oh, that can't be it," it is.

Then I prioritize more expensive tests by likelihood and cost. Be creative. Think of really weird things that might cause the behaviour you see.
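
One cheap way to check the "different versions" kind of difference is to print the same set of facts on each machine and diff the output. The sketch below is only an illustration: the class name `EnvSnapshot` and the particular list of properties are assumptions, and it covers the JVM's view of the environment, not permissions or firewalls.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class EnvSnapshot {
    private EnvSnapshot() {}

    // Collect a sorted set of facts that commonly differ between machines.
    static Map<String, String> capture() {
        Map<String, String> facts = new TreeMap<>();
        String[] keys = {
            "java.version", "java.vendor", "os.name", "os.version",
            "file.encoding", "user.timezone", "java.class.path"
        };
        for (String key : keys) {
            // String.valueOf turns an unset property into the text "null".
            facts.put(key, String.valueOf(System.getProperty(key)));
        }
        facts.put("default.locale", Locale.getDefault().toString());
        facts.put("max.heap.bytes", Long.toString(Runtime.getRuntime().maxMemory()));
        return facts;
    }

    public static void main(String[] args) {
        // Run on both machines, save the output, and diff the two files.
        capture().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```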

∝单色的世界 2024-09-12 07:04:05

Typically speaking, "debugging" [i.e., attaching to a process and inspecting execution] is not viable, for many reasons, not the least of which is data sensitivity [e.g., developers are rarely qualified/cleared to inspect the data we manipulate].

So this usually comes down to inferring execution from secondary sources and artifacts. This then boils down to ...

  • Logging,
  • Logging,
  • Logging,

A large majority of software written these days falls into either the Java or the .NET camp, so leverage log4j or log4net respectively.

Also, having a bullet-proof Ops-centric configuration guide and validation process helps. Remember that the people responsible for the hardware and environment rarely understand the configuration requirements of the applications they are hosting.
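
As a rough illustration of what a validation step could look like, here is a sketch of a startup check that reports configuration problems in plain language that an operator can act on. The class name `ConfigCheck` and the setting names `db.url` and `service.port` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ConfigCheck {
    private ConfigCheck() {}

    // Returns human-readable problems; an empty list means the config passed.
    static List<String> validate(Map<String, String> config) {
        List<String> problems = new ArrayList<>();
        // Hypothetical required settings.
        for (String key : new String[] {"db.url", "service.port"}) {
            String value = config.get(key);
            if (value == null || value.trim().isEmpty()) {
                problems.add("missing required setting: " + key);
            }
        }
        String port = config.get("service.port");
        if (port != null && !port.trim().isEmpty()) {
            try {
                int p = Integer.parseInt(port.trim());
                if (p < 1 || p > 65535) {
                    problems.add("service.port out of range: " + port);
                }
            } catch (NumberFormatException e) {
                problems.add("service.port is not a number: " + port);
            }
        }
        return problems;
    }
}
```

Running such a check at startup and refusing to start on a non-empty result turns a vague runtime failure into an explicit message at deploy time.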

递刀给你 2024-09-12 07:04:05

I've used a configurable logging system such as Log4J to see what's happening during production runs. This assumes that developers have put useful debugging information in the logs.

But beware that logging might expose some sensitive private data, which should be encoded and/or skipped when possible.
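
A small sketch of the "skip" option: masking values before a message reaches the log. It assumes sensitive values show up as `key=value` pairs, and the key list (`password`, `token`, `ssn`) and the class name `LogRedactor` are hypothetical. Real data will need its own rules.

```java
import java.util.regex.Pattern;

final class LogRedactor {
    // Assumption: secrets appear as key=value pairs with one of these keys.
    private static final Pattern SECRET =
            Pattern.compile("(?i)\\b(password|token|ssn)=([^\\s&]+)");

    private LogRedactor() {}

    // Replace the value of each sensitive key with a fixed mask.
    static String redact(String message) {
        return SECRET.matcher(message).replaceAll("$1=***");
    }

    public static void main(String[] args) {
        System.out.println(redact("login user=bob password=hunter2 ok"));
        // prints: login user=bob password=*** ok
    }
}
```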

凡尘雨 2024-09-12 07:04:05

Along with logging, other techniques include saving request data that you can then feed into your own "identical" system later. This could be as simple as saving every HTTP request you receive to a file for later analysis. Right now you are likely logging much of this information (notably the URL for GETs); you just need to add headers and request bodies to the mix as well.
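
A minimal sketch of the save-to-file idea, independent of any web framework. The class name `RequestRecorder` and the record layout are assumptions, and in a real service the method, URL, headers, and body would come from whatever request object your framework hands you. Captured requests can contain sensitive data, so the file needs the same care as the logs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

final class RequestRecorder {
    private final Path file;

    RequestRecorder(Path file) {
        this.file = file;
    }

    // Append one request as a plain-text record so it can be replayed later.
    synchronized void record(String method, String url,
                             Map<String, String> headers, String body) {
        StringBuilder sb = new StringBuilder();
        sb.append(method).append(' ').append(url).append('\n');
        headers.forEach((k, v) -> sb.append(k).append(": ").append(v).append('\n'));
        sb.append('\n').append(body).append("\n---\n");
        try {
            Files.write(file, sb.toString().getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```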

Adding more detail to error messages is also handy. For example, when you get an exception from a routine, you can add the parameters that were used in that call to the exception's error message. Or, at least, global state information (who was logged in, what high-level module they were in, what high-level function they were calling, etc.).
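
For instance, a routine can catch the low-level exception and rethrow it with its own arguments attached, keeping the original as the cause. The method `parseQuantity` and its parameters are invented for this sketch.

```java
final class ContextualErrors {
    private ContextualErrors() {}

    // Hypothetical routine: parses a quantity that belongs to an order.
    static int parseQuantity(String orderId, String rawQuantity) {
        try {
            return Integer.parseInt(rawQuantity.trim());
        } catch (RuntimeException e) {
            // Rethrow with the call's arguments in the message;
            // the original exception is preserved as the cause.
            throw new IllegalArgumentException(
                    "parseQuantity failed for orderId=" + orderId
                            + ", rawQuantity=" + rawQuantity, e);
        }
    }
}
```

The stack trace in the production log then says which order and which input triggered the failure, instead of only a bare `NumberFormatException`.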

橘味果▽酱 2024-09-12 07:04:05

Some advice:

  • Be prepared for the bug to have multiple causes, so try not to narrow your search to just one.
  • Use an unhandled-error handler that keeps track of errors and aggregates similar defects (Graylog, ELMAH).
  • Consider post-mortem debugging with mini-dump files.
  • Set a fixed time frame for the quick-and-dirty approach, then switch to a systematic approach.
  • Try reviewing the defective module's code with one of your colleagues. A fresh view could be helpful.
  • Divide and conquer using your version control system (Git, SVN).
  • Be careful with fixes, because around 4% of all fixes end up introducing new bugs.
  • Don't let the pressure to quickly fix a bug in production make you skip your standard quality control procedures (e.g., code reviews).
  • After fixing, make sure you have written automated tests in case the bug comes back some time later.
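
To illustrate the unhandled-error handler from the list above, here is a hand-rolled sketch using the JDK's `Thread.UncaughtExceptionHandler`. It is a stand-in for a real aggregation tool, not a replacement for one; the class name `ErrorAggregator` and the fingerprint rule (exception type plus top stack frame) are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ErrorAggregator implements Thread.UncaughtExceptionHandler {
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    // Group errors by exception type plus the method at the top of the stack.
    static String fingerprint(Throwable t) {
        StackTraceElement[] stack = t.getStackTrace();
        String where = stack.length > 0
                ? stack[0].getClassName() + "." + stack[0].getMethodName()
                : "unknown";
        return t.getClass().getName() + "@" + where;
    }

    // Count one occurrence and return how often this fingerprint was seen.
    int record(Throwable t) {
        return counts.computeIfAbsent(fingerprint(t), k -> new AtomicInteger())
                .incrementAndGet();
    }

    @Override
    public void uncaughtException(Thread thread, Throwable t) {
        int seen = record(t);
        System.err.println("uncaught #" + seen + " " + fingerprint(t)
                + ": " + t.getMessage());
    }

    // Call once at startup to catch exceptions no other code handled.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new ErrorAggregator());
    }
}
```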