How do you fix a bug you can't replicate?
The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, and the bug cannot be repeated no matter how hard you try, how do you fix it? Can you even fix it at all?
I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?
Edit:
I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?
Modify the code where you think the problem is happening, so that extra debug info is recorded somewhere. When it happens next time, you will have what you need to solve the problem.
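As a rough illustration of recording extra debug info around suspect code, here is a minimal Python sketch. The `trace_calls` decorator, the `suspect` logger name, and the `apply_discount` function are all hypothetical names invented for this example, not part of any real application.

```python
import functools
import logging

logger = logging.getLogger("suspect")  # hypothetical logger name


def trace_calls(func):
    """Log the arguments, the result, and any exception for each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the full traceback, then let the error propagate unchanged.
            logger.exception("%s raised", func.__name__)
            raise
        logger.debug("return %s -> %r", func.__name__, result)
        return result
    return wrapper


@trace_calls
def apply_discount(price, percent):
    # Hypothetical function standing in for the code you suspect.
    return price - price * percent / 100
```

The decorator leaves the behaviour of the wrapped function unchanged, so it can be removed again once the bug is understood.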
If it's a GUI app, it's invaluable to watch the customer generate the error (or try to). They'll no doubt be doing something you'd never have guessed they were doing (not wrongly, just differently).
Otherwise, concentrate your logging in that area. Log most everything (you can pull it out later) and get your app to dump its environment as well. e.g. machine type, VM type, encoding used.
Does your app report a version number, a build number, etc.? You need this to determine precisely which version you're debugging (or not!).
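A minimal sketch of an environment and version dump, written in Python for illustration. `APP_VERSION` and `BUILD_NUMBER` are placeholder values assumed for the example; a real application would fill them in from its build process.

```python
import json
import locale
import platform
import sys

APP_VERSION = "1.4.2"     # hypothetical; normally stamped in by the build
BUILD_NUMBER = "unknown"  # hypothetical placeholder


def environment_snapshot():
    """Collect version and environment details worth attaching to a bug report."""
    return {
        "app_version": APP_VERSION,
        "build": BUILD_NUMBER,
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2))
```

Writing this snapshot to the log at startup means every later log line can be tied to a specific version and machine.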
If you can instrument your app (e.g. by using JMX if you're in the Java world) then instrument the area in question. Store stats e.g. requests+parameters, time made, etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.
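The "last n requests" buffer can be sketched with a bounded deque. This is only an illustration in Python (not JMX); the `RecentRequests` class and its field names are assumptions made for the example.

```python
import collections
import json
import time


class RecentRequests:
    """Keep only the most recent `capacity` requests in memory."""

    def __init__(self, capacity=100):
        # A deque with maxlen silently discards the oldest entry when full.
        self._buffer = collections.deque(maxlen=capacity)

    def record(self, request, params, duration_s):
        self._buffer.append({
            "time": time.time(),
            "request": request,
            "params": params,
            "duration_s": duration_s,
        })

    def dump(self):
        """Return the buffered entries, oldest first, as JSON text."""
        return json.dumps(list(self._buffer), indent=2)
```

When a user reports an issue, calling `dump()` gives the handful of requests that preceded it without the cost of logging everything permanently.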
If you can't replicate it, you may fix it, but can't know that you've fixed it.
I've made my best explanation about how the bug was triggered (even if I didn't know how that situation could come about), fixed that, and made sure that if the bug surfaced again, our notification mechanisms would let a future developer know the things that I wish I had known. In practice, this meant adding log events when the paths which could trigger the bug were crossed, and metrics for related resources were recorded. And, of course, making sure that the tests exercised the code well in general.
Deciding what notifications to add is a feasibility and triage question. So is deciding how much developer time to spend on the bug in the first place. It can't be answered without knowing how important the bug is.
I've had good outcomes (didn't show up again, and the code was better for it), and bad (spent too much time not fixing the problem, whether the bug ended up fixed or not). That's what estimates and issue priorities are for.
Sometimes I just have to sit and study the code until I find the bug. Try to prove that the bug is impossible, and in the process you may figure out where you might be mistaken. If you actually succeed in convincing yourself it's impossible, assume you messed up somewhere.
It may help to add a bunch of error checking and assertions to confirm or deny your beliefs/assumptions. Something may fail that you'd never expect to.
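For example, a function can state each of its assumptions as an assertion, so that a violated belief fails loudly at the point where it breaks. The `transfer` function below is a hypothetical Python example, not code from any real system.

```python
def transfer(balances, source, target, amount):
    """Move `amount` between two accounts, checking each assumption explicitly."""
    # Beliefs about the inputs.
    assert amount > 0, f"amount must be positive, got {amount!r}"
    assert source in balances and target in balances, "unknown account"
    assert source != target, "source and target must differ"
    total_before = sum(balances.values())

    balances[source] -= amount
    balances[target] += amount

    # Beliefs about the result.
    assert balances[source] >= 0, f"{source} overdrawn: {balances[source]}"
    assert sum(balances.values()) == total_before, "money created or destroyed"
    return balances
```

Note that Python strips `assert` statements when run with `-O`, so checks that must survive in production are better written as explicit `if` tests that log or raise.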
It can be difficult, and sometimes nearly impossible. But in my experience, you will sooner or later be able to reproduce and fix the bug if you spend enough time on it (whether that time is worth spending is another matter).
General suggestions that might help in this situation.
Assuming you have already added all the logging that you think would help and it didn't... two things spring to mind:
Work backwards from the reported symptom. Think to yourself: "If I wanted to produce the symptom that was reported, what bit of code would I need to be executing, and how would I get to it, and how would I get to that?" D leads to C leads to B leads to A. Accept that if a bug is not reproducible, then normal methods won't help. I've had to stare at code for many hours with this kind of thought process going on to find some bugs. Usually it turns out to be something really stupid.
Remember Bob's first law of debugging: if you can't find something, it's because you're looking in the wrong place :-)
Discuss the problem, read code, often quite a lot of it. Often we do it in pairs, because you can usually eliminate the possibilities analytically quite quickly.
Think. Hard. Lock yourself away, admit no interruptions.
I once had a bug where the evidence was a hex dump of a corrupt database. The chains of pointers were systematically screwed up. All the user's programs, and our database software, worked faultlessly in testing. I stared at it for a week (it was an important customer), and after eliminating dozens of possible ideas, I realised that the data was spread across two physical files and the corruption occurred where the chains crossed file boundaries. I realized that if a backup/restore operation failed at a critical point, the two files could end up "out of sync", restored to different time points. If you then ran one of the customer's programs on the already-corrupt data, it would produce exactly the knotted chains of pointers I was seeing. I then demonstrated a sequence of events that reproduced the corruption exactly.
Make random changes until something works :-)
Start by looking at what tools you have available to you. For example, crashes on a Windows platform go to WinQual, so if this is your case you now have crash dump information. Do you have static analysis tools that spot potential bugs, runtime analysis tools, profiling tools?
Then look at the input and output. Anything similar about the inputs in situations when users report the error, or anything out of place in the output? Compile a list of reports and look for patterns.
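Looking for patterns across a list of reports can be partly mechanised. The sketch below, in Python, counts which field values are shared by most reports; the `common_factors` function and the report fields (`os`, `browser`, `locale`) are hypothetical and only illustrate the idea.

```python
from collections import Counter


def common_factors(reports, min_share=0.8):
    """Return (field, value, share) for values present in at least min_share of reports."""
    counts = Counter()
    for report in reports:
        for field, value in report.items():
            counts[(field, value)] += 1
    total = len(reports)
    shared = [
        (field, value, n / total)
        for (field, value), n in counts.items()
        if n / total >= min_share
    ]
    # Sort by field name, then by the value's repr so mixed types still compare.
    return sorted(shared, key=lambda item: (item[0], repr(item[1])))


reports = [
    {"os": "Windows 10", "browser": "Firefox", "locale": "tr_TR"},
    {"os": "macOS", "browser": "Chrome", "locale": "tr_TR"},
    {"os": "Windows 10", "browser": "Chrome", "locale": "tr_TR"},
]
print(common_factors(reports))
```

In this made-up data the only factor common to every report is the locale, which would be the first thing to try when attempting to reproduce the bug.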
Finally, as David stated, stare at the code.
There are two types of bugs you can't replicate. The kind you discovered, and the kind someone else discovered.
If you discovered the bug, you should be able to replicate it. If you can't replicate it, then you simply haven't considered all of the contributing factors leading towards the bug. This is why whenever you have a bug, you should document it. Save the log, get a screenshot, etc. If you don't, then how can you even prove the bug really exists? Maybe it's just a false memory?
If someone else discovered a bug, and you can't replicate it, obviously ask them to replicate it. If they can't replicate it, then you try to replicate it. If you can't replicate it quickly, ignore it.
I know that sounds bad, but I think it is justified. The amount of time it will take you to replicate a bug that someone else discovered is very large. If the bug is real, it will happen again naturally. Someone, maybe even you, will stumble across it again. If it is difficult to replicate, then it is also rare, and probably won't cause too much damage if it happens a few more times.
You can be a lot more productive if you spend your time actually working, fixing other bugs and writing new code, than you will be trying to replicate a mystery bug that you can't even guarantee actually exists. Just wait for it to appear again naturally, then you will be able to spend all your time fixing it, rather than wasting your time trying to reveal it.
There are tools like gotomeeting.com that you can use to share a screen with your user and observe the behaviour. There could be many potential problems, such as the amount of software installed on their machine, or some tool or utility conflicting with your program. I believe gotomeeting is not the only solution, but there could be timeout issues or slow internet issues.
Most of the time, I would say, software does not report correct error messages. For example, in Java and C#, track every exception: don't catch them all everywhere, but keep one point where you can catch and log them. UI bugs are difficult to solve unless you use remote desktop tools. And much of the time the bug could even be in third-party software.
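The idea of keeping one point where exceptions are caught and logged can be sketched as follows. The answer refers to Java and C#; this illustration uses Python's `sys.excepthook` instead, and the `app` logger name and `log_uncaught` function are assumptions made for the example.

```python
import logging
import sys

logger = logging.getLogger("app")  # hypothetical logger name


def log_uncaught(exc_type, exc_value, exc_traceback):
    """Single place where otherwise-unhandled exceptions get recorded."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Let Ctrl+C behave normally instead of logging it as a crash.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logger.critical(
        "uncaught exception",
        exc_info=(exc_type, exc_value, exc_traceback),
    )


# Install the hook once, at program start.
sys.excepthook = log_uncaught
```

Code elsewhere can then avoid blanket `except` clauses that swallow errors, because anything unexpected still ends up in the log with its traceback.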
Ask the user to give you remote access to their computer and see everything yourself. Ask the user to make a small video of how they reproduce the bug and send it to you.
Sure, both are not always possible, but if they are, it may clarify some things. The common way of finding bugs is still the same: separating the parts that may cause the bug, trying to understand what's happening, and narrowing the code space that could cause the bug.
If you work on an application of any significant size, you probably have a queue of 1,000 bugs, most of which are definitely reproducible.
Therefore, I'm afraid I'd probably close the bug as WORKSFORME (Bugzilla) and then get on fixing some more tangible bugs. Or doing whatever the project manager decides to do.
Certainly making random changes is a bad idea, even if they're localised, because you risk introducing new bugs.
Language
Different programming languages will have their own flavour of bugs.
C
Adding debug statements can make the problem impossible to duplicate, because the debug statement itself shifts pointers far enough to avoid a SEGFAULT. Bugs like this are known as Heisenbugs. Pointer issues are arduous to track and replicate, but debuggers can help (such as GDB and DDD).
Java
An application that has multiple threads might only show its bugs with a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.
JavaScript
Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might cause incorrect behaviour in another browser. Using third-party libraries that have been rigorously tested by thousands of users can be advantageous to avoid certain obscure bugs.
Environment
Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment. Does the application run:
In what environment does the application produce the problem?
Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.
Networking
As networking is essential to so many applications:
Consistency
Eliminate as many unknowns as possible:
Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to set up the computers. Consistency is key.
Logging
Use liberal amounts of logging to correlate the time events happened. Examine logs for any obvious errors, timing issues, etc.
Hardware
If the software seems okay, consider hardware faults:
And mostly for embedded:
Network vs. Local
What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?
Firmware
In between hardware and software is firmware.
Time and Statistics
Timing issues are difficult to track:
Gather hard numerical data on the problem. A problem that might, at first, appear random, might actually have a pattern.
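As one simple way of turning reports into numbers, the occurrence times can be bucketed by hour of day. This Python sketch assumes the timestamps are available as ISO 8601 strings; the `occurrences_by_hour` function and the sample data are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def occurrences_by_hour(timestamps):
    """Count bug occurrences per hour of day (index 0 to 23)."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [hours.get(hour, 0) for hour in range(24)]


# Hypothetical occurrence times pulled from bug reports.
sample = [
    "2024-03-01T02:05:00",
    "2024-03-02T02:07:30",
    "2024-03-04T02:01:12",
    "2024-03-04T14:45:00",
]
histogram = occurrences_by_hour(sample)
```

A spike in one bucket, as at 02:00 in the invented sample, would suggest looking at whatever else runs at that time, such as a scheduled job or a backup.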
Change Management
Sometimes problems appear after a system upgrade.
Library Management
Different operating systems have different ways of distributing conflicting libraries:
Perform a fresh install of the operating system, and include only the supporting software required for your application.
Java
Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.
Use a library management tool such as Maven or Ivy.
Debugging
Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. Use data that covers known and possible edge cases. Eventually the bug should reappear.
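A minimal sketch of combining a detection check with randomly generated input, in Python. The `find_failing_input` helper, the deliberately buggy `mean` function, and the input generator are all hypothetical; a fixed seed is used so that any failure found can be replayed.

```python
import random


def find_failing_input(func, check, make_input, attempts=10_000, seed=0):
    """Feed random inputs to func until it raises or check() rejects the result.

    Returns (attempt_number, input, description), or None if nothing failed.
    """
    rng = random.Random(seed)  # fixed seed: the same run can be repeated exactly
    for attempt in range(attempts):
        data = make_input(rng)
        try:
            result = func(data)
            ok = check(data, result)
        except Exception as exc:
            return attempt, data, repr(exc)
        if not ok:
            return attempt, data, f"unexpected result {result!r}"
    return None


def mean(values):
    # Deliberately buggy example: fails on an empty list.
    return sum(values) / len(values)


found = find_failing_input(
    mean,
    check=lambda data, result: min(data) <= result <= max(data),
    make_input=lambda rng: [rng.randint(0, 100) for _ in range(rng.randint(0, 5))],
)
print(found)
```

In a real system the return value would feed the notification mechanism (log, e-mail, and so on) rather than a `print`, and the failing input would be saved as a regression test.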
Sleep
It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.
Code Review
Walk through the code, line-by-line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.
Cosmic Radiation
Cosmic rays can flip bits. This is not as big a problem as it was in the past, thanks to modern error checking of memory. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated, due to the randomness of cosmic radiation.
Tools
Sometimes, albeit infrequently, the compiler will introduce a bug, especially for niche tools (e.g. a C micro-controller compiler suffering from a symbol table overflow). Is it possible to use a different compiler? Could any other tool in the tool-chain be introducing issues?