What's the best way of finding a heap corruption that only occurs under performance test?
The software I work on (written in C++) has a heap corruption problem at the moment. Our perf test team keeps getting WER faults when the number of users logged on to the box reaches a certain threshold, but the dumps they've given me just show corruption in innocent areas (for example, when std::string frees its underlying memory).
I've tried using AppVerifier and this did throw up a number of issues, which I've now fixed. However, I'm now in the situation where the testers can load up the machine as much as possible with AppVerifier and have a clean run, but still get heap corruption when running without AppVerifier (I guess because they can get more users on without it). This has meant I've been unable to get a dump which actually shows the problem.
Does anyone have any other ideas for useful techniques or technologies I can use? I've done as much analysis as I can on the heap corruption dumps taken without AppVerifier, but I can't see any common themes. No threads are doing anything interesting at the time of the crash, and the thread which crashes is innocent, which makes me think the corruption occurred some time before.
Comments (3)
The best tool is AppVerifier in combination with gFlags, but there are many other solutions that may help.
For example, in a debug build you can have the CRT run a heap check every 16 malloc, realloc, free, and _msize operations by setting the _CRTDBG_CHECK_EVERY_16_DF flag with _CrtSetDbgFlag.
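A minimal sketch of that setting, assuming an MSVC debug build (the debug heap does not exist in release builds or in other toolchains). The wrapper name enable_periodic_heap_check is hypothetical; _CrtSetDbgFlag and the _CRTDBG_* constants come from <crtdbg.h>. The preprocessor guards turn the function into a no-op wherever the debug CRT is unavailable.

```cpp
#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>
#endif

// Hypothetical helper: asks the MSVC debug CRT to validate the whole heap
// on every 16th allocation-related call. Returns true if the flag was
// applied, false on builds where the debug heap is not available.
bool enable_periodic_heap_check()
{
#if defined(_MSC_VER) && defined(_DEBUG)
    // Read the current flag word without modifying it.
    int flags = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);
    // The upper 16 bits hold the check frequency: clear them, then
    // select "check every 16 operations".
    flags = (flags & 0x0000FFFF) | _CRTDBG_CHECK_EVERY_16_DF;
    // Keep debug-heap bookkeeping switched on.
    flags |= _CRTDBG_ALLOC_MEM_DF;
    _CrtSetDbgFlag(flags);
    return true;
#else
    return false;
#endif
}
```

Call it once at the start of main. Each check walks the entire heap, so expect a noticeable slowdown, which may hide a load-dependent bug in the same way AppVerifier does.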
You have my sympathies: this is a very difficult problem to track down.
As you say, these normally occur some time prior to the crash, generally as the result of a misbehaving write (e.g. writing to deleted memory, running off the end of an array, exceeding the allocated memory in a memcpy, etc.).
In the past (on Linux; I gather you're on Windows) I've used heap-checking tools (Valgrind, Purify, Intel Inspector), but as you've observed, these often affect the performance and thus obscure the bug. (You don't say whether it's a multi-threaded app, or whether it processes a variable dataset such as incoming messages.)
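Where the full tools slow the system too much, a cheaper in-process check on a few suspect buffers is sometimes enough to catch an overrun. The following is only a sketch of the guard-byte technique, which is not something this answer describes: the guarded namespace, its function names, and the constants are all invented for illustration, and the caller is assumed to remember each block's size.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

namespace guarded {

constexpr unsigned char kGuardByte = 0xFD;  // arbitrary fill pattern
constexpr std::size_t kGuardSize = 16;      // bytes of padding after the data

// Allocate `size` usable bytes followed by kGuardSize guard bytes.
void* allocate(std::size_t size)
{
    auto* block = static_cast<unsigned char*>(std::malloc(size + kGuardSize));
    if (block == nullptr) return nullptr;
    std::memset(block + size, kGuardByte, kGuardSize);
    return block;
}

// True if the guard bytes after the `size` usable bytes are untouched.
bool guard_intact(const void* p, std::size_t size)
{
    const auto* tail = static_cast<const unsigned char*>(p) + size;
    for (std::size_t i = 0; i < kGuardSize; ++i) {
        if (tail[i] != kGuardByte) return false;
    }
    return true;
}

// Check the guard, then free. Returns false if an overrun was detected.
bool release(void* p, std::size_t size)
{
    if (p == nullptr) return true;
    const bool ok = guard_intact(p, size);
    std::free(p);
    return ok;
}

} // namespace guarded
```

A failed check at release time still only says that the overrun happened at some point during the block's lifetime, but it names the victim buffer, which is more than the crash dump usually does. Writes that skip past the 16 guard bytes, or that land before the block, go unnoticed.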
I have also overloaded the new and delete operators to detect double deletes, but this is quite a specific situation.
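A minimal sketch of that idea, not the code I actually used: a hypothetical mix-in, DeleteChecked, gives derived classes their own operator new and operator delete. These keep a registry of live blocks, so a second delete of the same pointer is counted instead of being passed to free again. Session is an invented example class.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <unordered_set>

// Hypothetical mix-in: derived classes get class-specific operator
// new/delete that track which blocks are currently live.
class DeleteChecked {
public:
    static void* operator new(std::size_t size)
    {
        void* p = std::malloc(size);
        if (p == nullptr) throw std::bad_alloc();
        try {
            std::lock_guard<std::mutex> lock(registry_mutex());
            live_blocks().insert(p);
        } catch (...) {
            std::free(p);  // do not leak if the registry cannot grow
            throw;
        }
        return p;
    }

    static void operator delete(void* p) noexcept
    {
        if (p == nullptr) return;
        {
            std::lock_guard<std::mutex> lock(registry_mutex());
            if (live_blocks().erase(p) == 0) {
                // Not a live block: a double delete or a foreign pointer.
                ++bad_deletes();
                return;  // never hand it to free() a second time
            }
        }
        std::free(p);
    }

    // Number of deletes that did not match a live block.
    static std::size_t bad_delete_count()
    {
        std::lock_guard<std::mutex> lock(registry_mutex());
        return bad_deletes();
    }

protected:
    ~DeleteChecked() = default;

private:
    static std::mutex& registry_mutex() { static std::mutex m; return m; }
    static std::unordered_set<void*>& live_blocks()
    {
        static std::unordered_set<void*> blocks;
        return blocks;
    }
    static std::size_t& bad_deletes() { static std::size_t n = 0; return n; }
};

// Invented example class that opts in to the check.
struct Session : DeleteChecked {
    int user_id = 0;
};
```

In a real build you would log a stack trace or break into the debugger where the counter is incremented. Because the allocator may reuse an address, a double delete that happens after the block has been reallocated is not caught. The lock and hash lookup on every allocation also make this better suited to a few suspect classes than to the whole process.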
If none of the available tools help, then you're on your own and it's going to be a long debugging process.
The best advice I can offer is to work on reducing the test scenario that reproduces it. Then attempt to reduce the amount of code being exercised, i.e. stub out parts of the functionality. Eventually you'll zero in on the problem, but I've seen very good guys spend 6 weeks or more tracking these down on a large application (~1.5 million LOC).
All the best.
You should elaborate further on what your software actually does. Is it multi-threaded? When you talk about "number of users logged on to the box", does each user open a different instance of your software in a different session? Is your software a web service? Do instances talk to each other (for example, over named pipes)?
If your error ONLY occurs at high load and does NOT occur when AppVerifier is running, the only two possibilities (without more information) that I can think of are a concurrency issue in how you've implemented multi-threading OR a hardware issue on the test machine that only manifests under heavy load (have your testers used more than one machine?).