How do you solve a problem that is unreproducible, random, and whose fixes cannot be tested right away?

Posted on 2024-10-07 19:20:58

Thought I would throw this one out there and see what other people's experiences have been like.

I'm experiencing an issue with a system at work where it stops processing jobs in a queue and 'jams' so to speak. Once the services are restarted the software processes the queue and everything returns to normal.

So far, I cannot for the life of me figure out what is causing these stoppages, and I cannot reproduce a stoppage myself. The queue fails at widely varying intervals, sometimes running for a month straight, other times failing twice in one day. I have since involved two different vendors and various colleagues within the department, and everyone is stumped, and has been for several months.

Since I started, we've isolated the processing to a single server and cranked up the logging, which we've sent to the vendors. Neither has any idea what the problem is.

We've updated a few settings here and there and upgraded client and server pieces, but we have no idea whether the things we are doing are contributing to an overall solution.

So I have a problem that appears to be unreproducible, random and untestable.

Has anyone been involved with any similar situations?
What are some of the ways to solve a situation like this?

Any shared input or experiences would be great.

Cheers,

EDIT: Cranked up the logging, updated all of the components to the latest version, and made sure proper anti-virus exclusions were done. So far it has not jammed in over a month!


Comments (2)

无人接听 2024-10-14 19:20:58

Use a logging framework that can be turned on in production. You might have to log too much initially, but it should help narrow down the problem. As you get closer, you can narrow the scope of the logging and at the same time increase the verbosity (is that a word?) of the remaining log statements.
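As a rough illustration of adjusting verbosity while the service keeps running, here is a minimal sketch using the JDK's built-in `java.util.logging`. The logger name `queue.worker`, the class `QueueLogging`, and the `processJob` method are all made up for the example; the real system's framework and job code are unknown, so treat this as one possible shape rather than the way to do it.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class QueueLogging {
    // Hypothetical logger name; a real system would use its own naming scheme.
    private static final Logger LOG = Logger.getLogger("queue.worker");

    static {
        // The default console handler drops anything below INFO, so attach a
        // handler that passes everything and let the logger level do the filtering.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);
        LOG.addHandler(handler);
        LOG.setUseParentHandlers(false);
        LOG.setLevel(Level.INFO);
    }

    /** Changes verbosity at runtime, without restarting the process. */
    public static void setVerbosity(Level level) {
        LOG.setLevel(level);
    }

    /** True when detailed (FINE) messages would currently be written. */
    public static boolean detailEnabled() {
        return LOG.isLoggable(Level.FINE);
    }

    /** Hypothetical job handler: coarse messages at INFO, detail at FINE. */
    public static void processJob(String jobId) {
        LOG.info("start job " + jobId);
        // The supplier form only builds the message when FINE is enabled.
        LOG.fine(() -> "job " + jobId + " on thread " + Thread.currentThread().getName());
        LOG.info("done job " + jobId);
    }

    public static void main(String[] args) {
        processJob("A");            // INFO lines only
        setVerbosity(Level.FINE);   // turn detail on while running
        processJob("B");            // INFO and FINE lines
    }
}
```

In a real deployment the call to `setVerbosity` would be triggered by something external (a config reload, a JMX operation, an admin endpoint), which is the part that depends on your setup.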

狼性发作 2024-10-14 19:20:58

In addition to the logging pointed out by Kelly, there is the possibility of a deadlock taking place, since things seem to stop. If this is a Java application, one option is to use jconsole and connect to the JVM instance. jconsole has a detect deadlock option which can provide very valuable information when the hang occurs.
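If attaching jconsole at the moment of the hang is not practical, the JVM also exposes deadlock detection programmatically through the standard `ThreadMXBean`. Below is a minimal sketch, assuming you can run a small check inside the affected process (for example from a scheduled task); the class name `DeadlockProbe` is made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockProbe {
    /** Returns one description per deadlocked thread; empty when none are found. */
    public static List<String> findDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Covers both object monitors and java.util.concurrent locks;
        // returns null when no deadlock is detected.
        long[] ids = threads.findDeadlockedThreads();
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        // Request locked monitors and synchronizers so the report shows who holds what.
        for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            if (info != null) { // a thread may have exited in the meantime
                report.add(info.toString());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = findDeadlocks();
        System.out.println(report.isEmpty() ? "no deadlock detected" : String.join("\n", report));
    }
}
```

Running `findDeadlocks()` periodically and writing any non-empty result to the log would capture the evidence even if nobody is watching when the queue jams. Note that this only finds true lock cycles; a thread stuck waiting on I/O or an external resource will not show up here and needs a thread dump instead.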

If this is not a Java application but, say, a .NET application, you could make use of this technique.
