When a reported software problem is not really a software problem
Apologies if this has already been covered or you think it really belongs on wiki.
I am a software developer at a company that manufactures microarray printing machines for the biosciences industry. I am primarily involved in interfacing with various bits of hardware (pneumatics, hydraulics, stepper motors, sensors etc) via GUI development in C++ to aspirate and print samples onto microarray slides.
On joining the company I noticed that whenever there was a hardware-related problem, the whole setup would freeze, with nobody any the wiser as to what the specific problem was: hardware, software, misuse, etc. Since then I have improved things somewhat by introducing software timeouts and exception handling to better identify and deal with any hardware-related problems that arise, e.g. PLC commands not successfully completed, inappropriate FPGA response commands, and various other deadlock-type conditions. In addition, the software will now log a summary of the specific problem, inform the user and exit the thread gracefully. This software is not embedded; it just interfaces with the hardware over serial ports.
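A minimal sketch of the kind of software timeout described above, not the actual code from this system. The names `HardwareTimeout`, `waitForResponse` and `tryRead` are hypothetical; `tryRead` stands in for whatever non-blocking read the serial layer provides.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical exception type: thrown when the hardware stays silent.
class HardwareTimeout : public std::runtime_error {
public:
    explicit HardwareTimeout(const std::string& command)
        : std::runtime_error("HARDWARE: no response to '" + command +
                             "' before timeout") {}
};

// Polls tryRead until it yields a response or the timeout elapses.
// tryRead is an assumed non-blocking read that returns std::nullopt
// while no complete response has arrived yet.
std::string waitForResponse(
    const std::string& command,
    const std::function<std::optional<std::string>()>& tryRead,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds pollInterval = std::chrono::milliseconds(5)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (auto response = tryRead()) {
            return *response;  // hardware answered in time
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            throw HardwareTimeout(command);  // caller logs and exits cleanly
        }
        std::this_thread::sleep_for(pollInterval);
    }
}
```

The worker thread can catch `HardwareTimeout` at its top level, log the message, inform the user and return, instead of blocking forever on a read that never completes.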
In spite of what has been achieved, non-software guys still do not fully appreciate that in these cases the 'software' problem they are reporting to me is not really a software problem: the software is reporting a problem, not causing it. Don't get me wrong, there is nothing I enjoy more than coming down on software bugs like a ton of bricks and looking at ways of improving robustness. I know the system well enough now that I almost have a sixth sense for these things.
No matter how many times I try to explain this, nothing really penetrates. They still report what are essentially hardware problems (which eventually get fixed) as software ones.
I would like to hear from any others that have endured similar finger-pointing experiences and what methods they used to deal with them.
UPDATE
Some great responses here that pretty much sing from the same hymn sheet: be more descriptive. I guess identifying the command and bombing out cleanly when the hardware fails was the first stage, but was still not quite enough. The next stage will be to map what are to the layman fairly meaningless PLC commands to something more suggestive. "PLC Command M71 timeout" becomes "Failure to initialize syringe system. Check adequate vacuum reached" and so on...
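The mapping could be as simple as a lookup table. This is only a sketch: `OperatorHint` and `describePlcTimeout` are made-up names, and the M71 entry is the only one taken from this post; the rest would have to come from the machine's own PLC documentation.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical record pairing what failed with what to check first.
struct OperatorHint {
    std::string summary;
    std::string action;
};

// Turns a raw PLC command code into a message an operator can act on.
// The raw code is kept in parentheses so the log stays searchable.
std::string describePlcTimeout(const std::string& plcCommand) {
    static const std::unordered_map<std::string, OperatorHint> hints = {
        {"M71", {"Failure to initialize syringe system",
                 "Check adequate vacuum reached"}},
    };
    const auto it = hints.find(plcCommand);
    if (it == hints.end()) {
        // Unknown codes still say clearly that the hardware went quiet.
        return "Hardware did not respond (PLC command " + plcCommand +
               " timeout)";
    }
    return it->second.summary + ". " + it->second.action +
           " (PLC command " + plcCommand + " timeout)";
}
```

Keeping the table in one place means the hardware engineers can review and extend the wording without touching the communication code.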
Comments (8)
Perhaps when reporting the problem, either as a message to the user or as an entry in the log file, you need to make it explicitly clear that it's the hardware that's at fault.
Unfortunately, because it's the software that people see and interact with they assume that the software is all that there is.
You could try labeling the error messages as "HARDWARE PROBLEM". Might get your point across.
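One way to do the labeling, as a rough sketch. The enum `FaultOrigin` and the functions `originLabel` and `formatFault` are invented for illustration; the three categories follow the hardware / software / misuse split from the question.

```cpp
#include <string>

// Hypothetical classification of where a fault originated.
enum class FaultOrigin { Hardware, Software, Misuse };

// Returns the label shown in front of every error message.
std::string originLabel(FaultOrigin origin) {
    switch (origin) {
        case FaultOrigin::Hardware: return "HARDWARE PROBLEM";
        case FaultOrigin::Software: return "SOFTWARE PROBLEM";
        case FaultOrigin::Misuse:   return "OPERATOR ACTION NEEDED";
    }
    return "UNKNOWN PROBLEM";  // unreachable for valid enum values
}

// Builds a message that names the origin and the component up front.
std::string formatFault(FaultOrigin origin, const std::string& component,
                        const std::string& detail) {
    return originLabel(origin) + " [" + component + "]: " + detail;
}
```

With this, `formatFault(FaultOrigin::Hardware, "stepper motor", "not responding")` gives a message that starts with the origin, so the bug report lands with the right team.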
There's no such thing as a non-software problem in a system. Software is the boss, and the boss cannot blame its tools for a failure.
If the underlying hardware is malfunctioning, the software should report to the user exactly what went wrong with which component. If it doesn't, it is a software problem.
For example, a TCP disconnection means it has to reconnect. If it's an FPGA response, it should tell the user exactly what the inputs and outputs were, and who is to blame. If not, this is a software problem.
"If what you're doing isn't working, stop doing it and try something else"
As pointed out in other comments, it's a communication problem and, to a lesser extent, a perception problem. People will blame what they don't understand FAR more easily, to make themselves feel like a victim. A motor could be sparking, throwing fire and exploding because someone grossly overloaded a feeder (with EVERY warning not to plastered all over it), but if that software stops responding, guess what caused the problem?
Since giving every one of your users an EE and CS class or 10 is completely out of the question, fall back on good ole communication. The basis of it is 4 things (mostly my opinion), in no particular order: what you observe, what you feel, what you think and what should be done. So I'll put the idea into practice in this response.
It seems like your users like to blame software when some of the underlying hardware is the key issue (observe). Trying to explain this to the users is impractical and a waste of time; it's not their job and most of them won't care (feel). What you may want to try is talking with the engineering team about the parts they're using and looking into things that work better with software in general. Maybe there are some constraints on the inputs that were never considered? (think) Changing out the hardware, or just understanding it better, might be the real answer, along with more targeted errors and feedback to those users (done).
I agree with the other posters, but I wanted to add another perspective: It could be worse. They could be attempting to solve the hardware problems for days or weeks, and then find out later, when everyone is under the gun and has been going crazy about it not getting fixed, that they were addressing the wrong problem and it was, in fact, a software problem. So count your blessings. If they always classify it as a software problem, at least you know about it. Only then can you troubleshoot, maybe put in additional problem-solving or problem-identifying code, and make the system a tiny bit better.
Also, this is pretty much the same thing every software developer everywhere has faced, except usually it is the software versus the user, not the software versus the hardware. And in that case, it appears there is no known solution. There are lots of ways to address the problem, but no way to fix it. Thus the ever-growing list of acronyms describing how to blame the user without being rude: ID-ten-T error, PICNIC, PEBKAC, etc.
Who is it who's reporting the problems?
If it's the end users, I think this is a non-issue. They just know that what they're trying to do is not working. It's not the user's responsibility to diagnose the problem. All they know is, "I tried to do X, Y should have happened, but instead Z happened." Everything beyond that is your problem.
If the hardware folks are insisting that the problem is in the software and the software folks are insisting that the problem is in the hardware, then you need to enhance the software to diagnose errors more precisely, as ChrisF and others have noted.
If the higher-ups are blaming the software group for problems that are the responsibility of the hardware group and you're sick of taking the blame for other people's mistakes, okay, I understand that. Again, as the software guy, you have the power to create more precise error messages. If you can explicitly say, "Stepper motor not responding" or whatever, then you have the "moral authority" to insist that someone run diagnostics on the stepper motor. Just saying, "I'm pretty sure it's a hardware problem" isn't going to win an argument.
Test-oriented development (which does not necessarily mean 'test-driven') is what you should resort to.
Basically, every subsystem should have a reasonably thorough set of unit tests to identify problems before integration. Every time a problem occurs, test the hardware so you can know for sure (or almost for sure) whether it is a hardware problem. This means the hardware must be designed in such a way that it can be thoroughly tested.
I was an integration head for my college robot team and this tactic helped a lot.
Hope this helps.
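A bare-bones sketch of what such a hardware check could look like in software. `SubsystemCheck` and `failedSubsystems` are illustrative names only, and each `probe` is an assumed callback that exercises one piece of hardware and reports whether it behaved.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical pairing of a subsystem name with a probe that exercises it.
struct SubsystemCheck {
    std::string name;
    std::function<bool()> probe;  // true when the subsystem responds correctly
};

// Runs every probe and returns the names of the subsystems that failed.
// A probe that throws is counted as a failure rather than aborting the run.
std::vector<std::string> failedSubsystems(
    const std::vector<SubsystemCheck>& checks) {
    std::vector<std::string> failed;
    for (const auto& check : checks) {
        bool ok = false;
        try {
            ok = check.probe();
        } catch (...) {
            ok = false;
        }
        if (!ok) {
            failed.push_back(check.name);
        }
    }
    return failed;
}
```

Running this right after a fault gives you a list of misbehaving subsystems to attach to the report, which is much harder to argue with than "I think it's the hardware".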
First, make sure your users are more likely to read and understand your error messages. Displaying "FPGA command GS_WIDGIT_FROB returned invalid response 0xFF45001C. Shutting down controller id 576D. (Error 1Xf)" might be great for you. But the user is likely to hit "Ok" without reading it. Even if they do read it, it tells them nothing useful. Either way, you're getting a phone call. Display "Widgit Frobber requires maintenance", but still log all the heavy details somewhere, and you're likely to get fewer calls.
Second, you know it's a hardware problem so do something about it! Have your software email hardware support, or whatever it takes to get the problem fixed. If the user is forced to decide what action to take to fix it, you can bet they'll get it wrong at least some of the time. If the user sees "Widgit Frobber requires maintenance. Hardware support has been notified (ticket #234)" they know that they don't have to do a thing.