A good idea where practicable. Unfortunately, it is usually prohibitively difficult to keep track of the entire history of the state of the machine. You just can't tag each data structure with where you got it from, and the entire state of that object. You might be able to store just the external events and in that way reproduce where everything came from.
Some examples:
I did work on a project where it was practicable and it helped immensely. When we were getting close to shipping, and running out of bugs to fix, we would have our game play in "zero players mode", where the computer would repeatedly play itself all night long with all variations of characters and locales. If it asserted, it would display the random key that started the match. When we came to work in the morning we'd write the key down from our screen (there usually was one) and start it again using that key. Then we'd just watch it until the assert came up, and track it down. The important thing is that we could recreate all the original inputs that led to the error, and rerun it as many times as we wanted, even after recompiles (within limits... the number of fetches from the random number generator could not be changed, although we had a separate RNG for non-game stuff like visual fx). This only worked because each match started after a warm reboot and took only a very small amount of data as input.
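The seed-as-replay-key idea above can be sketched in a few lines. This is a toy illustration, not the original project's code: `Match` and its method names are made up, but the property it demonstrates is the real one - if every gameplay decision comes from one seeded RNG, the displayed key is enough to re-create the whole match.

```python
import random

class Match:
    """Toy stand-in for a game match driven entirely by one RNG seed."""

    def __init__(self, seed):
        self.seed = seed
        # Gameplay RNG: on replay, the number and order of fetches must be
        # identical, which is why recompiles couldn't change fetch counts.
        # Non-game effects (visual fx) would use a separate RNG instance.
        self.rng = random.Random(seed)

    def play(self, turns=100):
        """Run the match; return the sequence of 'moves' it produced."""
        return [self.rng.randrange(6) for _ in range(turns)]

# The key shown when an assert fires is just the seed: feeding it back in
# reproduces every input that led to the failure, as many times as needed.
key = 12345
assert Match(key).play() == Match(key).play()
```

The warm-reboot detail matters: the seed only determines everything because the match starts from a known state with no other inputs.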
I have heard that Bungie used a similar method to try to discover bad geometry in their Halo levels. They would set the dev kits running overnight in a special mode where the indestructible protagonist would move and jump randomly. In the morning they'd look and see if he got stuck in the geometry at some location where he couldn't get out. There may have been grenades involved, too.
On another project we actually logged all user interaction with a timestamp so we could replay it. That works great if you can, but most people have interactions with a changing DB whose entire state might not be stored so easily.
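A minimal sketch of that timestamped record-and-replay approach (the function names are illustrative, not from the project). It also shows why the DB caveat bites: replay is only faithful if the handlers are deterministic against the same starting state.

```python
import time

event_log = []

def record(event, payload):
    """Append a timestamped user interaction so the session can be replayed."""
    event_log.append({"t": time.time(), "event": event, "payload": payload})

def replay(log, handlers):
    """Re-dispatch logged events in order; the timestamps could drive
    real-time pacing. Faithful replay requires deterministic handlers
    and the same starting state (the changing-DB problem)."""
    for entry in log:
        handlers[entry["event"]](entry["payload"])

state = []
record("add", {"item": "widget"})
record("add", {"item": "gadget"})
replay(event_log, {"add": lambda p: state.append(p["item"])})
# state is now rebuilt purely from the log: ["widget", "gadget"]
```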
It's less vital with software. If something goes wrong in software, you can usually reproduce the fault and analyse it in captivity. Even if it only happens 1 time in 1000, you can often switch on all the logging and run it 1000 times (a simple soak test).
That's much more expensive and time-consuming on a manufacturing line, to the point of being impossible.
Having as much information available as possible the first time it goes wrong is no bad thing, but it's not as important to me as it is to Toyota.
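The soak-test idea is just brute-force repetition with capture switched on. A trivial sketch, with a deterministic stand-in for the 1-in-1000 fault so the outcome is predictable:

```python
def flaky(run):
    """Stand-in for an operation that fails roughly 1 time in 1000."""
    if run % 1000 == 999:
        raise RuntimeError("rare fault")

failures = []
for run in range(1000):
    try:
        flaky(run)
    except RuntimeError as exc:
        # With full logging switched on for the soak run, every failure
        # is captured with its context instead of vanishing in the field.
        failures.append((run, str(exc)))

# → exactly one failure caught, on run 999
```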
This is a good approach. But be aware that you shouldn't overdo logging. Otherwise you can't find the interesting information in all the noise, and it reduces overall performance (e.g. through anonymous object creation, depending on the language).
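The object-creation cost mentioned above is easy to avoid by guarding expensive message construction behind a level check. A small Python sketch (the function names are mine) using the stdlib `logging` module's `isEnabledFor`:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("app")

def expensive_summary(items):
    """Stands in for costly message construction (sorting, joining, allocation)."""
    return ", ".join(sorted(items))

def process(items):
    # Guard the expensive call so it only runs when DEBUG is actually
    # enabled -- otherwise the temporary objects are pure overhead.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("processing: %s", expensive_summary(items))
    return len(items)
```

Using `%s` with arguments (rather than pre-formatting the string) defers even the cheap formatting until a handler actually emits the record.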
Producing error messages with a full stack trace is usually bad security practice. On the other hand, and more in line with Toyota's intent, every developed module should be traceable back to the original programmer(s) - and they should be held accountable for shoddy work, bug fixes, security vulnerabilities, etc. Not for disciplinary purposes, but for maintenance, and for education if necessary. And maybe for bonuses, in the contrary situation... ;-)
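One common way to get both properties - no stack trace leaked to the user, full trace kept for whoever owns the module - is a generic message plus a correlation id. A hedged sketch (names and message format are my own, not a prescribed pattern):

```python
import logging
import traceback
import uuid

log = logging.getLogger("app.internal")

def handle_request(action):
    """Run an action; on failure, log the full stack trace internally and
    return only a generic message plus a correlation id to the caller."""
    try:
        return action()
    except Exception:
        ref = uuid.uuid4().hex[:8]  # lets support locate the real trace later
        log.error("request %s failed:\n%s", ref, traceback.format_exc())
        return f"An internal error occurred (ref {ref})."
```

The user sees nothing exploitable, while the internal log still carries enough to trace the failure back to the responsible code.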