Cosmic rays: what is the probability they will affect a program?
Once again I was in a design review, and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that probability is.
"Since 2-128 is 1 out of 340282366920938463463374607431768211456, I think we're justified in taking our chances here, even if these computations are off by a factor of a few billion... We're way more at risk for cosmic rays to screw us up, I believe."
Is this programmer correct?
What is the probability of a cosmic ray hitting a computer and affecting the execution of the program?
Comments (15)
Wikipedia cites a study by IBM in the 90s suggesting that "computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month." Unfortunately the citation was to an article in Scientific American, which didn't give any further references. Personally, I find that number to be very high, but perhaps most memory errors induced by cosmic rays don't cause any actual or noticeable problems.
On the other hand, people talking about probabilities when it comes to software scenarios typically have no clue what they are talking about.
Well, cosmic rays apparently caused the electronics in Toyota cars to malfunction, so I would say that the probability is very high :)
Are cosmic rays really causing Toyota woes?
With ECC you can correct the 1-bit errors caused by cosmic rays. To handle the roughly 10% of cases where a cosmic ray produces a 2-bit error, ECC cells are typically interleaved across chips so that no two cells of the same word are next to each other. A cosmic ray event that affects two adjacent cells therefore results in two correctable 1-bit errors.
Sun states (Part No. 816-5053-10, April 2002):
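To make the single-bit correction concrete, here is a toy Hamming(7,4) encoder/decoder in C. Real DRAM ECC uses wider SECDED codes (e.g. 8 check bits over a 64-bit word), so this is only an illustration of the principle, not what a memory controller actually implements:

```c
#include <stdio.h>
#include <stdint.h>

/* Toy Hamming(7,4): 4 data bits, 3 parity bits, corrects any single
 * flipped bit. Bit i of the codeword is "position" i+1 (1..7). */
static uint8_t get(uint8_t c, int pos) { return (c >> (pos - 1)) & 1; }

static uint8_t ham_encode(uint8_t data)
{
    uint8_t d1 = data & 1, d2 = (data >> 1) & 1,
            d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
    uint8_t p4 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
    /* codeword layout, positions 1..7: p1 p2 d1 p4 d2 d3 d4 */
    return p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

static uint8_t ham_decode(uint8_t cw)
{
    int s = 0;                   /* syndrome = position of the flipped bit, 0 if none */
    if (get(cw,1) ^ get(cw,3) ^ get(cw,5) ^ get(cw,7)) s |= 1;
    if (get(cw,2) ^ get(cw,3) ^ get(cw,6) ^ get(cw,7)) s |= 2;
    if (get(cw,4) ^ get(cw,5) ^ get(cw,6) ^ get(cw,7)) s |= 4;
    if (s) cw ^= 1 << (s - 1);   /* correct the single-bit error */
    return get(cw,3) | (get(cw,5) << 1) | (get(cw,6) << 2) | (get(cw,7) << 3);
}

int main(void)
{
    for (uint8_t d = 0; d < 16; d++)
        for (int bit = 0; bit < 7; bit++)      /* simulate every possible single upset */
            if (ham_decode(ham_encode(d) ^ (1 << bit)) != d)
                printf("correction failed\n");
    printf("all single-bit upsets corrected\n");
    return 0;
}
```

Interleaving then ensures that a strike hitting two physically adjacent cells lands in two different codewords, so each codeword still sees only a single, correctable error.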
Memory errors are real, and ECC memory does help. Correctly implemented ECC memory will correct single-bit errors and detect double-bit errors (halting the system if such an error is detected). You can see this from how regularly people complain about what seems to be a software problem that is resolved by running Memtest86 and discovering bad memory. Of course a transient failure caused by a cosmic ray is different from a consistently failing piece of memory, but it is relevant to the broader question of how much you should trust your memory to operate correctly.
An analysis based on a 20 MB resident size might be appropriate for trivial applications, but large systems routinely have multiple servers with large main memories.
Interesting link: http://cr.yp.to/hardware/ecc.html
The Corsair link in the page unfortunately seems to be dead, so view the Corsair link here instead.
This is a real issue, and that is why ECC memory is used in servers and embedded systems. And why flying systems are different from ground-based ones.
For example, note that Intel parts destined for "embedded" applications tend to add ECC to the spec sheet. A Bay Trail for a tablet lacks it, since ECC would make the memory a bit more expensive and possibly slower. And if a tablet crashes a program once in a blue moon, the user does not care much; the software itself is far less reliable than the hardware anyway. But for SKUs intended for industrial machinery and automotive use, ECC is mandatory, since there we expect the software to be far more reliable, and errors from random upsets would be a real issue.
Systems certified to IEC 61508 and similar standards usually have both boot-up tests that check that all RAM is functional (no bits stuck at zero or one) and runtime error handling that tries to recover from errors detected by ECC, and often also memory-scrubber tasks that continuously walk through memory, reading and writing it, to make sure that any errors that occur get noticed.
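As a rough idea of what such a scrubber task looks like, here is a minimal sketch in C. It assumes a platform where an ordinary read is enough to make the ECC logic check and correct a word, and it writes the value back so the corrected bits land in DRAM; the real mechanism (and the required barriers) is chip-specific, and the function and region names here are made up:

```c
#include <stdint.h>
#include <stddef.h>

#define SCRUB_CHUNK_WORDS 1024   /* words scrubbed per invocation */

static volatile uint32_t *scrub_start, *scrub_end, *scrub_pos;

void scrub_init(volatile uint32_t *start, volatile uint32_t *end)
{
    scrub_start = start;
    scrub_end   = end;
    scrub_pos   = start;
}

/* Called periodically from a low-priority task: walks one chunk of RAM
 * so that latent single-bit errors get found and corrected before a
 * second hit in the same word turns them into uncorrectable ones. */
void scrub_chunk(void)
{
    for (size_t i = 0; i < SCRUB_CHUNK_WORDS; i++) {
        uint32_t v = *scrub_pos;  /* read: triggers ECC check/correct */
        *scrub_pos = v;           /* write back the (corrected) value */
        if (++scrub_pos == scrub_end)
            scrub_pos = scrub_start;  /* wrap: one full pass completed */
    }
}
```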
But for mainstream PC software? Not a big deal. For a long-lived server? Use ECC and a fault handler. If an uncorrectable error kills the kernel, so be it. Or you go paranoid and use a redundant system with lock-step execution so that if one core gets corrupted, the other one can take over while the first core reboots.
If a program is life-critical (it will kill someone if it fails), it needs to be written in such a way that it will either fail-safe, or recover automatically from such a failure. All other programs, YMMV.
Toyotas are a case in point. Say what you will about a throttle cable, but it is not software.
See also http://en.wikipedia.org/wiki/Therac-25
"cosmic ray events" are considered to have a uniform distribution in many of the answers here, this may not always be true (i.e. supernovas). Although "cosmic rays" by definition (at least according to Wikipedia) comes from outer space, I think it's fair to also include local solar storms (aka coronal mass ejection) under the same umbrella. I believe it could cause several bits to flip within a short timespan, potentially enough to corrupt even ECC-enabled memory.
It's well known that solar storms can cause considerable havoc with electrical systems (like the Quebec power outage in March 1989). It's quite likely that computer systems can also be affected.
Some 10 years ago I was sitting right next to another guy, each of us with our own laptop, during a period of quite "stormy" solar weather (sitting in the arctic, we could observe this indirectly - lots of aurora borealis to be seen). Suddenly - in the very same instant - both our laptops crashed. He was running OS X, and I was running Linux. Neither of us was used to laptops crashing - it's quite a rare thing on Linux and OS X. Common software bugs can more or less be ruled out since we were running different OSes (and it didn't happen during a leap second). I've come to attribute that event to "cosmic radiation".
Since then, "cosmic radiation" has become an internal joke at my workplace. Whenever something happens with our servers and we cannot find any explanation for it, we jokingly attribute the fault to "cosmic radiation". :-)
I once programmed devices which were to fly in space, and back then you could (supposedly - no one ever showed me a paper about it, but it was said to be common knowledge in the business) expect cosmic rays to induce errors all the time.
More often, noise can corrupt data. Checksums are used to combat this on many levels; in a data cable there is typically a parity bit that travels alongside the data, which greatly reduces the probability of corruption. Then, at the parsing level, nonsense data is typically ignored, so even if some corruption did get past the parity bit or other checksums, it would in most cases be ignored.
Also, some components are electrically shielded to block out noise (probably not cosmic rays I guess).
But in the end, as the other answerers have said, there is the occasional bit or byte that gets scrambled, and it's left up to chance whether that's a significant byte or not. Best case scenario, a cosmic ray scrambles one of the empty bits and has absolutely no effect, or crashes the computer (this is a good thing, because the computer is kept from doing harm); but worst case, well, I'm sure you can imagine.
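The parity bit mentioned above is just the XOR of the data bits: it catches any single flipped bit, though two flips cancel out, which is why stronger checksums sit above it in the stack. A minimal illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* Even parity over 8 data bits: the parity bit makes the total number
 * of 1-bits even. Any single flip (data or parity) is detected; any
 * two flips cancel out and slip through, so higher layers need CRCs
 * or other checksums on top. */
static uint8_t parity_bit(uint8_t byte)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (byte >> i) & 1;
    return p;                          /* 1 if the byte has an odd number of 1-bits */
}

int main(void)
{
    uint8_t data = 0x5A;
    uint8_t sent_parity = parity_bit(data);

    uint8_t corrupted = data ^ 0x10;   /* simulate one bit flipped in transit */
    if (parity_bit(corrupted) != sent_parity)
        printf("single-bit corruption detected\n");
    return 0;
}
```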
I have experienced this - it's not rare for cosmic rays to flip one bit, but it's very unlikely that a person observes it.
I was working on a compression tool for an installer in 2004. My test data was a set of Adobe installation files, about 500 MB or more when decompressed.
After a tedious compression run, and a decompression run to test integrity, FC /B showed one byte different.
Within that one byte the MSB had flipped.
I also flipped, worrying that I had a crazy bug that would only surface under very specific conditions - I didn't even know where to start looking.
But something told me to run the test again. I ran it and it passed. I set up a script to run the test 5 times overnight. In the morning all 5 had passed.
So that was definitely a cosmic ray bit flip.
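For anyone who wants to reproduce this kind of check, a byte-wise compare in the spirit of FC /B boils down to something like the following (a simplified sketch, not the actual tool); printing the XOR of the differing bytes makes a single flipped bit, such as an MSB flip (XOR 0x80), easy to spot:

```c
#include <stdio.h>

/* Simplified byte-wise file compare, roughly what "FC /B" does:
 * report the offset and values of each differing byte, plus their
 * XOR so a single-bit difference is immediately visible. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 2;
    }
    FILE *a = fopen(argv[1], "rb"), *b = fopen(argv[2], "rb");
    if (!a || !b) { perror("fopen"); return 2; }

    unsigned long offset = 0;
    int ca, cb;
    while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF) {
        if (ca != cb)
            printf("offset %08lX: %02X vs %02X (xor %02X)\n",
                   offset, ca, cb, ca ^ cb);
        offset++;
    }
    fclose(a);
    fclose(b);
    return 0;
}
```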
As a data point, this just happened on our build:
That looks very strongly like a bit flip happening during a compile, in a very significant place in a source file by chance.
I'm not necessarily saying this was a "cosmic ray", but the symptom matches.
You might want to have a look at Fault Tolerant hardware as well.
For example, Stratus Technology builds Wintel servers called ftServer which have 2 or 3 "mainboards" running in lock-step, comparing the results of the computations. (This is also sometimes done in space vehicles.)
The Stratus servers evolved from a custom chipset to lockstep on the backplane.
A very similar (but software) system is the VMWare Fault Tolerance lockstep based on the Hypervisor.
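The lockstep idea can be approximated in software with triple modular redundancy: run the computation three times (ideally on independent cores or machines) and take the majority vote. A toy sketch, with the fail-safe policy left as an assumption:

```c
#include <stdio.h>
#include <stdint.h>

/* Toy triple-modular-redundancy voter. Lockstep hardware compares
 * results cycle by cycle; this software analogue just runs the same
 * computation three times and takes the majority result. */
typedef uint64_t (*computation_fn)(uint64_t);

static uint64_t tmr_run(computation_fn f, uint64_t input)
{
    uint64_t a = f(input), b = f(input), c = f(input);
    if (a == b || a == c) return a;   /* a agrees with another run */
    if (b == c) return b;             /* the first run was the corrupted one */
    /* All three disagree: no majority. What to do here is a policy
     * decision (abort, retry, fail over); failing safe is one option. */
    fprintf(stderr, "TMR: no majority, failing safe\n");
    return 0;
}

static uint64_t square(uint64_t x) { return x * x; }

int main(void)
{
    printf("%llu\n", (unsigned long long)tmr_run(square, 12345));
    return 0;
}
```

Note that three runs on the same core share the same RAM and registers, so this only approximates what Stratus-style hardware achieves with physically separate boards.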
From Wikipedia:
This means a probability of 3.7 × 10^-9 per byte per month, or 1.4 × 10^-15 per byte per second. If your program runs for 1 minute and occupies 20 MB of RAM, then the failure probability would be roughly 1.4 × 10^-15 × 60 s × (20 × 1024² bytes) ≈ 1.8 × 10^-6.
Error checking can help to reduce the aftermath of a failure. Also, as Joe commented, chips are more compact now, so the failure rate could be different from what it was 20 years ago.
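Spelling the arithmetic out as a back-of-the-envelope calculation (the IBM figure quoted above is the only input; at these magnitudes the linear approximation and the exact Poisson probability agree):

```c
#include <stdio.h>
#include <math.h>

/* Back-of-the-envelope: "one error per 256 MB per month" converted to
 * a per-byte-per-second rate, then applied to a program that holds
 * 20 MB of RAM for one minute. */
int main(void)
{
    double bytes   = 256.0 * 1024 * 1024;          /* 256 MB */
    double month_s = 30.44 * 24 * 3600;            /* average month, in seconds */
    double rate    = 1.0 / (bytes * month_s);      /* errors per byte-second, ~1.4e-15 */

    double exposure = 20.0 * 1024 * 1024 * 60;     /* byte-seconds: 20 MB for 60 s */
    double p_approx = rate * exposure;             /* expected number of errors */
    double p_exact  = 1.0 - exp(-rate * exposure); /* Poisson P(at least one error) */

    printf("rate = %.2e errors per byte per second\n", rate);
    printf("p    = %.2e (linear), %.2e (Poisson)\n", p_approx, p_exact); /* ~1.8e-6 */
    return 0;
}
```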
Apparently, not insignificant. From this New Scientist article, a quote from an Intel patent application:
You can read the full patent here.
Note: this answer is not about physics, but about silent memory errors with non-ECC memory modules. Some of the errors may come from outer space, and some from the inner space of the desktop.
There are several studies of ECC memory failures on large server farms like CERN clusters and Google datacenters. Server-class hardware with ECC can detect and correct all single bit errors, and detect many multi-bit errors.
We can assume that there are lots of non-ECC desktops (and non-ECC mobile smartphones) out there. If we check the papers for ECC-correctable error rates (single bit flips), we can estimate the silent memory corruption rate on non-ECC memory.
The large-scale CERN 2007 study "Data integrity" reports that vendors declare a "Bit Error Rate of 10^-12 for their memory modules" and that "an observed error rate is 4 orders of magnitude lower than expected". For data-intensive tasks (8 GB/s of memory reading) this means that a single bit flip may occur every minute (at the 10^-12 vendor BER) or once in two days (at a 10^-16 BER).
The 2009 Google paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25,000-75,000 one-bit FIT per Mbit (failures in time per billion hours of operation), which by my calculations equals 1-5 bit errors per hour for 8 GB of RAM. The paper says much the same: "mean correctable error rates of 2000-6000 per GB per year".
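Both conversions are easy to sanity-check (assuming FIT means failures per 10^9 device-hours, 1 Mbit = 2^20 bits, and 1 GB = 2^30 bytes):

```c
#include <stdio.h>

/* Sanity check of the two rate conversions above. */
int main(void)
{
    /* CERN 2007: reading memory at 8 GB/s against a given bit error rate. */
    double bits_per_s = 8.0 * 1024 * 1024 * 1024 * 8;            /* ~6.9e10 bits/s */
    printf("BER 1e-12: one flip every %.0f s\n",
           1.0 / (bits_per_s * 1e-12));                          /* ~15 s, minute-scale */
    printf("BER 1e-16: one flip every %.1f days\n",
           1.0 / (bits_per_s * 1e-16) / 86400.0);                /* ~1.7 days */

    /* Google 2009: 25,000-75,000 FIT per Mbit, applied to 8 GB of RAM. */
    double mbits = 8.0 * 1024 * 8;                               /* 8 GB = 65536 Mbit */
    printf("25k FIT: %.1f bit errors/hour\n", 25000e-9 * mbits); /* ~1.6 */
    printf("75k FIT: %.1f bit errors/hour\n", 75000e-9 * mbits); /* ~4.9 */
    return 0;
}
```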
The 2012 Sandia report "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" notes that "double bit flips were deemed unlikely", but at ORNL's dense Cray XT5 they occurred "at a rate of one per day for 75,000+ DIMMs" even with ECC. And single-bit error rates should be higher still.
So, if a program has a large dataset (several GB), or a high memory read or write rate (GB/s or more), and runs for several hours, then we can expect up to several silent bit flips on desktop hardware. This rate is not detectable by memtest, and the DRAM modules themselves are good.
Long cluster runs on thousands of non-ECC PCs, like BOINC internet-wide grid computing, will always have errors from memory bit flips and also from silent disk and network errors.
And for bigger machines (10,000 servers), even with ECC protection against single-bit errors, as we see in Sandia's 2012 report, there can be double-bit flips every day, so you will have no chance of running a full-size parallel program for several days without regular checkpointing and restarting from the last good checkpoint in case of a double error. Huge machines will also get bit flips in their caches and CPU registers (both architectural state and internal chip flip-flops, e.g. in the ALU datapath), because not all of them are protected by ECC.
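The checkpoint/restart loop mentioned above is conceptually simple. A schematic sketch follows; real HPC jobs use dedicated checkpointing libraries and parallel filesystems, and would at least write the checkpoint atomically (temp file plus rename), which this sketch skips:

```c
#include <stdio.h>

/* Schematic checkpoint/restart: periodically persist the full state,
 * and after a crash resume from the last good checkpoint instead of
 * losing a multi-day run to a single double-bit flip. */
struct state { long step; double data[1024]; };

static int save_checkpoint(const struct state *s)
{
    FILE *f = fopen("checkpoint.bin", "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof *s, 1, f);
    return (fclose(f) == 0 && ok == 1) ? 0 : -1;
}

static int load_checkpoint(struct state *s)
{
    FILE *f = fopen("checkpoint.bin", "rb");
    if (!f) return -1;            /* no checkpoint yet: fresh start */
    size_t ok = fread(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    struct state s = {0};
    if (load_checkpoint(&s) == 0)
        printf("resuming from step %ld\n", s.step);

    for (; s.step < 1000000; s.step++) {
        /* ... one unit of real work on s.data goes here ... */
        if (s.step % 10000 == 0 && save_checkpoint(&s) != 0)
            perror("checkpoint");
    }
    return 0;
}
```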
PS: Things will be much worse if the DRAM module is bad. For example, I installed new DRAM into a laptop, and the module died several weeks later: it started to give lots of memory errors. What I got: the laptop hangs, Linux reboots, runs fsck, finds errors on the root filesystem, and says it wants to reboot after correcting the errors. But at every subsequent reboot (I did around 5-6 of them) errors were still found on the root filesystem.