What prevents an operating system from recovering from a "Blue Screen of Death"?

Posted on 2024-11-10 04:00:56

If a program violates its instruction path and/or its memory data, the OS halts it with some message. It can do so because the program runs in a "virtual machine"-like space provided by the OS, and the program is unable to determine its next instruction.

The OS in turn is also a program, sharing the machine's resources like any other program, and it can halt in a similar fashion. Yet it is sometimes healthy enough to display some debugging info and a blue screen. So, as a programmer, I'm thinking: if I can do that (emit debugging info and make the screen blue), why wouldn't I be able to try to recover the OS altogether instead of requiring a cold reboot? After all, it's the OS. It's supposed to be the rock-solid foundation of all software (not talking about Windows, of course). If the space shuttle ran Windows, what would happen? Would it not recover? :)

So: is it only that MS hasn't tried everything to recover to the point where a reboot is not required, or is there some deeper problem that has stopped companies like MS from being able to do that?

Comments (2)

源来凯始玺欢你 2024-11-17 04:00:56

It's nothing specific to Microsoft: Linux has a kernel panic mechanism, and OS X has a kernel panic mechanism (http://en.wikipedia.org/wiki/File%3aMacOSX_kernel_panic.png). I expect every non-toy operating system kernel has a panic mechanism of some sort for when internal corruption is detected. The corruption could come from faulty hardware, faulty software, gamma rays hitting the memory boards just right, who knows.

The whole point of a kernel panic is to recognize that something that should never go wrong has gone wrong. What else might be invalid? Depending on where the crash happened, it might not be safe to sync and unmount the filesystems, because that might scribble corrupt data over good data on the drives.

Writing to the video card is a good way to inform the user of events (many systems have monitors attached anyway), and it isn't likely to corrupt on-disk data: it would take quite an error for the IOMMU or page tables to be so corrupted that they refer to on-disk files instead. Most operating systems will also refuse to write to block devices after a kernel panic, to try to protect user data at all costs.
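The policy described above (report on the screen, but stop touching the disk) can be sketched as a toy model. This is only an illustration, not how any real kernel is implemented: `ToyKernel`, `write_block`, `check_invariant`, and the `free_pages` counter are all hypothetical names invented for this example.

```python
class KernelPanic(Exception):
    """Raised when the toy kernel is asked to do work after a panic."""


class ToyKernel:
    def __init__(self):
        self.panicked = False
        self.console = []  # stands in for the video card / text console
        self.disk = {}     # stands in for a block device: block number -> data

    def write_block(self, block_no, data):
        # After a panic, in-memory state is untrusted, so refuse disk writes.
        if self.panicked:
            raise KernelPanic("block writes are disabled after a panic")
        self.disk[block_no] = data

    def panic(self, reason):
        # Record the failure, then report it on the console only.
        self.panicked = True
        self.console.append(f"PANIC: {reason}")

    def check_invariant(self, condition, reason):
        # An invariant is something that must always hold inside the kernel.
        if not condition:
            self.panic(reason)


kernel = ToyKernel()
kernel.write_block(0, b"good data")

free_pages = -1  # hypothetical corrupted counter; it should never be negative
kernel.check_invariant(free_pages >= 0, "free page count is negative")

try:
    kernel.write_block(0, b"possibly corrupt data")
except KernelPanic:
    pass  # the write was refused, so the earlier data survives
```

The point of the design is that the console message is cheap and harmless, while the refused write is the one operation that could have turned an in-memory problem into permanent on-disk damage.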

Consider what you would have to do to bring the system back to a running state. You'd need to tear down all applications that might be associated with corrupted kernel data structures. You'd need to restart applications, in the right order, to bring system services back up. A reboot is a very easy way to do both of those things reliably.
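To give a feel for what "in the right order" involves, here is a minimal sketch that computes a valid startup order from a dependency graph using the standard library's `graphlib` (Python 3.9+). The service names and their dependencies are made up for illustration; a real system would have far more of them, plus state that cannot simply be restarted.

```python
from graphlib import TopologicalSorter

# Hypothetical services, each mapped to the services it depends on.
dependencies = {
    "filesystem": set(),
    "network": set(),
    "logging": {"filesystem"},
    "database": {"filesystem", "logging"},
    "web_server": {"network", "database"},
}


def startup_order(deps):
    # static_order() yields every service after all of its dependencies.
    return list(TopologicalSorter(deps).static_order())


order = startup_order(dependencies)
print(order)
```

A normal boot sequence already encodes this ordering, which is part of why rebooting is the simple, reliable option.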

笑脸一如从前 2024-11-17 04:00:56

You can't recover the OS for the same reason a user-space program can't recover: when certain types of errors are seen, it means the program is in an undefined state and therefore can't recover. Even if the problem isn't fatal in some sense (i.e. it doesn't cause the program to die immediately), it's not safe to continue, because things are corrupted or are likely to be.

For example, be it a user-space program or the OS kernel, say a buffer overrun or a messed-up pointer causes the stack to be corrupted. How is the program supposed to recover from that? With a blown stack, where will the currently executing function return to when it ends? The return address is likely gone. Now what?
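The effect can be imitated safely with a simulation. The sketch below models a stack frame as a flat list in which a small buffer sits right before the saved return address; it is not a real stack, and `make_frame`, `unchecked_copy`, and the address `0x4000` are hypothetical names and values chosen for the example.

```python
def make_frame(buffer_size, return_address):
    # A frame modelled as a flat list: buffer slots, then the saved return address.
    return [0] * buffer_size + [return_address]


def unchecked_copy(frame, data):
    # Copies without checking the buffer size, like a careless strcpy.
    for i, value in enumerate(data):
        frame[i] = value


BUFFER_SIZE = 4
RETURN_ADDRESS = 0x4000  # hypothetical address of the caller

frame = make_frame(BUFFER_SIZE, RETURN_ADDRESS)

unchecked_copy(frame, [0x41] * BUFFER_SIZE)        # fits in the buffer
return_after_safe_copy = frame[-1]

unchecked_copy(frame, [0x41] * (BUFFER_SIZE + 1))  # one element too many
return_after_overrun = frame[-1]

print(hex(return_after_safe_copy), hex(return_after_overrun))
```

After the overrun, the slot that held the return address contains input data, and nothing left in the frame says what the original value was. That is the "now what?" situation: the information needed to continue correctly no longer exists.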

And it's not just Microsoft. Ever heard of a "kernel panic" in Unix?
