Microsoft.NET and Multicore CPUs of Doom
The question proper
Has anyone experienced this exception on a single core machine?
The I/O operation has been aborted because of either a thread exit or an application request.
Some context
On a single CPU system, only one MSIL instruction is executed at a time, threads notwithstanding. Between operations, the runtime gets to do its housekeeping.
Introduce a second CPU (or a second core) and it becomes possible to have an operation execute while the runtime does housekeeping. As a result, code that works perfectly on a single CPU machine may crash - or even induce a bluescreen - when executed in a multicore environment.
Interestingly, HyperThreaded Pentiums do not manifest the problem.
I had sample code that worked perfectly on a single core and flaked on a multicore CPU. It's around somewhere but I'm still trying to find it. The gist of it was that when it was implemented as a Visitor pattern, it would flake after an unpredictable number of iterations, but moving the method into the object on which the visitor had operated made the problem disappear.
To me this suggests that the framework has some kind of internal hash table for resolving object references, and on a multicore system a race condition exists with respect to accessing this.
I also currently have code using APM to process serial comms. It used to intermittently bluescreen inside the virtual comport driver for my USB serial adaptor, but I fixed this by doing a Thread.Sleep(0) after every Stream.EndRead(IAsyncResult).
At random intervals, when the AsyncCallback I supply to Stream.BeginRead(...) is invoked and the handler tries to invoke Stream.EndRead(IAsyncResult), it throws an IOException stating that The I/O operation has been aborted because of either a thread exit or an application request.
I suspect that this too is multicore related and that some sort of internal error is killing the wait thread, leading to this behaviour. If I am right about this then the framework has serious flaws in the context of a multicore environment. While there are workarounds such as I have mentioned, you can't always apply them because sometimes they need to be applied inside other framework code.
For example, if you search the net regarding the above IOException you will find it affecting code written by people who clearly don't even know they are using multiple threads because it happens under the covers of framework convenience wrappers.
Microsoft tends to blow off these bug reports as unreproducible. I suspect this is because the problem only occurs on multicore systems and bug reports like this one don't mention the number of CPUs.
So... please help me pin down the problem. If I'm right about this I'm going to have to be able to prove it with repeatable test cases, because what I think is wrong is going to entail bugfixes in both framework and runtime.
It has been suggested that the problem is more likely to be my code than the framework.
Investigating variant A of the issue, I have transplanted the problem code into a sample app and pared it down until the only things left were thread setup and method invocations that worked on one CPU and failed on two.
Variant B I have not so tested, because I no longer have any single core systems. So I repeat the question: has anyone seen this exception on a single core platform?
Unfortunately no-one can confirm my suspicion, only refute it.
It is not helpful to tell me that I am fallible; I am already aware of this.
If you know of a way to pin a .NET application to a single CPU, it would be very handy for figuring this out.
Thanks for the VM suggestion. I will do exactly that, good call.
Comments (5)
Blue screens aren't due solely to bugs in applications or frameworks. Blue screens need "help" from kernel mode. One of your problems is a defective driver, no matter which "era" the defective driver was coded in.
Regarding the possibility of one thread closing the port while another thread is still using it, I think this could be related to some famous bugs in framework housekeeping. I think those bugs don't depend on the number of cores, but the frequency of getting hit by those bugs could increase when there are more cores. Try adding a GC.KeepAlive call to prevent the framework from deleting your port too early.
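The GC.KeepAlive suggestion could look roughly like the sketch below. It is only an illustration under stated assumptions: FakePort is a hypothetical stand-in for a port object whose finalizer closes its underlying stream, and a MemoryStream replaces the real hardware. Whether this is what actually happens in the question's code is not established.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a port object: its finalizer closes the stream.
sealed class FakePort
{
    public Stream BaseStream { get; } = new MemoryStream(new byte[] { 1, 2, 3, 4 });

    ~FakePort() { BaseStream.Dispose(); }
}

static class KeepAliveSketch
{
    public static int ReadAll()
    {
        var port = new FakePort();
        Stream stream = port.BaseStream; // from here on only 'stream' is used

        var buffer = new byte[4];
        int total = 0;
        int n;
        while ((n = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += n;
        }

        // Without this call the runtime may treat 'port' as unreachable once
        // BaseStream has been fetched, so its finalizer could close the
        // stream while the loop above is still reading from it.
        GC.KeepAlive(port);
        return total;
    }
}
```

The call does nothing at run time; its only effect is to extend the point up to which the object counts as reachable.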
I'm currently in the process of rewriting the whole file transfer stack that is used in our application. From conversations with other workers I know that it was kind of working a couple of years ago, when single core laptops and low-speed connections were used in production. Now everyone has moved to dual cores and high-speed internet, and the whole software shows unpredictable results.
So, when I started studying the code more, I found that the person who developed it had not a single idea of how to properly write multithreaded code. All "synchronization" is done using Thread.Sleep()! Thread management was done on a "fire and forget" basis. Someone wants to stop the thread? Thread.Abort()! Dammit! It's a surprise the damn thing was working at all.
My point is -- go and check your code, and if you're working with some custom hardware, check its drivers' code too. The problem is there, not in .NET, Win32 or somewhere else.
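For contrast with Thread.Sleep() "synchronization" and Thread.Abort(), here is a minimal sketch of a worker that is stopped cooperatively. The Worker class and its member names are hypothetical and not taken from the code described above.

```csharp
using System;
using System.Threading;

// Hypothetical worker that stops when asked instead of being aborted.
sealed class Worker
{
    private readonly ManualResetEvent _stop = new ManualResetEvent(false);
    private readonly Thread _thread;
    private int _iterations;

    public Worker()
    {
        _thread = new Thread(Run) { IsBackground = true };
    }

    public int Iterations => Volatile.Read(ref _iterations);

    public void Start() => _thread.Start();

    // Cooperative shutdown: signal the thread, then wait until it has exited.
    public void Stop()
    {
        _stop.Set();
        _thread.Join();
    }

    private void Run()
    {
        // WaitOne acts as an interruptible delay: it returns true as soon
        // as Stop() signals, so the loop never has to be torn down by force.
        while (!_stop.WaitOne(10))
        {
            Interlocked.Increment(ref _iterations);
        }
    }
}
```

Because Stop() joins the thread, the caller knows the worker has really finished before it releases anything the worker was using.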
Based on your description, my inclination would be to blame the COM port driver. Was the driver for it developed prior to the multicore era? I once had a similar issue with such a device which a later driver revision thankfully fixed.
Addition: To answer your question on how to limit your app to a single CPU, you will need to set the process affinity to a single CPU. See this link. You can also do this after your process has started, using Task Manager (right click on the process in Task Manager and select "Set Affinity...").
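Setting the affinity from inside the application can be sketched as follows, using System.Diagnostics.Process.ProcessorAffinity. The class and method names are made up for illustration, and the property is not supported on every platform (macOS, for example).

```csharp
using System;
using System.Diagnostics;

static class AffinitySketch
{
    // Restricts the current process to logical processor 0.
    public static void PinToFirstCpu()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            // The value is a bitmask: bit 0 set means "CPU 0 only".
            current.ProcessorAffinity = (IntPtr)1;
        }
    }
}
```

Calling AffinitySketch.PinToFirstCpu() at the start of Main lets the same binary be tested in a single-CPU configuration on a multicore machine.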
Prior to Vista, any async IO that was still in progress when the thread that issued it terminated was itself terminated. This tends to give the error that you report.
I'm not sure if this is in any way relevant to your question, but are you issuing asynchronous operations from a thread which can terminate before the operations have completed?
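If that is the cause, one way to rule it out is to make sure the issuing thread outlives its pending read. The sketch below assumes a plain Stream and uses hypothetical names throughout; it is not the asker's code, and it deliberately blocks, so it is a diagnostic aid rather than a design recommendation.

```csharp
using System;
using System.IO;
using System.Threading;

static class IssuingThreadSketch
{
    // Issues one APM read on a dedicated thread and keeps that thread alive
    // until the callback has run.
    public static int ReadOnDedicatedThread(Stream stream, byte[] buffer)
    {
        int bytesRead = -1;
        Exception error = null;

        var issuer = new Thread(() =>
        {
            try
            {
                using (var completed = new ManualResetEvent(false))
                {
                    stream.BeginRead(buffer, 0, buffer.Length, ar =>
                    {
                        try { bytesRead = stream.EndRead(ar); }
                        catch (Exception ex) { error = ex; }
                        finally { completed.Set(); }
                    }, null);

                    // Do not let the issuing thread exit while its read is pending.
                    completed.WaitOne();
                }
            }
            catch (Exception ex)
            {
                error = ex;
            }
        });

        issuer.Start();
        issuer.Join();

        if (error != null)
        {
            throw new IOException("Read failed on the issuing thread.", error);
        }
        return bytesRead;
    }
}
```

If the IOException stops appearing once every read is issued this way, thread lifetime is a more likely culprit than the number of cores.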
I am totally at a loss for words here. You are saying that your code is breaking on dual core machines and you are suspecting MS for that!
Nowadays every machine out there has dual or even quad cores. If the .NET Framework had any major issue working with dual cores, then why are Live Messenger, Live Writer and many other thick .NET applications not breaking frequently? I believe the SQL Server 2K5 and 2K8 Management Studios are also in .NET. The entire System.Web implementation is in C# itself. The entire BizTalk orchestration designer is in .NET.
Now, coming to the point. Your application seems to have multithreading and lots of async calls going back and forth. Do you have the flexibility to configure the number of threads in your application? If yes, can you limit the threads to 1 and then test it? Errors due to multithreading are very difficult to trace.
Have you tried SOS? Try doing that. I don't know it well, but Google for it and you will certainly find good resources on using SOS.
As a last resort, open a case with MS support. You need to be a little patient with them because at first they will start with all the silly questions :). Good luck.