Interthread communication time
I am chaining together 15 asynchronous operations through ports and receivers. This has left me very concerned with interthread messaging time, specifically the time between a task posting data to a port and a new task beginning to process that same data on a different thread. Assuming the best case, where each thread is idle at the start, I wrote a test that uses the Stopwatch class to measure the time across two different dispatchers, each running a single thread at highest priority.
What I found surprised me. My development rig is a Q6600 quad-core 2.4 GHz machine running Windows 7 x64, and the average exchange time in my test was 5.66 microseconds with a standard deviation of 5.738 microseconds and a maximum of nearly 1.58 milliseconds (a factor of 282!). The Stopwatch resolution is 427.7 nanoseconds, so I am still well clear of sensor noise.
What I would like to do is reduce the interthread messaging time as much as possible and, equally important, reduce its standard deviation. I realize Windows is not a real-time OS and there are no guarantees, but the Windows scheduler is a fair, round-robin, priority-based scheduler, and the two threads in this test both run at the highest priority (the only threads that should be that high), so there should not be any context switches away from them (as evidenced by the largest time of 1.58 ms... I believe the Windows quantum is 15.65 ms?). The only thing I can think of is variation in the timing of the OS calls to the locking mechanisms the CCR uses to pass messages between threads.
Please let me know if anyone else out there has measured interthread messaging time and has any suggestions on how to improve it.
Here is the source code from my test:
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Diagnostics;
using Microsoft.Ccr.Core;

namespace Test.CCR.TestConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Starting Timer");
            var sw = new Stopwatch();
            sw.Start();

            // Two dispatchers, one thread each, both at highest priority.
            var dispatcher = new Dispatcher(1, ThreadPriority.Highest, true, "My Thread Pool");
            var dispQueue = new DispatcherQueue("Disp Queue", dispatcher);
            var sDispatcher = new Dispatcher(1, ThreadPriority.Highest, true, "Second Dispatcher");
            var sDispQueue = new DispatcherQueue("Second Queue", sDispatcher);

            var legAPort = new Port<EmptyValue>();
            var legBPort = new Port<TimeSpan>();
            var distances = new List<double>();
            long totalTicks = 0;

            // Spin for 5 s so the process is warmed up before measuring.
            while (sw.Elapsed.TotalMilliseconds < 5000) ;

            int runCnt = 100000;
            int offset = 1000; // discard the first 1000 samples from the average

            // Leg A: timestamp the send and post it to leg B's port.
            Arbiter.Activate(dispQueue, Arbiter.Receive(true, legAPort, i =>
            {
                TimeSpan sTime = sw.Elapsed;
                legBPort.Post(sTime);
            }));

            // Leg B: measure the handoff time, record it, and bounce back.
            Arbiter.Activate(sDispQueue, Arbiter.Receive(true, legBPort, i =>
            {
                TimeSpan eTime = sw.Elapsed;
                TimeSpan dt = eTime.Subtract(i);
                //if (distances.Count == 0 || Math.Abs(distances[distances.Count - 1] - dt.TotalMilliseconds) / distances[distances.Count - 1] > 0.1)
                distances.Add(dt.TotalMilliseconds);
                if (distances.Count > offset)
                    Interlocked.Add(ref totalTicks, dt.Ticks);
                if (distances.Count < runCnt)
                    legAPort.Post(EmptyValue.SharedInstance);
            }));

            //Thread.Sleep(100);
            legAPort.Post(EmptyValue.SharedInstance);
            Thread.Sleep(500);
            while (distances.Count < runCnt)
                Thread.Sleep(25);

            TimeSpan exTime = TimeSpan.FromTicks(totalTicks);
            double exMS = exTime.TotalMilliseconds / (runCnt - offset);
            Console.WriteLine("Exchange Time: {0} Stopwatch Resolution: {1}", exMS, Stopwatch.Frequency);

            using (var stw = new StreamWriter("test.csv"))
            {
                for (int ix = 0; ix < distances.Count; ix++)
                    stw.WriteLine("{0},{1}", ix, distances[ix]);
                stw.Flush();
            }
            Console.ReadKey();
        }
    }
}
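As a point of comparison, here is a minimal sketch (not part of the original test; all names are illustrative) that measures the same kind of ping-pong with two plain threads and a pair of AutoResetEvents, with no CCR involved. It can help separate raw scheduler wake-up latency from any overhead added by the CCR's port and arbiter machinery:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class PingPong
{
    // Measures the average one-way wake-up latency between two threads,
    // in microseconds, using a pair of AutoResetEvents.
    public static double MeasureAverageWakeupMicroseconds(int iterations)
    {
        var ping = new AutoResetEvent(false);
        var pong = new AutoResetEvent(false);
        var sw = Stopwatch.StartNew();
        long postedTicks = 0, totalTicks = 0;

        var replier = new Thread(() =>
        {
            for (int i = 0; i < iterations; i++)
            {
                ping.WaitOne();   // woken by the main thread
                totalTicks += sw.ElapsedTicks - Interlocked.Read(ref postedTicks);
                pong.Set();
            }
        }) { IsBackground = true, Priority = ThreadPriority.Highest };
        replier.Start();

        for (int i = 0; i < iterations; i++)
        {
            Interlocked.Exchange(ref postedTicks, sw.ElapsedTicks);
            ping.Set();
            pong.WaitOne();       // wait for the reply leg before posting again
        }
        replier.Join();

        // Stopwatch ticks -> microseconds.
        return totalTicks * 1e6 / (iterations * (double)Stopwatch.Frequency);
    }

    static void Main()
    {
        Console.WriteLine("Average wake-up: {0:F2} us",
                          MeasureAverageWakeupMicroseconds(100000));
    }
}
```

The strict alternation (each side waits for the other before continuing) keeps the queue depth at one, so the figure reported is pure handoff latency rather than queueing delay.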
Answers (4)
Windows is not a real-time OS. But you knew that already. What is killing you is the context switch time, not necessarily the message time. You didn't really specify how your inter-process communication works. If you really are just running multiple threads, you'll see some gains by not using Windows messages as a communication protocol, and instead rolling your own IPC using application-hosted message queues.
The best average you can hope for is 1 ms with any version of Windows when context switches occur. You're probably seeing the 1 ms times when your application has to yield to the kernel. This is by design for user-mode (Ring 3) applications. If it's absolutely critical that you get below 1 ms, you'll need to move some of your application into Ring 0, which means writing a device driver.
Device drivers don't suffer the same context-switch times that user apps do, and they also have access to nanosecond-resolution timers and sleep calls. If you do need to do this, the DDK (Device Driver Development Kit) is freely available from Microsoft, but I would HIGHLY recommend you invest in a third-party development kit. They usually have really good samples and lots of wizards for setting things up correctly that would otherwise take you months of reading DDK documentation to discover. You'll also want something like SoftICE, because the normal Visual Studio debugger isn't going to help you debug device drivers.
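A minimal sketch of the "application-hosted message queue" idea, using .NET's BlockingCollection (the class and method names here are illustrative, and the reported figure includes any queueing delay, not just the wake-up itself):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;

static class HandoffTimer
{
    // Posts 'samples' timestamps through an in-process queue to a dedicated
    // consumer thread and returns the average post-to-receipt time in microseconds.
    public static double MeasureAverageHandoffMicroseconds(int samples)
    {
        var queue = new BlockingCollection<long>(new ConcurrentQueue<long>());
        var sw = Stopwatch.StartNew();
        long totalTicks = 0;

        var consumer = new Thread(() =>
        {
            for (int i = 0; i < samples; i++)
            {
                long posted = queue.Take();   // blocks until a message arrives
                totalTicks += sw.ElapsedTicks - posted;
            }
        }) { IsBackground = true, Priority = ThreadPriority.Highest };
        consumer.Start();

        for (int i = 0; i < samples; i++)
            queue.Add(sw.ElapsedTicks);       // post the send timestamp
        consumer.Join();

        // Stopwatch ticks -> microseconds.
        return totalTicks * 1e6 / (samples * (double)Stopwatch.Frequency);
    }

    static void Main()
    {
        Console.WriteLine("Average handoff: {0:F2} us",
                          MeasureAverageHandoffMicroseconds(10000));
    }
}
```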
Do the 15 asynchronous operations have to be asynchronous? That is, are you forced to operate this way by a limitation of some library, or do you have the choice to make synchronous calls?
If you have the choice, you need to structure your application so that the use of asynchronicity is controlled by configuration parameters. The difference between asynchronous operations that return on a different thread and synchronous operations that return on the same thread should be transparent in the code. That way you can tune it without changing the code structure.
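One hedged sketch of making that toggle transparent (the interface and class names here are illustrative, not part of the CCR):

```csharp
using System;
using System.Threading.Tasks;

// A stage executor whose threading behaviour is chosen by configuration:
// inline (synchronous) or on the thread pool (asynchronous).
interface IStageExecutor
{
    void Execute(Action stage);
}

class InlineExecutor : IStageExecutor
{
    public void Execute(Action stage) => stage();          // same thread, no handoff
}

class ThreadPoolExecutor : IStageExecutor
{
    public void Execute(Action stage) => Task.Run(stage);  // different thread
}

static class Pipeline
{
    public static IStageExecutor Create(bool useAsync) =>
        useAsync ? (IStageExecutor)new ThreadPoolExecutor() : new InlineExecutor();

    static void Main()
    {
        // Flip this one flag (e.g. from a config file) to change the
        // threading model without touching the stage code itself.
        IStageExecutor executor = Create(useAsync: false);
        executor.Execute(() => Console.WriteLine("stage ran"));
    }
}
```

The stage bodies never know which executor runs them, so the sync/async decision can be measured and tuned per stage rather than baked into the code.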
The phrase "embarrassingly parallel" describes an algorithm in which the majority of the work being done is obviously independent and so can be done in any order, making it easy to parallelise.
But you are "chaining together 15 async operations through ports and receivers". This could be described as "embarrassingly sequential". In other words, the same program could logically be written on a single thread. But then you'd lose any parallelism for the CPU-bound work occurring between the async operations (assuming there is any of significance).
If you write a simple test that cuts out any significant CPU-bound work and just measures the context-switching time, then guess what: you're going to be measuring the variation in the context-switching time, as you've discovered.
The only reason for running multiple threads is that you have significant amounts of work for CPUs to do, and so you'd like to share it out between several CPUs. If the individual chunks of computation are short-lived enough, then context switching will be a significant overhead on any OS. By breaking your computation down into 15 stages, each very short, you are essentially asking the OS to do a lot of unnecessary context switching.
ThreadPriority.Highest doesn't mean the thread has the highest priority the scheduler knows about. The Win32 API has a more granular set of thread priorities (clicky), with several levels above Highest (IIRC, Highest is typically the highest priority non-admin code can run at; admins can schedule higher priorities, as can hardware drivers / kernel-mode code), so there is no guarantee your threads will not be pre-empted.
Even if a thread is running at the highest priority, Windows can boost other threads above their base priority if they are holding resource locks that higher-priority threads require, which is another possible reason you may be suffering context switches.
Even then, as you say, Windows isn't a real-time OS, and it isn't guaranteed to honour thread priorities anyway.
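For what it's worth, a sketch of climbing higher in managed code: a thread's effective priority is the process priority class plus the thread's relative priority (ThreadPriority.Highest is only +2 within the current class), so raising the process class moves the whole band up. This uses only standard .NET APIs; the RealTime class typically requires admin rights, and nothing here overrides the caveats above:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class PriorityDemo
{
    static void Main()
    {
        // Raise the process priority class first; High moves the whole
        // priority band up, which matters more than the per-thread setting.
        using (var self = Process.GetCurrentProcess())
        {
            self.PriorityClass = ProcessPriorityClass.High; // RealTime needs admin
        }

        // Then set the per-thread relative priority within that band.
        Thread.CurrentThread.Priority = ThreadPriority.Highest;

        Console.WriteLine("Process class: High, thread priority: Highest");
    }
}
```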
To attack this problem a different way: do you need to have so many decoupled asynchronous operations? It may be useful to think about vertically partitioning the work (asynchronously processing numCores chunks of data end to end) instead of horizontally partitioning it (as now, with each chunk of data processed in 15 decoupled stages), or synchronously coupling some of your 15 stages to reduce the total to a smaller number.
The overhead of inter-thread communication will always be non-trivial. If some of your 15 operations are doing only a small amount of work, the context switches will bite you.
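A sketch of the vertical-partitioning idea: each worker runs the whole stage chain for its own chunk in sequence, so the only handoffs are at the very start and end. The stage bodies below are placeholders standing in for the real 15 operations:

```csharp
using System;
using System.Linq;

static class VerticalPartition
{
    // Runs every stage over each element end to end, with one worker
    // per chunk instead of one handoff per stage.
    public static int[] ProcessAll(int[] data)
    {
        Func<int, int>[] stages =
        {
            x => x + 1,   // placeholder for stage 1
            x => x * 2,   // placeholder for stage 2
            x => x - 3,   // ... up to stage 15 in the real pipeline
        };

        return data
            .AsParallel()
            .AsOrdered()  // keep input order in the output
            .Select(x => stages.Aggregate(x, (acc, stage) => stage(acc)))
            .ToArray();
    }

    static void Main()
    {
        int[] result = ProcessAll(new[] { 1, 2, 3 });
        Console.WriteLine(string.Join(",", result));   // prints "1,3,5"
    }
}
```

Each element pays for at most two handoffs (into and out of the parallel query) regardless of how many stages there are, instead of 15 port posts per element.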