Cost of context switching between threads of the same process on Linux
Is there any good empirical data on the cost of context switching between threads of the same process on Linux (x86 and x86_64, mainly, are of interest)? I'm talking about the number of cycles or nanoseconds between the last instruction one thread executes in userspace before getting put to sleep voluntarily or involuntarily, and the first instruction a different thread of the same process executes after waking up on the same cpu/core.
I wrote a quick test program that constantly performs rdtsc in 2 threads assigned to the same cpu/core, stores the result in a volatile variable, and compares to its sister thread's corresponding volatile variable. The first time it detects a change in the sister thread's value, it prints the difference, then goes back to looping. I'm getting minimum/median counts of about 8900/9600 cycles this way on an Atom D510 cpu. Does this procedure seem reasonable, and do the numbers seem believable?
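In outline, the test does something like the sketch below (a simplified illustration rather than the exact program I ran: it assumes GCC/Clang on x86/x86_64, uses __rdtsc() from x86intrin.h instead of inline assembly, and names like slot[] are made up for readability):

    /* Two spinning threads pinned to the same CPU publish their latest TSC
     * reading and watch each other's; when the sibling's value changes, the
     * difference approximates the time spent switched out. Compile with
     * -pthread; runs until interrupted. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>                  /* __rdtsc() */

    static volatile uint64_t slot[2];       /* each thread's latest TSC value */

    static void *worker(void *arg)
    {
        int self = (int)(intptr_t)arg, other = 1 - self;

        cpu_set_t set;                      /* pin both threads to CPU 0 so  */
        CPU_ZERO(&set);                     /* they have to take turns on it */
        CPU_SET(0, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        uint64_t last_seen = slot[other];
        for (;;) {
            uint64_t seen = slot[other];
            if (seen != last_seen) {        /* sibling ran since we last looked */
                printf("thread %d: %llu cycles\n", self,
                       (unsigned long long)(__rdtsc() - seen));
                last_seen = seen;
            }
            slot[self] = __rdtsc();         /* publish our own latest reading */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        pthread_create(&t[0], NULL, worker, (void *)(intptr_t)0);
        pthread_create(&t[1], NULL, worker, (void *)(intptr_t)1);
        pthread_join(t[0], NULL);
        return 0;
    }

Since both threads spin rather than sleep, the switches this observes are the involuntary, timer-driven kind.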
My goal is to estimate whether, on modern systems, a thread-per-connection server model could be competitive with or even outperform select-type multiplexing. This seems plausible in theory, as the transition from performing IO on fd X to fd Y involves merely going to sleep in one thread and waking up in another, rather than multiple syscalls, but it's dependent on the overhead of context switching.
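By thread-per-connection I mean roughly the following per-connection loop (a sketch only; the echo write stands in for whatever the server would actually do with a request):

    /* One thread per connection: the thread simply blocks in read(), so
     * moving the CPU from fd X to fd Y is a sleep in one thread and a
     * wakeup in another rather than extra multiplexing syscalls. */
    #include <stdint.h>
    #include <unistd.h>

    void *connection_thread(void *arg)
    {
        int fd = (int)(intptr_t)arg;        /* the accepted socket */
        char buf[4096];
        ssize_t n;

        while ((n = read(fd, buf, sizeof buf)) > 0)   /* blocks until data */
            (void)write(fd, buf, (size_t)n);          /* placeholder: echo */

        close(fd);
        return NULL;
    }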
1 Answer
(Disclaimer: This isn't a direct answer to the question, it's just some suggestions that I hope will be helpful).
Firstly, the numbers you're getting certainly sound like they're within the ballpark. Note, however, that the interrupt/trap latency can vary a lot among different CPU models implementing the same ISA. It's also a different story if your threads have used floating point or vector operations, because if they haven't, the kernel avoids saving/restoring the floating point or vector unit state.
You should be able to get more accurate numbers by using the kernel tracing infrastructure - perf sched in particular is designed to measure and analyse scheduler latency.

If your goal is to model thread-per-connection servers, then you probably shouldn't be measuring involuntary context switch latency - usually in such a server, the majority of context switches will be voluntary, as a thread blocks in read() waiting for more data from the network. Therefore, a better testbed might involve measuring the latency from one thread blocking in a read() to another being woken up from the same.
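A minimal sketch of that kind of testbed, assuming a pair of pipes between two threads pinned to the same CPU (names and parameters are illustrative, error handling omitted), might look like this:

    /* Ping-pong testbed: two threads pinned to one CPU pass a timestamp
     * through pipes, so every hop is a voluntary switch (block in read(),
     * wake the peer). Compile with -pthread on Linux. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ROUNDS 100000

    static int ab[2], ba[2];                /* pipe A->B and pipe B->A */

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    static void pin_to_cpu0(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *peer(void *arg)            /* thread B: echo A's messages */
    {
        (void)arg;
        pin_to_cpu0();
        uint64_t stamp;
        for (int i = 0; i < ROUNDS; i++) {
            read(ab[0], &stamp, sizeof stamp);    /* block until A pokes us  */
            write(ba[1], &stamp, sizeof stamp);   /* ...then poke A back     */
        }
        return NULL;
    }

    int main(void)
    {
        pipe(ab);
        pipe(ba);
        pin_to_cpu0();

        pthread_t t;
        pthread_create(&t, NULL, peer, NULL);

        uint64_t total = 0;
        for (int i = 0; i < ROUNDS; i++) {
            uint64_t start = now_ns(), echo;
            write(ab[1], &start, sizeof start);   /* wake B, then block below */
            read(ba[0], &echo, sizeof echo);      /* B ran: one full round trip */
            total += now_ns() - start;
        }
        pthread_join(t, NULL);

        /* Each round trip is two voluntary switches plus four syscalls. */
        printf("avg round trip: %llu ns\n", (unsigned long long)(total / ROUNDS));
        return 0;
    }

Half of the reported round trip gives a rough upper bound on one voluntary switch, though it still includes the read()/write() syscall cost on either side.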
Note that in a well-written multiplexing server under heavy load, the transition from fd X to fd Y will often involve the same single system call (as the server iterates over a list of active file descriptors returned from a single epoll()).
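To illustrate, that single-syscall transition is just the usual event loop, roughly like this (a sketch only; the listening-socket setup and the EPOLLIN registrations on epfd are assumed to have happened elsewhere, and the echo stands in for real request handling):

    /* One epoll_wait() call returns all ready descriptors, and the server
     * just iterates over them, so moving from fd X to fd Y costs no extra
     * syscall. Illustrative only. */
    #include <sys/epoll.h>
    #include <unistd.h>

    void event_loop(int epfd)
    {
        struct epoll_event events[64];
        char buf[4096];

        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);   /* one syscall...   */
            for (int i = 0; i < n; i++) {               /* ...many ready fds */
                int fd = events[i].data.fd;
                ssize_t got = read(fd, buf, sizeof buf);
                if (got > 0)
                    (void)write(fd, buf, (size_t)got);  /* placeholder: echo */
                else
                    close(fd);                          /* EOF or error */
            }
        }
    }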
One thread also ought to have less cache footprint than multiple threads, simply through having only one stack. I suspect the only way to settle the matter (for some definition of "settle") might be to have a benchmark shootout...