Why exactly does ePoll scale better than Poll?
Short question, but for me it's difficult to understand.
Why exactly does ePoll scale better than Poll?
2 Answers
The poll system call needs to copy your list of file descriptors to the kernel each time. This happens only once with epoll_ctl, but not every time you call epoll_wait.

Also, epoll_wait is O(1) in respect of the number of descriptors watched [1], which means it does not matter whether you wait on one descriptor or on 5,000 or 50,000 descriptors. poll, while being more efficient than select, still has to walk over the list every time (i.e. it is O(N) in respect of the number of descriptors).

And lastly, in addition to the "normal" mode, epoll can work in "edge triggered" mode, which means the kernel does not need to keep track of how much data you've read after you've been signalled readiness. This mode is more difficult to grasp, but somewhat more efficient.
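To make the contrast concrete, here is a minimal sketch of the two models (not a complete program: error handling is omitted, and the fds array, nfds count and handle_io() callback are hypothetical placeholders):

    #include <poll.h>
    #include <sys/epoll.h>

    void handle_io(int fd);   /* hypothetical application callback */

    /* poll(): the whole descriptor list is handed to the kernel on every
     * call, and both the kernel and the application walk all nfds entries. */
    void poll_loop(struct pollfd *fds, nfds_t nfds)
    {
        for (;;) {
            int n = poll(fds, nfds, -1);             /* copies nfds entries, O(N) scan */
            for (nfds_t i = 0; n > 0 && i < nfds; i++) {
                if (fds[i].revents & POLLIN) {
                    handle_io(fds[i].fd);
                    n--;
                }
            }
        }
    }

    /* epoll: descriptors are registered once with epoll_ctl(); afterwards
     * epoll_wait() returns only the descriptors that are actually ready. */
    void epoll_loop(const int *fds, int nfds)
    {
        int epfd = epoll_create1(0);
        for (int i = 0; i < nfds; i++) {
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
            epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);   /* one-time copy per descriptor */
        }

        struct epoll_event ready[64];
        for (;;) {
            int n = epoll_wait(epfd, ready, 64, -1);       /* cost independent of the number watched */
            for (int i = 0; i < n; i++)
                handle_io(ready[i].data.fd);
        }
    }

In both sketches the application still does work proportional to the number of ready descriptors; the difference is that poll additionally pays per watched descriptor on every call.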
[1] As correctly pointed out by David Schwartz, epoll_wait is of course still O(N) in respect of the events that occur. There is hardly a way it could be any different, with any interface: if N events happen on a watched descriptor, the application needs to get N notifications and do N "things" in order to react to what's happening.

This is slightly, but not fundamentally, different in edge-triggered mode, where you actually get M events with M <= N. In edge-triggered mode, when the same event (say, POLLIN) happens several times, you will probably get fewer notifications, possibly only a single one. However, this doesn't change much about the big-O notation as such.

However, the cost of epoll_wait is independent of the number of descriptors watched. Under the assumption that it is used in the intended, "normal" way (that is, many descriptors, few events), this is what really matters, and here it is indeed O(1).

As an analogy, you can think of a hash table. A hash table accesses its content in O(1), but one could argue that calculating the hash is actually O(N) in respect of the key length. This is technically correct, and there probably exist cases where it is a problem; for most people, though, it just doesn't matter.
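As an illustration of the edge-triggered point above (a sketch under the assumption of a non-blocking socket registered with EPOLLIN | EPOLLET; consume() is a hypothetical placeholder): because readiness is reported on transitions rather than levels, a handler typically drains the descriptor until the kernel reports EAGAIN:

    #include <errno.h>
    #include <unistd.h>

    void consume(const char *buf, ssize_t len);   /* hypothetical application logic */

    /* One readiness notification may cover several arrivals, so keep
     * reading until the socket is drained. */
    void on_edge_triggered_readable(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                consume(buf, n);
            } else if (n == 0) {
                break;   /* peer closed the connection */
            } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                break;   /* drained: no further EPOLLIN until new data arrives */
            } else {
                break;   /* real error: handle or close elsewhere */
            }
        }
    }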
While Damon's reason is correct for the unusual case where you never block on a socket, in typical real-world programs, the reason is completely different. A typical program looks like this:
1) Do all the work we can do now.
2) Check if any network connections need service, blocking if there's nothing to do.
3) Service any network connections discovered.
4) Go to step 1.
Typically, because you just did all the work you could, when you come around to step 2 there is no work waiting for you, so you'll have to wait a bit. Now, imagine there are 800 sockets you are interested in. The kernel has to put you on the wait queue of each of those 800 sockets. And, a split-second later when data arrives on one of those 800 sockets, the kernel has to remove you from those 800 wait queues. Placing a task on a wait queue requires creating a 'thunk' to link that task to that wait queue. No good optimizations are possible because the kernel has no idea which 800 sockets you'll be waiting for.
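For illustration, here is that loop written against poll() (a sketch only; do_pending_work() and service() are hypothetical placeholders). The point is that the entire fds array is handed back to the kernel at step 2 on every iteration, which is exactly where the per-socket wait-queue work happens:

    #include <poll.h>

    void do_pending_work(void);   /* hypothetical: step 1 */
    void service(int fd);         /* hypothetical: step 3 */

    void event_loop(struct pollfd *fds, nfds_t nfds)
    {
        for (;;) {
            do_pending_work();                       /* step 1 */
            int n = poll(fds, nfds, -1);             /* step 2: blocks; the kernel queues the
                                                        task on one wait queue per socket */
            for (nfds_t i = 0; n > 0 && i < nfds; i++) {
                if (fds[i].revents) {                /* step 3 */
                    service(fds[i].fd);
                    n--;
                }
            }
        }                                            /* step 4: go to step 1 */
    }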
With epoll, the epoll descriptor itself has a wait queue, and the process is put only on that one single wait queue. A thunk is still needed to link each of the 800 connections to the epoll wait queue, but that thunk is persistent: you create it by adding a socket to an epoll set, and it remains there until you remove the socket from the set.

When there's activity on a socket, the kernel handles it in the task that detects the activity. When you wait, the kernel already knows whether there is a detected event, and it only has to put you on that one wait queue. When you wake, it only has to remove you from that one queue.
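A small sketch of that persistent registration (level-triggered EPOLLIN interest assumed; names are illustrative): the link is created once and survives any number of epoll_wait() calls until it is explicitly removed:

    #include <sys/epoll.h>

    /* Create the persistent link ("thunk") between a connection and the
     * epoll instance; it stays in place across every epoll_wait() call. */
    void watch_connection(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    /* The link goes away only when you remove the socket from the set
     * (or close the descriptor). */
    void unwatch_connection(int epfd, int fd)
    {
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);   /* event argument is ignored for DEL
                                                       on modern kernels */
    }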
So it's not so much the copying that's the killer with select or poll; it's the fact that the kernel has to manipulate a massive number of wait queues on each blocking operation.