We have an application that uses epoll to listen for and process HTTP connections. Sometimes epoll_wait() receives a close event for the same fd twice in a row. Meaning: epoll_wait() returns a connection fd on which read()/recv() returns 0. This is a problem, since I save a malloc'ed pointer in the epoll_event struct (struct epoll_event.data.ptr) and free it when the fd (socket) is first detected as closed. The second time, it crashes.
This problem occurs very rarely in real use (except on one site, which actually has around 500-1000 users per server). I can replicate the problem using http siege with >1000 simultaneous connections per second. In this case the application segfaults (because of the invalid pointer) very randomly, sometimes after a few seconds, usually after tens of minutes. I have been able to replicate the problem with fewer connections per second, but for that I have to run the application for a long time, many days, even weeks.
All new connection fds from accept() are set as non-blocking and added to epoll as one-shot, edge-triggered, waiting for read() to become available. So why, when the server load is high, does epoll think that my application didn't get the close event and queue a new one?
epoll_wait() is running in its own thread and queues fd events to be handled elsewhere. I noticed that there were multiple closes incoming by adding simple code that checks whether an event arrives from epoll twice in a row for the same fd. It did happen, and both events were closes (recv(.., MSG_PEEK) told me this :)).
epoll fd is created:
epoll_create(1024);
epoll_wait() is run as follows:
epoll_wait(epoll_fd, events, 256, 300);
new fd is set as non-blocking after accept():
int flags = fcntl(fd, F_GETFL, 0);
err = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
new fd is added to epoll (client is a malloc'ed struct pointer):
static struct epoll_event ev;
ev.events = EPOLLIN | EPOLLONESHOT | EPOLLET;
ev.data.ptr = client;
err = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client->fd, &ev);
And after receiving and handling data from the fd, it is re-armed (necessary because of EPOLLONESHOT). At first I wasn't using edge-triggering and non-blocking IO, but I tested them and got a nice performance boost. The problem existed before adding them, though. Btw, shutdown(fd, SHUT_RDWR) is used from other threads to trigger a proper close event through epoll when the server needs to close the fd because of some HTTP error etc. (I don't actually know if this is the right way to do it, but it has worked perfectly).
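For reference, a minimal sketch of what that re-arm step typically looks like, assuming the same epoll_fd and client struct as above (the rearm_client() helper name is mine, not from the application):

/* With EPOLLONESHOT the fd stays registered but is disabled after each
 * delivered event, so it must be re-enabled with EPOLL_CTL_MOD
 * (EPOLL_CTL_ADD would fail with EEXIST). */
static int rearm_client(int epoll_fd, struct client *client)
{
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLONESHOT | EPOLLET;
    ev.data.ptr = client;
    return epoll_ctl(epoll_fd, EPOLL_CTL_MOD, client->fd, &ev);
}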
As soon as the first read() returns 0, this means that the connection was closed by the peer. Why does the kernel generate an EPOLLIN event in this case? Well, there's no other way to indicate the socket's closure when you're only subscribed to EPOLLIN. You can add EPOLLRDHUP, which is basically the same as checking for read() returning 0. However, make sure to test for this flag before you test for EPOLLIN.
The way the checks are ordered in the sketch below is relevant, and the early return on EPOLLRDHUP is important too, because by that point deleteConnectionData() may have destroyed internal structures. As EPOLLIN is set as well in case of a closure, this could lead to some problems. Ignoring EPOLLIN is safe because it won't yield any data anyway. Same for EPOLLOUT, as it's never sent in conjunction with EPOLLRDHUP!
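A minimal sketch of that ordering, assuming a per-event handler (deleteConnectionData() is the teardown named above; handleRead()/handleWrite() are illustrative placeholders):

void handle_event(struct epoll_event *ev)
{
    struct client *client = ev->data.ptr;
    if (ev->events & EPOLLRDHUP) {
        /* Peer closed its end: tear everything down and return
         * immediately, so the EPOLLIN bit (usually set alongside
         * EPOLLRDHUP on closure) is never acted on with freed state. */
        deleteConnectionData(client);
        return;
    }
    if (ev->events & EPOLLIN)
        handleRead(client);   /* safe: EPOLLRDHUP was not set */
    if (ev->events & EPOLLOUT)
        handleWrite(client);  /* never sent together with EPOLLRDHUP */
}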
Assuming that EPOLLONESHOT is bug free (I haven't searched for associated bugs though), the fact that you are processing your epoll events in another thread and that it crashes sporadically or under heavy load may mean that there is a race condition somewhere in your application.
Maybe the object pointed to by epoll_event.data.ptr gets deallocated prematurely, before the epoll event is unregistered in another thread, when your server does an active close of the client connection.
My first try would be to run it under valgrind and see if it reports any errors.
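For example (the binary name and flags are just one reasonable invocation, not from the answer):

valgrind --track-origins=yes ./your_server    # catches use-after-free and invalid reads
valgrind --tool=helgrind ./your_server        # catches data races between threads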
I would re-check myself against the "Questions and answers" and "Possible pitfalls and ways to avoid them" sections of epoll(7). There are some good points there.
Removing EPOLLONESHOT made the problem disappear, after a few other changes. Unfortunately I'm not totally sure what caused it. Using EPOLLONESHOT with threads and manually adding the fd back into the epoll queue was quite certainly the problem. Also, the data pointer in the epoll struct is now released after a delay. It works perfectly now.
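A minimal sketch of such a delayed release, assuming a single epoll thread that frees closed clients only after one whole epoll_wait() batch has been handled (all names here are illustrative, not from the answer; 256 matches the maxevents used above, so one batch can close at most that many clients):

static struct client *free_list[256];
static size_t free_count;

/* Called when a connection is detected as closed. */
static void close_client(int epoll_fd, struct client *c)
{
    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, c->fd, NULL);  /* no new events */
    close(c->fd);
    c->closed = 1;                /* handlers skip clients marked closed */
    free_list[free_count++] = c;  /* defer the free */
}

/* Called once after every event from a single epoll_wait() call has been
 * handled; only now can no stale event in that batch still point at one
 * of these clients. */
static void drain_free_list(void)
{
    while (free_count > 0)
        free(free_list[--free_count]);
}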
Register the event bit 0x2000 for "remote host closed connection" (0x2000 is the value of EPOLLRDHUP; use the named constant if your headers define it), e.g.:

ev.events = EPOLLIN | EPOLLONESHOT | EPOLLET | EPOLLRDHUP;

and check if (events & EPOLLRDHUP) to detect the remote host closing the connection.