Socket recv() hangs on large message with MSG_WAITALL

Posted 2024-12-20 23:31:20


I have an application that reads large files from a server and hangs frequently on a particular machine. It has worked successfully under RHEL5.2 for a long time. We have recently upgraded to RHEL6.1 and it now hangs regularly.

I have created a test app that reproduces the problem. It hangs approx 98 times out of 100.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/time.h>

int mFD = 0;

void open_socket()
{
  struct addrinfo hints, *res;
  memset(&hints, 0, sizeof(hints));
  hints.ai_socktype = SOCK_STREAM;
  hints.ai_family = AF_INET;

  if (getaddrinfo("localhost", "60000", &hints, &res) != 0)
  {
    fprintf(stderr, "Exit %d\n", __LINE__);
    exit(1);
  }

  mFD = socket(res->ai_family, res->ai_socktype, res->ai_protocol);

  if (mFD == -1)
  {
    fprintf(stderr, "Exit %d\n", __LINE__);
    exit(1);
  }

  if (connect(mFD, res->ai_addr, res->ai_addrlen) < 0)
  {
    fprintf(stderr, "Exit %d\n", __LINE__);
    exit(1);
  }

  freeaddrinfo(res);
}

void read_message(int size, char* data)
{
  int bytesLeft = size;
  int numRd = 0;

  while (bytesLeft != 0)
  {
    fprintf(stderr, "reading %d bytes\n", bytesLeft);

    /* Replacing MSG_WAITALL with 0 works fine */
    int num = recv(mFD, data, bytesLeft, MSG_WAITALL);

    if (num == 0)
    {
      break;
    }
    else if (num < 0 && errno != EINTR)
    {
      fprintf(stderr, "Exit %d\n", __LINE__);
      exit(1);
    }
    else if (num > 0)
    {
      numRd += num;
      data += num;
      bytesLeft -= num;
      fprintf(stderr, "read %d bytes - remaining = %d\n", num, bytesLeft);
    }
  }

  fprintf(stderr, "read total of %d bytes\n", numRd);
}

int main(int argc, char **argv)
{
  open_socket();

  uint32_t raw_len = atoi(argv[1]);
  char raw[raw_len];

  read_message(raw_len, raw);

  return 0;
}

Some notes from my testing:

  • If "localhost" maps to the loopback address 127.0.0.1, the app hangs on the call to recv() and NEVER returns.
  • If "localhost" maps to the ip of the machine, thus routing the packets via the ethernet interface, the app completes successfully.
  • When I experience a hang, the server sends a "TCP Window Full" message, and the client responds with a "TCP ZeroWindow" message (see image and attached tcpdump capture). From this point, it hangs forever with the server sending keep-alives and the client sending ZeroWindow messages. The client never seems to expand its window, allowing the transfer to complete.
  • During the hang, if I examine the output of "netstat -a", there is data in the servers send queue but the clients receive queue is empty.
  • If I remove the MSG_WAITALL flag from the recv() call, the app completes successfully.
  • The hanging issue only arises using the loopback interface on 1 particular machine. I suspect this may all be related to timing dependencies.
  • As I drop the size of the 'file', the likelihood of the hang occurring is reduced

The source for the test app can be found here:

Socket test source

The tcpdump capture from the loopback interface can be found here:

tcpdump capture

I reproduce the issue by issuing the following commands:

>  gcc socket_test.c -o socket_test
>  perl -e 'for (1..6000000){ print "a" }' | nc -l 60000
>  ./socket_test 6000000

This sends 6000000 bytes to the test app, which attempts to read all of the data with a single call to recv().

I would love to hear any suggestions on what I might be doing wrong or any further ways to debug the issue.


Comments (2)

人海汹涌 2024-12-27 23:31:20


MSG_WAITALL should block until all data has been received. From the manual page on recv:

This flag requests that the operation block until the full request is satisfied.

However, the buffers in the network stack probably are not large enough to contain everything, which is the reason for the error messages on the server. The client network stack simply can't hold that much data.

The solution is either to increase the buffer sizes (the SO_RCVBUF option to setsockopt), split the message into smaller pieces, or receive smaller chunks and assemble them in your own buffer. The last is what I would recommend.

Edit: I see in your code that you already do what I suggested (read smaller chunks with your own buffering), so just remove the MSG_WAITALL flag and it should work.

Oh, and when recv returns zero, that means the other end has closed the connection, and you should close your end too.

失眠症患者 2024-12-27 23:31:20


Consider these two possible rules:

  1. The receiver may wait for the sender to send more before receiving what has already been sent.

  2. The sender may wait for the receiver to receive what has already been sent before sending more.

We can have either of these rules, but we cannot have both of these rules.

Why? Because if the receiver is permitted to wait for the sender, that means the sender cannot wait for the receiver to receive before sending more, otherwise we deadlock. And if the sender is permitted to wait for the receiver, that means the receiver cannot wait for the sender to send before receiving more, otherwise we deadlock.

If both of these things happen at the same time, we deadlock. The sender will not send more until the receiver receives what has already been sent, and the receiver will not receive what has already been sent unless the sender sends more. Boom.

TCP chooses rule 2 (for reasons that should be obvious). Thus it cannot support rule 1. But in your code, you are the receiver, and you are waiting for the sender to send more before you receive what has already been sent. So this will deadlock.
