Erlang socket doesn't receive until the second setopts {active,once}


First I would like to apologize for giving so much information; I'm trying to make it as clear as possible what the problem is. Please let me know if there's still anything which needs clarifying.

(Running Erlang R13B04, kernel 2.6.18-194, CentOS 5.5)

I have a very strange problem. I have the following code to listen and process sockets:

%Opts used to make listen socket
-define(TCP_OPTS, [binary, {packet, raw}, {nodelay, true}, {reuseaddr, true}, {active, false},{keepalive,true}]).

%Acceptor loop which spawns off sock processors when connections
%come in
accept_loop(Listen) ->
    case gen_tcp:accept(Listen) of
    {ok, Socket} ->
        Pid = spawn(fun()->?MODULE:process_sock(Socket) end),
        gen_tcp:controlling_process(Socket,Pid);
    {error,_} -> do_nothing
    end,
    ?MODULE:accept_loop(Listen).

%Probably not relevant
process_sock(Sock) ->
    case inet:peername(Sock) of
    {ok,{Ip,_Port}} -> 
        case Ip of
        {172,16,_,_} -> Auth = true;
        _ -> Auth = lists:member(Ip,?PUB_IPS)
        end,
        ?MODULE:process_sock_loop(Sock,Auth);
    _ -> gen_tcp:close(Sock)
    end.

process_sock_loop(Sock,Auth) ->
    try inet:setopts(Sock,[{active,once}]) of
    ok ->
        receive
        {tcp_closed,_} -> 
            ?MODULE:prepare_for_death(Sock,[]);
        {tcp_error,_,etimedout} -> 
            ?MODULE:prepare_for_death(Sock,[]);

        %Not getting here
        {tcp,Sock,Data} ->
            ?MODULE:do_stuff(Sock,Data);

        _ ->
            ?MODULE:process_sock_loop(Sock,Auth)
        after 60000 ->
            ?MODULE:process_sock_loop(Sock,Auth)
        end;
    {error,_} ->
        ?MODULE:prepare_for_death(Sock,[]) 
    catch _:_ -> 
        ?MODULE:prepare_for_death(Sock,[])
    end.

This whole setup normally works wonderfully, and has been working for the past few months. The server operates as a message-passing server with long-held TCP connections, and it holds on average about 100k connections. However, now we're trying to use the server more heavily. We're making two long-held connections (in the future probably more) to the Erlang server and issuing a few hundred commands per second on each of those connections. Each of those commands, in the common case, spawns off a new process which will probably make some kind of read from mnesia and send some messages based on that.

The strangeness comes when we try to test those two command connections. When we turn on the stream of commands, any new connection has about a 50% chance of hanging. For instance, if I were to connect with netcat and send the string "blahblahblah", the server should immediately return an error. In doing this it won't make any calls outside the process (since all it's doing is trying to parse the command, which will fail because blahblahblah isn't a command). But about 50% of the time (when the two command connections are running), typing in blahblahblah results in the server just sitting there for 60 seconds before returning that error.
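
As a point of reference, the same probe can be reproduced from an Erlang shell; the host and port below are placeholders, since the actual listen address isn't shown above:

%Placeholder address; substitute whatever the server really listens on.
{ok, S} = gen_tcp:connect("localhost", 5222, [binary, {active, false}]),
ok = gen_tcp:send(S, <<"blahblahblah">>),
%With the problem present this recv sits for ~60 seconds about half the
%time before the error reply arrives; normally it returns immediately.
io:format("~p~n", [gen_tcp:recv(S, 0, 70000)]),
gen_tcp:close(S).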

In trying to debug this I pulled up Wireshark. The TCP handshake always completes immediately, and when the first packet from the client (netcat) is sent it is ACKed immediately, telling me that the kernel's TCP stack isn't the bottleneck. My only guess is that the problem lies in the process_sock_loop function. It has a receive which will go back to the top of the function after 60 seconds and try again to get more from the socket. My best guess is that the following is happening:

  • Connection is made, thread moves on to process_sock_loop
  • {active,once} is set
  • Thread receives, but doesn't get data even though it's there
  • After 60 seconds thread goes back to the top of process_sock_loop
  • {active, once} is set again
  • This time the data comes through, things proceed as normal
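
One way to test that guess (a hypothetical diagnostic, not part of the code above) would be to check, right before the inet:setopts(Sock,[{active,once}]) call, whether the looping process actually owns the socket; erlang:port_info/2 reports the current controlling process:

%Hypothetical helper: call it at the top of process_sock_loop/2.
%If the owner is some other process, any {tcp,Sock,Data} message is
%being delivered to that process's mailbox instead of this one.
check_sock_owner(Sock) ->
    case erlang:port_info(Sock, connected) of
    {connected, Pid} when Pid =:= self() ->
        ok;
    {connected, Other} ->
        error_logger:warning_msg("~p does not own socket ~p (owner is ~p)~n",
                                 [self(), Sock, Other]);
    undefined ->
        error_logger:warning_msg("~p: socket ~p is already closed~n",
                                 [self(), Sock])
    end.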

Why this would happen I have no idea, and when we turn those two command connections off, everything goes back to normal and the problem goes away.

Any ideas?


Comments (1)

那些过往 2024-12-18 01:19:44


It's likely that your first call to set {active,once} is failing due to a race condition between your call to spawn and your call to controlling_process.

It will be intermittent, likely depending on host load.

When doing this, I'd normally spawn a function that blocks on receiving something like {take,Sock}, and then calls your loop on the socket, setting {active,once}.

So you'd change the acceptor to spawn the worker, call controlling_process, and then do Pid ! {take,Sock}, something to that effect.
Note: I don't know whether the {active,once} call actually throws when you aren't the controlling process; if it doesn't, then what I just said makes sense.
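
A minimal sketch of that handoff, assuming process_sock/1 and the rest of the loop stay exactly as in the question, might look something like this:

%Acceptor: spawn first, transfer socket ownership, and only then tell
%the worker it may start touching the socket.
accept_loop(Listen) ->
    case gen_tcp:accept(Listen) of
    {ok, Socket} ->
        Pid = spawn(fun() -> ?MODULE:wait_for_sock() end),
        gen_tcp:controlling_process(Socket, Pid),
        Pid ! {take, Socket};
    {error, _} -> do_nothing
    end,
    ?MODULE:accept_loop(Listen).

%Worker: block until ownership has been handed over, then carry on as
%before. {active,once} is never set until after this point, so the
%{tcp,...} messages are guaranteed to land in this process's mailbox.
wait_for_sock() ->
    receive
    {take, Sock} -> ?MODULE:process_sock(Sock)
    %Arbitrary safety timeout in case the handoff message never arrives.
    after 30000 -> exit(timeout)
    end.

With that ordering there is no window in which the worker enables {active,once} while the acceptor still owns the socket.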
