Erlang socket doesn't receive until the second setopts {active,once}


First I would like to apologize for giving so much information; I'm trying to make it as clear as possible what the problem is. Please let me know if there's still anything which needs clarifying.

(Running Erlang R13B04, kernel 2.6.18-194, CentOS 5.5)

I have a very strange problem. I have the following code to listen and process sockets:

%Opts used to make listen socket
-define(TCP_OPTS, [binary, {packet, raw}, {nodelay, true}, {reuseaddr, true}, {active, false},{keepalive,true}]).

%Acceptor loop which spawns off sock processors when connections
%come in
accept_loop(Listen) ->
    case gen_tcp:accept(Listen) of
    {ok, Socket} ->
        Pid = spawn(fun()->?MODULE:process_sock(Socket) end),
        gen_tcp:controlling_process(Socket,Pid);
    {error,_} -> do_nothing
    end,
    ?MODULE:accept_loop(Listen).

%Probably not relevant
process_sock(Sock) ->
    case inet:peername(Sock) of
    {ok,{Ip,_Port}} -> 
        case Ip of
        {172,16,_,_} -> Auth = true;
        _ -> Auth = lists:member(Ip,?PUB_IPS)
        end,
        ?MODULE:process_sock_loop(Sock,Auth);
    _ -> gen_tcp:close(Sock)
    end.

process_sock_loop(Sock,Auth) ->
    try inet:setopts(Sock,[{active,once}]) of
    ok ->
        receive
        {tcp_closed,_} -> 
            ?MODULE:prepare_for_death(Sock,[]);
        {tcp_error,_,etimedout} -> 
            ?MODULE:prepare_for_death(Sock,[]);

        %Not getting here
        {tcp,Sock,Data} ->
            ?MODULE:do_stuff(Sock,Data);

        _ ->
            ?MODULE:process_sock_loop(Sock,Auth)
        after 60000 ->
            ?MODULE:process_sock_loop(Sock,Auth)
        end;
    {error,_} ->
        ?MODULE:prepare_for_death(Sock,[]) 
    catch _:_ -> 
        ?MODULE:prepare_for_death(Sock,[])
    end.

This whole setup normally works wonderfully, and has been working for the past few months. The server operates as a message-passing server with long-held TCP connections, and it holds on average about 100k connections. However, now we're trying to use the server more heavily. We're making two long-held connections (in the future probably more) to the Erlang server and issuing a few hundred commands per second on each of those connections. Each of those commands, in the common case, spawns off a new process which will probably make some kind of read from mnesia and send some messages based on that.

The strangeness comes when we try to test those two command connections. When we turn on the stream of commands, any new connection has about a 50% chance of hanging. For instance, if I were to connect with netcat and send the string "blahblahblah", the server should immediately return an error. In doing this it won't make any calls outside the process (since all it's doing is trying to parse the command, which will fail because blahblahblah isn't a command). But about 50% of the time (when the two command connections are running), typing in blahblahblah results in the server just sitting there for 60 seconds before returning that error.
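
As a point of reference, the same probe can be reproduced from an Erlang shell; the host and port below are placeholders, since the actual listen address isn't shown above:

%Placeholder address; substitute whatever the server really listens on.
{ok, S} = gen_tcp:connect("localhost", 5222, [binary, {active, false}]),
ok = gen_tcp:send(S, <<"blahblahblah">>),
%With the problem present this recv sits for ~60 seconds about half the
%time before the error reply arrives; normally it returns immediately.
io:format("~p~n", [gen_tcp:recv(S, 0, 70000)]),
gen_tcp:close(S).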

In trying to debug this I pulled up Wireshark. The TCP handshake always completes immediately, and when the first packet from the client (netcat) is sent it is ACKed immediately, telling me that the kernel's TCP stack isn't the bottleneck. My only guess is that the problem lies in the process_sock_loop function. It has a receive which will go back to the top of the function after 60 seconds and try again to get more from the socket. My best guess is that the following is happening:

  • Connection is made, thread moves on to process_sock_loop
  • {active,once} is set
  • Thread receives, but doesn't get data even though it's there
  • After 60 seconds thread goes back to the top of process_sock_loop
  • {active, once} is set again
  • This time the data comes through, things proceed as normal
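
One way to test that guess (a hypothetical diagnostic, not part of the code above) would be to check, right before the inet:setopts(Sock,[{active,once}]) call, whether the looping process actually owns the socket; erlang:port_info/2 reports the current controlling process:

%Hypothetical helper: call it at the top of process_sock_loop/2.
%If the owner is some other process, any {tcp,Sock,Data} message is
%being delivered to that process's mailbox instead of this one.
check_sock_owner(Sock) ->
    case erlang:port_info(Sock, connected) of
    {connected, Pid} when Pid =:= self() ->
        ok;
    {connected, Other} ->
        error_logger:warning_msg("~p does not own socket ~p (owner is ~p)~n",
                                 [self(), Sock, Other]);
    undefined ->
        error_logger:warning_msg("~p: socket ~p is already closed~n",
                                 [self(), Sock])
    end.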

Why this would happen I have no idea, and when we turn those two command connections off, everything goes back to normal and the problem goes away.

Any ideas?


Comments (1)

那些过往 2024-12-18 01:19:44


It's likely that your first call to set {active,once} is failing due to a race condition between your call to spawn and your call to controlling_process.

It will be intermittent, likely depending on host load.

When doing this, I'd normally spawn a function that blocks on receiving something like {take,Sock}, and then calls your loop on the socket, setting {active,once}.

So you'd change the acceptor to spawn the worker, call controlling_process, and then do Pid ! {take,Sock}, something to that effect.
Note: I don't know whether the {active,once} call actually throws when you aren't the controlling process; if it doesn't, then what I just said makes sense.
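
A minimal sketch of that handoff, assuming process_sock/1 and the rest of the loop stay exactly as in the question, might look something like this:

%Acceptor: spawn first, transfer socket ownership, and only then tell
%the worker it may start touching the socket.
accept_loop(Listen) ->
    case gen_tcp:accept(Listen) of
    {ok, Socket} ->
        Pid = spawn(fun() -> ?MODULE:wait_for_sock() end),
        gen_tcp:controlling_process(Socket, Pid),
        Pid ! {take, Socket};
    {error, _} -> do_nothing
    end,
    ?MODULE:accept_loop(Listen).

%Worker: block until ownership has been handed over, then carry on as
%before. {active,once} is never set until after this point, so the
%{tcp,...} messages are guaranteed to land in this process's mailbox.
wait_for_sock() ->
    receive
    {take, Sock} -> ?MODULE:process_sock(Sock)
    %Arbitrary safety timeout in case the handoff message never arrives.
    after 30000 -> exit(timeout)
    end.

With that ordering there is no window in which the worker enables {active,once} while the acceptor still owns the socket.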
