Elixir Supervisor does not restart timed-out Poolboy GenServer after DNS timeout

Posted on 2025-01-23 09:13:16

I'm trying to use Poolboy to manage a pool of workers that make a large number of DNS requests. Some of these DNS queries time out, which raises an error and terminates the GenServer worker:

07:44:29.585 [error] GenServer #PID<0.382.0> terminating
** (Socket.Error) timeout
    (socket 0.3.13) lib/socket/datagram.ex:46: Socket.Datagram.recv!/2
    (dns 2.3.0) lib/dns.ex:76: DNS.query/4
    (dmarc_hijack 0.1.0) lib/dmarc.ex:5: Dmarc.fetch_dmarc_record/1
    (dmarc_hijack 0.1.0) lib/dmarc_hijack/worker.ex:16: DmarcHijack.Worker.handle_call/3
    (stdlib 3.17.1) gen_server.erl:721: :gen_server.try_handle_call/4
    (stdlib 3.17.1) gen_server.erl:750: :gen_server.handle_msg/6
    (stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.717.0>): {:fetch_process_dmarc, "12580.tv"}
State: nil
Client #PID<0.717.0> is dead

Eventually, this kills all of my Poolboy workers, and the Supervisor does not appear to restart the worker GenServers. The application then stops doing any work because no workers are left, but execution does not halt.

I'm using try/catch to handle errors in both the Poolboy task and the DNS client:

Poolboy task:

  defp setup_task(domain) do
    Task.async(fn ->
      :poolboy.transaction(
        :worker,
        fn pid ->
          try do
            GenServer.call(pid, {:fetch_process_dmarc, domain})
          catch :exit, reason ->
            # Handle timeout
            Logger.warning("Probably just got a timeout on #{domain}. Real reason follows:")
            Logger.warning(inspect(reason))
            {domain, {:error, :timeout}}
          end
        end,
        @timeout
      )
    end)
  end

DNS query code:

defmodule Dmarc do
  def fetch_dmarc_record(domain) do
    try do
      DNS.query("_dmarc.#{domain}", :txt, {select_random_dns_server(), 53})
      |> extract_dmarc_record_from_txt()
    catch
      error ->
        Logger.error(error)
        {:error, :timeout}
    end
  end

It makes the most sense to me to handle the DNS query timeout at the point where the query is made, but the try/catch block there is not handling it. I think this is because the recv! call raises on a timeout and bypasses my try/catch block, but I could be wrong here.
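One way to check that suspicion in isolation: in Elixir, a bare `catch value ->` clause only matches thrown values, while raised exceptions are handled by `rescue`. The sketch below is a minimal illustration, not the code from the post. `RescueVsCatch` and `flaky_query/0` are hypothetical names, and `flaky_query/0` simply raises a `RuntimeError` as a stand-in for a bang function such as `Socket.Datagram.recv!/2`.

```elixir
defmodule RescueVsCatch do
  # Hypothetical stand-in for a call that raises an exception on timeout.
  defp flaky_query, do: raise(RuntimeError, "timeout")

  # A bare `catch value ->` clause only matches values passed to `throw/1`,
  # so the exception raised by flaky_query/0 propagates to the caller.
  def with_catch do
    try do
      flaky_query()
    catch
      value -> {:caught, value}
    end
  end

  # `rescue` matches raised exceptions, so the failure becomes a tagged tuple.
  def with_rescue do
    try do
      flaky_query()
    rescue
      e -> {:error, Exception.message(e)}
    end
  end
end
```

Calling `RescueVsCatch.with_rescue()` returns `{:error, "timeout"}`, whereas `RescueVsCatch.with_catch()` lets the `RuntimeError` escape. A two-element clause such as `catch :error, reason ->` would also match a raised exception.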

Based on my understanding, the supervisor should restart the terminated GenServers, but for whatever reason, once they terminate from the timeout they are never restarted.
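For comparison, the restart behaviour described here can be reproduced with a plain `Supervisor` outside of Poolboy. This is only a sketch of the expectation under stated assumptions: `CrashyWorker` and `RestartDemo` are hypothetical modules, the crash is simulated with `raise`, and nothing here shows how Poolboy supervises its own workers internally.

```elixir
defmodule CrashyWorker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  # Simulates a handler that raises, as the DNS timeout does in the post.
  @impl true
  def handle_call(:boom, _from, _state), do: raise("simulated DNS timeout")
end

defmodule RestartDemo do
  # Crashes the worker once and returns {old_pid, new_pid}.
  def run do
    {:ok, sup} = Supervisor.start_link([CrashyWorker], strategy: :one_for_one)
    old_pid = Process.whereis(CrashyWorker)

    # The caller exits when the server dies mid-call, so catch the exit here.
    try do
      GenServer.call(CrashyWorker, :boom)
    catch
      :exit, _reason -> :ok
    end

    new_pid = wait_for_restart(old_pid, 100)
    Supervisor.stop(sup)
    {old_pid, new_pid}
  end

  # Polls the registered name until a different pid appears or attempts run out.
  defp wait_for_restart(_old_pid, 0), do: nil

  defp wait_for_restart(old_pid, attempts) do
    case Process.whereis(CrashyWorker) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        wait_for_restart(old_pid, attempts - 1)
    end
  end
end
```

With the default `:permanent` restart value that `use GenServer` puts in the child spec, `RestartDemo.run()` should return two different pids. Running the same kind of check against the pool (for example by comparing worker pids before and after a timeout) may help show whether workers are being replaced at all.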

Application config with Supervisor details:

defmodule DmarcHijack.Application do
  use Application

  defp poolboy_config do
    [
      name: {:local, :worker},
      worker_module: DmarcHijack.Worker,
      size: 5,
      max_overflow: 5
    ]
  end

  @impl true
  def start(_type, _args) do
    children = [
      DmarcHijack.ResultsBucket,
      :poolboy.child_spec(:worker, poolboy_config())

    ]

    opts = [strategy: :one_for_one, name: DmarcHijack.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

I'd really appreciate any help available to debug this issue. Thanks!
