Erlang OTP application design

I am struggling a little coming to grips with the OTP development model as I convert some code into an OTP app.

I am essentially making a web crawler and I just don't quite know where to put the code that does the actual work.

I have a supervisor which starts my worker:

-behaviour(supervisor).
-define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).

init(_Args) ->          
  Children = [
    ?CHILD(crawler, worker)
  ],  
  RestartStrategy = {one_for_one, 0, 1},
  {ok, {RestartStrategy, Children}}.

In this design, the Crawler Worker is then responsible for doing the actual work:

-behaviour(gen_server).

start_link() ->
  gen_server:start_link(?MODULE, [], []).

init([]) ->
  inets:start(),        
  httpc:set_options([{verbose_mode,true}]), 
  % gen_server:cast(?MODULE, crawl),
  % ok = do_crawl(),
  {ok, #state{}}.

do_crawl() ->
  % crawl!
  ok.

handle_cast(crawl, State) ->
  ok = do_crawl(),
  {noreply, State};

do_crawl spawns a fairly large number of processes and requests that handle the work of crawling via HTTP.

The question, ultimately, is: where should the actual crawl happen? As can be seen above, I have been experimenting with different ways of triggering the actual work, but I'm still missing some concept essential for grokking the way things fit together.

Note: some of the OTP plumbing is left out for brevity - the plumbing is all there and the system all hangs together

3 Answers

毁梦 2024-10-28 16:52:14

I apologize if I got your question wrong.

A couple of suggestions I can make to guide you in the right direction (or what I consider to be the right direction :)

1 (Rather minor, but still important) I suggest getting the inets startup code out of that worker and putting it in the application startup code (appname_app.erl). As far as I can tell you're using rebar templates, so you should have those.
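
A minimal sketch of what that could look like, assuming the rebar-generated application module (appname_app and appname_sup are placeholders for your actual names):

    -module(appname_app).  %% placeholder; rebar will have generated this
    -behaviour(application).
    -export([start/2, stop/1]).

    start(_StartType, _StartArgs) ->
        inets:start(),  %% start inets once here rather than in each worker
        appname_sup:start_link().

    stop(_State) ->
        ok.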

2 Now, onto the essential parts. In order to make full use of OTP's supervisor model, assuming that you want to spawn a large number of crawlers, it would make a lot of sense to use a simple_one_for_one supervisor instead of one_for_one (read http://www.erlang.org/doc/man/supervisor.html for more details, but the essential part is: simple_one_for_one - a simplified one_for_one supervisor, where all child processes are dynamically added instances of the same process type, i.e. running the same code). So instead of launching just one process to supervise, you will actually specify a "template" of sorts - how to start the worker processes that do the real job. Every worker of that kind is started using supervisor:start_child/2 (http://erldocs.com/R14B01/stdlib/supervisor.html?i=1&search=start_chi#start_child/2). None of those workers will start until you explicitly start them.
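
For instance, assuming the supervisor is registered as crawler_sup and the worker exports start_link/1 taking a URL (both names are assumptions), each crawler could be started like this:

    %% With simple_one_for_one, the list passed to start_child/2 is
    %% appended to the argument list in the child spec's start MFA.
    start_crawler(Url) ->
        supervisor:start_child(crawler_sup, [Url]).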

2.1 Depending on the nature of your crawlers, you might need to assess what kind of restart strategy you need for your workers. Right now in your template you have it set to permanent (even though you have a different kind of supervised child). Here are your options:

 Restart defines when a terminated child process should be restarted. A permanent child process should always be restarted, 
 a temporary child process should never be restarted and a transient child process should be restarted only if it terminates 
 abnormally, i.e. with another exit reason than normal.

So, you might want to have something like:

 -behaviour(supervisor).
 -define(CHILD(I, Type, Restart), {I, {I, start_link, []}, Restart, 5000, Type, [I]}).

 init(_Args) ->
     Children = [
         ?CHILD(crawler, worker, transient)
     ],
     RestartStrategy = {simple_one_for_one, 0, 1},
     {ok, {RestartStrategy, Children}}.

I took the liberty of suggesting transient restarts for these children, as that makes sense for this kind of worker (restart them if they fail to do the job; don't if they complete normally).

2.2 Once you take care of the above items, your supervisor will handle any number of dynamically added worker processes; it will monitor and restart (if necessary) each of them, which adds a great deal to your system's stability and manageability.

3 Now, the worker process. I would assume that each crawler has some particular state it might be in at any given moment. For that reason, I would suggest using gen_fsm (a finite state machine; more about them is available at http://learnyousomeerlang.com/finite-state-machines). This way, each gen_fsm instance you dynamically add to your supervisor should send an event to itself in init/1 (using http://erldocs.com/R14B01/stdlib/gen_fsm.html?i=0&search=send_even#send_event/2).

Something along the lines of:

   init([Arg1]) ->
       gen_fsm:send_event(self(), start),
       {ok, initialized, #state{arg1 = Arg1}}.

   initialized(start, State) ->
       %% do your work here, then either switch to the next state:
       %%   {next_state, NextStateName, State}
       %% or stop the thing:
       {stop, normal, State}.

Note that doing your work could either be contained within this gen_fsm process, or you might consider spawning a separate process for it, depending on your particular needs.

You might want to have multiple state names for the different phases of your crawl, if that proves necessary.

Either way, I hope this will help you design your application in a somewhat OTP-ish way. Please let me know if you have any questions; I'll be happy to add something if necessary.

这样的小城市 2024-10-28 16:52:14

Are you actually keeping track of any state in your gen_server?

If the answer is yes, it looks like you are doing things the right way. Note that since messages are serialized, with the above implementation you could not have two crawls going at the same time. If you need concurrent crawls, see the answer to my question here.
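
One way to get concurrent crawls anyway (a sketch of the general idea rather than the linked answer; the message shapes and the do_crawl/1 helper are hypothetical) is to spawn the actual work from handle_cast so the server stays responsive:

    %% Spawn the crawl so the server's mailbox isn't blocked while it runs.
    handle_cast({crawl, Url}, State) ->
        Server = self(),
        spawn_link(fun() ->
            Result = do_crawl(Url),
            gen_server:cast(Server, {crawl_done, Url, Result})
        end),
        {noreply, State};
    handle_cast({crawl_done, _Url, _Result}, State) ->
        %% fold the result into State here
        {noreply, State}.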

If the answer is no, then you can possibly ditch the server and the supervisor and just use the application module for any initialization code, as seen here.
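
As a minimal sketch of that shape (the module name is a placeholder, and the linked example is not reproduced here; note this relies on the application callback returning some top-level pid):

    -module(crawler_app).  %% hypothetical module name
    -behaviour(application).
    -export([start/2, stop/1]).

    %% No supervisor, no gen_server: the application callback starts
    %% inets and kicks off the crawl in a single linked process.
    start(_Type, _Args) ->
        inets:start(),
        Pid = spawn_link(fun do_crawl/0),
        {ok, Pid}.

    stop(_State) ->
        ok.

    do_crawl() ->
        %% crawl!
        ok.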

Finally, lhttpc and ibrowse are considered better alternatives to inets. I use lhttpc in production on my ad servers and it works great.

千笙结 2024-10-28 16:52:14

My solution to this problem would be to look into the Erlang Solutions "jobs" application, which can be used to schedule jobs (i.e., requesting pages), let a separate system handle each job, bound the concurrency, and so on.

You can then feed new URLs into a crawl_sched_mgr process, which filters the URLs and then spawns new jobs. You could also let the requestors do this themselves.
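
A rough sketch of how that might look with jobs (the queue name and rate regulator are illustrative; check the jobs documentation for the exact options):

    %% At startup: a queue admitting at most 10 jobs per second
    %% (queue name and rate are illustrative).
    setup() ->
        jobs:add_queue(crawl_q, [{standard_rate, 10}]).

    %% Each page fetch runs under the queue's regulation, so the
    %% total request rate stays bounded however many workers you run.
    fetch(Url) ->
        jobs:run(crawl_q, fun() -> httpc:request(Url) end).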

If you don't want to use jobs, Yurii's suggestion is the way to go.
