Erlang: strange behaviour of a distributed application
I'm playing with distributed Erlang applications.
Configuration and ideas are taken from:
http://www.erlang.org/doc/pdf/otp-system-documentation.pdf, section 9.9, Distributed Applications
- We have 3 nodes: n1@a2-X201, n2@a2-X201, n3@a2-X201
- We have an application wd that does some useful work :)
Configuration files:
- wd1.config - for the first node:
[{kernel,
  [{distributed, [{wd, 5000, ['n1@a2-X201', {'n2@a2-X201', 'n3@a2-X201'}]}]},
   {sync_nodes_mandatory, ['n2@a2-X201', 'n3@a2-X201']},
   {sync_nodes_timeout, 5000}]},
 {sasl,
  [%% All reports go to this file
   {sasl_error_logger, {file, "/tmp/wd_n1.log"}}]}].
- wd2.config - for the second node:
[{kernel,
  [{distributed, [{wd, 5000, ['n1@a2-X201', {'n2@a2-X201', 'n3@a2-X201'}]}]},
   {sync_nodes_mandatory, ['n1@a2-X201', 'n3@a2-X201']},
   {sync_nodes_timeout, 5000}]},
 {sasl,
  [%% All reports go to this file
   {sasl_error_logger, {file, "/tmp/wd_n2.log"}}]}].
- The config for node n3 looks similar.
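Following the same pattern, wd3.config for the third node would presumably look like this (a sketch inferred from wd1/wd2, not copied from the original setup):

[{kernel,
  [{distributed, [{wd, 5000, ['n1@a2-X201', {'n2@a2-X201', 'n3@a2-X201'}]}]},
   {sync_nodes_mandatory, ['n1@a2-X201', 'n2@a2-X201']},
   {sync_nodes_timeout, 5000}]},
 {sasl,
  [%% All reports go to this file
   {sasl_error_logger, {file, "/tmp/wd_n3.log"}}]}].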
Now start erlang in 3 separate terminals:
- erl -sname n1@a2-X201 -config wd1 -pa $WD_EBIN_PATH -boot start_sasl
- erl -sname n2@a2-X201 -config wd2 -pa $WD_EBIN_PATH -boot start_sasl
- erl -sname n3@a2-X201 -config wd3 -pa $WD_EBIN_PATH -boot start_sasl
Start the application on each of the Erlang nodes:
* application:start(wd).
(n1@a2-X201)1> application:start(wd).
=INFO REPORT==== 19-Jun-2011::15:42:51 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
ok
(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2>
(n3@a2-X201)1> application:start(wd).
ok
(n3@a2-X201)2>
At this point everything is OK. As described in the Erlang documentation, the application is running at node n1@a2-X201.
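One way to double-check which node is actually running the application is to ask every connected node for its running applications; a quick sketch, not part of the original session:

%% run from any connected shell: the node currently running wd reports true
[{N, lists:keymember(wd, 1, rpc:call(N, application, which_applications, []))}
 || N <- [node() | nodes()]].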
Now kill node n1. The application is migrated to n2:
(n2@a2-X201)2>
=INFO REPORT==== 19-Jun-2011::15:46:28 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
Continue our game: kill node n2.
Once more the system works fine. We now have our application at node n3:
(n3@a2-X201)2>
=INFO REPORT==== 19-Jun-2011::15:48:18 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
Now restore nodes n1 and n2.
So:
Erlang R14B (erts-5.8.1) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.8.1 (abort with ^G)
(n1@a2-X201)1>

Eshell V5.8.1 (abort with ^G)
(n2@a2-X201)1>
Nodes n1 and n2 are back.
It looks like I now have to restart the application manually.
* Let's do it at node n2 first:
(n2@a2-X201)1> application:start(wd).
- Looks like it has hung ...
- Now restart it at n1:
(n1@a2-X201)1> application:start(wd).
=INFO REPORT==== 19-Jun-2011::15:55:43 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
ok
(n1@a2-X201)2>
It works. And node n2 has also returned ok:
Eshell V5.8.1 (abort with ^G)
(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2>
At node n3 we see:
=INFO REPORT==== 19-Jun-2011::15:55:43 ===
    application: wd
    exited: stopped
    type: temporary
In general everything looks OK, as described in the documentation, except for the delay in starting the application at node n2.
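For reference, when the application moves between nodes like this, the distributed application controller starts it with a StartType of {takeover, Node} or {failover, Node} rather than normal, and the application callback can react to that. A minimal sketch, assuming a callback module wd_app and a top supervisor wd_sup (neither name is shown in the post):

-module(wd_app).
-behaviour(application).
-export([start/2, stop/1]).

start(normal, _Args) ->
    wd_sup:start_link();
start({takeover, FromNode}, _Args) ->
    %% wd was running on FromNode and this node is taking it over
    error_logger:info_msg("wd taking over from ~p~n", [FromNode]),
    wd_sup:start_link();
start({failover, FromNode}, _Args) ->
    %% FromNode died; note that {failover, Node} is only passed when the
    %% application defines start_phases, otherwise a failover looks like normal
    error_logger:info_msg("wd failing over from ~p~n", [FromNode]),
    wd_sup:start_link().

stop(_State) ->
    ok.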
Now kill node n1 once more:
(n1@a2-X201)2>
User switch command
 --> q
[a2@a2-X201 releases]$
Oops ... everything hangs. The application was not restarted on another node.
Actually, while I was writing this post I realised that sometimes everything is OK and sometimes I have a problem.
Any ideas why there could be problems when restoring the "primary" node and then killing it one more time?
As explained over at Learn You Some Erlang (scroll to the bottom), distributed applications only work well when started as part of a release, not when you start them manually with application:start.
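Roughly, that means describing the system in a .rel file, generating a boot script with systools, and booting each node with it instead of calling application:start/1 by hand. A sketch with made-up version numbers:

%% wd_rel.rel
{release, {"wd_rel", "1"}, {erts, "5.8.1"},
 [{kernel, "2.14.1"},
  {stdlib, "1.17.1"},
  {sasl, "2.1.9.2"},
  {wd, "1.0"}]}.

%% generate wd_rel.boot, then start each node with:
%%   erl -sname n1@a2-X201 -config wd1 -boot wd_rel
systools:make_script("wd_rel", [local]).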
The oddity you're seeing is likely due to you restarting your application entirely on nodes n1/n2 while n3 is still running under the initial application initialisation.
If your application starts any system-wide processes and uses their pids rather than names registered with global, pg or pg2, for example, then you may end up working with two sets of global state.
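As a sketch of the difference (assuming wd_plug_server is a gen_server; the request name is made up): register the server under a global name and resolve the name at call time, instead of handing its pid around.

%% register the server cluster-wide instead of passing its pid around
gen_server:start_link({global, wd_plug_server}, wd_plug_server, [], []).

%% callers look the name up at call time, so they always reach the live instance
gen_server:call({global, wd_plug_server}, get_status).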
If this is the case, the recommended approach is to focus on adding and removing nodes from an existing application rather than restarting the application in its entirety. This way nodes leave and join an existing set of initialised values.
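One way to sketch that: a long-lived process subscribes to node up/down events and adds or removes nodes from the application's working set, instead of stopping and restarting the whole application. The module below is purely illustrative:

-module(wd_node_watcher).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    ok = net_kernel:monitor_nodes(true),   %% subscribe to nodeup/nodedown messages
    {ok, []}.

handle_info({nodeup, Node}, State) ->
    %% a node (re)joined: hand it its share of the already-initialised state
    error_logger:info_msg("node ~p joined~n", [Node]),
    {noreply, State};
handle_info({nodedown, Node}, State) ->
    %% a node left: reassign whatever it was responsible for
    error_logger:info_msg("node ~p left~n", [Node]),
    {noreply, State}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.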