Distributed scheduling system for R scripts
I would like to schedule and distribute the execution of R scripts (using RServe, for instance) across several machines, either Windows or Ubuntu; each task runs on only one machine.
I don't want to reinvent the wheel and would like to use an existing system to distribute these tasks in an optimal manner, ideally with a GUI to monitor the proper execution of the scripts.
1/ Is there an R package or library that can be used for that?
2/ One library that seems to be quite widely used is mapReduce with Apache Hadoop.
I have no experience with this framework. What installation/plugin/setup would you advise for my purpose?
Edit: Here are more details about my setup:
I do indeed have an office full of machines (small servers or workstations) that are sometimes also used for other purposes. I want to use the computing power of all these machines and distribute my R scripts on them.
I also need a scheduler, e.g. a tool to schedule the scripts at a fixed time or at regular intervals.
I am using both Windows and Ubuntu, but a good solution on one of the systems would be sufficient for now.
Finally, I don't need the server to get back the results of the scripts. The scripts do things like accessing a database, saving files, etc., but do not return anything. I would just like to get back the errors/warnings, if there are any.
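Since the scripts only need to report errors and warnings rather than return results, one simple pattern (independent of whichever distribution system is used) is to run each script through a small wrapper that logs any conditions to a file. A minimal sketch, where "job.R" and "job.log" are placeholder names for the real script and its log:

# Hypothetical wrapper: run a script and append its errors/warnings to a log file
log_file <- "job.log"

log_msg <- function(prefix, msg) {
  cat(sprintf("[%s] %s: %s\n", format(Sys.time()), prefix, msg),
      file = log_file, append = TRUE)
}

withCallingHandlers(
  tryCatch(
    source("job.R"),                            # the actual work: database access, file writes, ...
    error = function(e) log_msg("ERROR", conditionMessage(e))
  ),
  warning = function(w) {
    log_msg("WARNING", conditionMessage(w))
    invokeRestart("muffleWarning")              # continue after logging the warning
  }
)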
2 Answers
If what you want to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF for more details. The gist is as follows:
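Roughly, one machine starts a Redis server and registers the doRedis backend for a named work queue, each worker machine attaches to the same queue, and any foreach loop using %dopar% is then spread across the workers. A minimal sketch of that workflow; the host name "redis-host" and the queue name "jobs" are placeholders, not anything required by the package:

# Master side: register the doRedis backend and submit work
library(foreach)
library(doRedis)
registerDoRedis("jobs", host = "redis-host")   # queue name and host are placeholders

results <- foreach(i = 1:10, .combine = c) %dopar% {
  sqrt(i)                                      # replace with the real work of your script
}

removeQueue("jobs")                            # clean up the queue when finished

# Worker side (run in an R session on each machine):
# library(doRedis)
# redisWorker("jobs", host = "redis-host")     # blocks and processes tasks from the queue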
Hadoop works best if the machines running Hadoop are dedicated to the cluster and not borrowed. There's also considerable overhead to setting up Hadoop, which can be worth the effort if you need the map/reduce algorithm and distributed storage provided by Hadoop.
So what, exactly, is your configuration? Do you have an office full of machines you want to distribute R jobs on? Do you have a dedicated cluster? Is this going to be EC2 or other "cloud" based?
The devil is in the details, so you can get better answers if the details are explicit.
If you want the workers to do jobs and have the results of the jobs collected back on one master node, you'll be much better off using a dedicated R solution rather than a more general parallelization tool like TakTuk or dsh.
Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may be more effort.
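If you do end up rolling your own mechanism, a very basic version can even be driven from R by pushing Rscript invocations over ssh. A rough sketch, assuming passwordless ssh access to hypothetical hosts "host1" and "host2" and scripts already present on each machine at the listed paths (all names are placeholders):

# Naive round-robin dispatch of scripts to remote machines over ssh
hosts   <- c("host1", "host2")
scripts <- c("/path/to/job1.R", "/path/to/job2.R", "/path/to/job3.R")

for (i in seq_along(scripts)) {
  host   <- hosts[(i - 1) %% length(hosts) + 1]        # pick the next host in turn
  cmd    <- sprintf("ssh %s 'Rscript %s' 2>&1", host, scripts[i])
  output <- system(cmd, intern = TRUE)                 # capture stdout/stderr from the remote run
  status <- attr(output, "status")                     # non-NULL exit status means the script failed
  if (!is.null(status)) message("FAILED on ", host, ": ", scripts[i])
}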