Distributed scheduling system for R scripts



I would like to schedule and distribute the execution of R scripts (using RServe, for instance) across several machines, Windows or Ubuntu, where each task runs on only one machine.

I don't want to reinvent the wheel and would like to use an existing system that distributes these tasks in an optimal manner, ideally with a GUI to monitor the proper execution of the scripts.

1/ Is there an R package or library that can be used for this?

2/ One library that seems to be quite widely used is MapReduce with Apache Hadoop. I have no experience with this framework. What installation/plugin/setup would you advise for my purpose?

Edit: Here are more details about my setup:
I have an office full of machines (small servers and workstations) that are sometimes also used for other purposes. I want to use the computing power of all these machines and distribute my R scripts across them.
I also need a scheduler, i.e. a tool to run the scripts at a fixed time or at regular intervals.
I am using both Windows and Ubuntu, but a good solution on one of the two systems would be sufficient for now.
Finally, I don't need the server to collect the results of the scripts. The scripts do things like accessing a database and saving files, but do not return anything. I just want to get back the errors/warnings, if there are any.
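For illustration, a minimal wrapper along these lines could capture just the errors/warnings while discarding the results; the file paths are hypothetical, and any OS scheduler (cron on Ubuntu, Task Scheduler on Windows) could invoke it at a fixed time:

    # run_job.R -- sketch of a wrapper that runs one job script and logs its
    # errors/warnings; all paths here are hypothetical.
    # Example cron entry:  0 2 * * * Rscript /opt/jobs/run_job.R /opt/jobs/etl.R
    args <- commandArgs(trailingOnly = TRUE)         # args[1]: script to run
    log  <- file("/var/log/r-jobs.log", open = "a")  # shared log file

    withCallingHandlers(
      tryCatch(
        source(args[1]),
        error = function(e)
          writeLines(paste(Sys.time(), "ERROR:", conditionMessage(e)), log)
      ),
      warning = function(w) {
        writeLines(paste(Sys.time(), "WARNING:", conditionMessage(w)), log)
        invokeRestart("muffleWarning")               # continue after logging
      }
    )

    close(log)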


吹梦到西洲 2024-12-30 12:38:32


If what you want to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF for more details. The gist is as follows:

Why write a doRedis package? After all, the foreach package already
has available many parallel back end packages, including doMC, doSNOW
and doMPI. The doRedis package allows for dynamic pools of workers.
New workers may be added at any time, even in the middle of running
computations. This feature is relevant, for example, to modern cloud
computing environments. Users can make an economic decision to "turn
on" more computing resources at any time in order to accelerate
running computations. Similarly, modern cluster resource allocation
systems can dynamically schedule R workers as cluster resources
become available.
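To make that concrete, here is a minimal doRedis sketch. The host name "master-host", the queue name "rjobs", and the job-script paths are all hypothetical; it assumes a Redis server is running on the master and the doRedis/foreach packages are installed everywhere:

    # Minimal doRedis sketch -- host, queue name and paths are hypothetical.
    library(foreach)
    library(doRedis)
    registerDoRedis("rjobs", host = "master-host")

    # On every machine that should take work, start a worker in its own R session:
    #   library(doRedis); redisWorker("rjobs", host = "master-host")

    # .errorhandling = "pass" returns errors as values instead of stopping,
    # so the master gets them back even though the scripts return nothing useful.
    status <- foreach(i = 1:4, .errorhandling = "pass") %dopar% {
      source(sprintf("/shared/jobs/job%d.R", i))
    }

    removeQueue("rjobs")  # remove the work queue when done

Since workers can join or leave the "rjobs" queue at any time, this maps well onto office machines that are sometimes busy with other work.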

Hadoop works best if the machines running it are dedicated to the cluster rather than borrowed. There's also considerable overhead to setting up Hadoop, which can be worth the effort if you need the map/reduce algorithm and distributed storage that Hadoop provides.

So what, exactly, is your configuration? Do you have an office full of machines you want to distribute R jobs on? Do you have a dedicated cluster? Is this going to be EC2 or otherwise "cloud" based?

The devil is in the details, so you can get better answers if the details are explicit.

If you want the workers to do jobs and have the results of the jobs collected back on one master node, you'll be much better off using a dedicated R solution rather than a system like TakTuk or dsh, which are more general parallelization tools.

和我恋爱吧 2024-12-30 12:38:32


Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may be more effort.
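As a sketch of the roll-your-own route, assuming passwordless SSH to each worker and Rscript on each worker's PATH (the host names and job paths below are made up):

    # Dispatch one job per machine over SSH -- a sketch, not a hardened solution.
    jobs  <- c("/shared/jobs/job1.R", "/shared/jobs/job2.R")  # hypothetical paths
    hosts <- c("ws01", "ws02")                                # hypothetical hosts

    for (i in seq_along(jobs)) {
      # Capture stdout/stderr so errors and warnings come back to the master.
      out <- system2("ssh", c(hosts[i], "Rscript", jobs[i]),
                     stdout = TRUE, stderr = TRUE)
      if (!is.null(attr(out, "status")))   # non-NULL status means non-zero exit
        message("Job failed on ", hosts[i], ":\n", paste(out, collapse = "\n"))
    }

This launches each job synchronously; the timing itself would still come from the OS scheduler (cron or Task Scheduler) on the master.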
