MapReduce or batch jobs?
I have a function that needs to be called on a lot of files (thousands). Each file is independent of the others, and the calls can run in parallel. The output of the function for each file does not (currently) need to be combined with the output for the other files. I have a lot of servers I can scale this out on, but I'm not sure what to do:
1) Run a MapReduce job over the files.
2) Create thousands of jobs (each working on a different file).
Would one solution be preferable to the other?
Thanks!
3 Answers
MapReduce provides significant value when you are distributing work over large datasets. In your case, with small independent jobs on small independent data files, it could be overkill in my opinion.
So I would prefer to run a bunch of dynamically created batch jobs.
Or, alternatively, use a cluster manager and job scheduler, like SLURM: https://computing.llnl.gov/linux/slurm/
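As a rough sketch of that "one batch job per file" idea (not spelled out in the answer itself), assuming SLURM's sbatch is available, the paths are listed one per line in a hypothetical files.txt, and a hypothetical process_one.sh script handles a single file:

#!/bin/bash
# Submit one small batch job per input file.
# Assumes: files.txt lists one path per line; process_one.sh processes a single file.
mkdir -p logs
while IFS= read -r f; do
    sbatch --job-name="proc-$(basename "$f")" \
           --output="logs/$(basename "$f").log" \
           --wrap="./process_one.sh '$f'"
done < files.txt

The trade-off is that the scheduler then has to track thousands of tiny jobs, which is what the job-array suggestion further down avoids.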
由于它只有 1000 个文件(而不是 1000000000 个文件),因此完整的 HADOOP 设置可能有点过头了。 GNU Parallel 试图填补顺序脚本和 HADOOP 之间的空白:
您可能想了解
--sshloginfile
。根据文件存储的位置,您可能还想了解--trc
。观看介绍视频以了解详情:http://www.youtube.com/watch?v=OpaiGYxkSuQ
Since it is only thousands of files (and not billions of files), a full-blown Hadoop setup is probably overkill. GNU Parallel tries to fill the gap between sequential scripts and Hadoop.
You will probably want to learn about --sshloginfile. Depending on where the files are stored, you may want to learn --trc, too. Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
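To make that concrete, here is a hedged sketch (not from the answer) of what such a run could look like, assuming the servers are listed in a hypothetical nodes.txt, the input paths in files.txt, and a hypothetical ./process_one program that writes its result to stdout:

# Spread the per-file runs over the servers listed in nodes.txt (one login per line).
# --trc {}.out is shorthand for --transfer --return {}.out --cleanup: copy the input
# file to the remote host, copy <file>.out back, then remove both remote copies.
# Assumes ./process_one is already installed on every server (otherwise see --basefile).
parallel --sshloginfile nodes.txt --trc {}.out \
    "./process_one {} > {}.out" :::: files.txt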
Use a job array in SLURM. There is no need to submit thousands of jobs... just one: the array job.
This will kick off the same program on as many nodes/cores as are available with the resources you specify, and eventually it will churn through them all. Your only issue is how to map the array index to a file to process. The simplest way is to prepare a text file with a list of all the paths, one per line. Each element of the job array then reads the i-th line of this file and uses it as the path of the file to process, as in the sketch below.
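A minimal sketch of that mapping, assuming the 1000 paths are listed one per line in a hypothetical files.txt and a hypothetical ./process_one program does the per-file work:

#!/bin/bash
#SBATCH --job-name=process-files
#SBATCH --array=1-1000            # one array element per line of files.txt
#SBATCH --output=logs/%A_%a.log   # %A = array job ID, %a = array task index (assumes logs/ exists)

# SLURM_ARRAY_TASK_ID is this element's index; use it to pick the matching line.
FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files.txt)

./process_one "$FILE"

Submitting this script once with sbatch (the script's filename is arbitrary) starts the whole array, and SLURM schedules the 1000 elements onto whatever nodes and cores the requested resources allow.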