MapReduce or batch jobs?

Published on 2024-11-19 14:10:54


I have a function which needs to be called on a lot of files (1000's). Each file is independent of the others and can be processed in parallel. The output of the function for each file does not (currently) need to be combined with the others. I have a lot of servers I can scale this out on, but I'm not sure what to do:

1) Run a MapReduce on it

2) Create 1000's of jobs (each has a different file it works on).

Would one solution be preferable to another?

Thanks!


Comments (3)

攒眉千度 2024-11-26 14:10:54


MapReduce provides significant value when you need to distribute a large-dataset workload. In your case, with many small, independent jobs over small, independent data files, it is in my opinion probably overkill.

So I would prefer to run a bunch of dynamically created batch jobs.
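
A minimal sketch of that idea, assuming plain bash on a single machine, a hypothetical input directory data/, an existing output directory out/, and a placeholder command called your_processing (none of these names come from the original question):

    MAX_JOBS=16                              # concurrent processes per machine; tune to your core count
    for f in data/*; do                      # data/ is a hypothetical input directory
        your_processing "$f" > "out/$(basename "$f").out" &   # one background job per file
        while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
            wait -n                          # wait for any running job to finish (requires bash 4.3+)
        done
    done
    wait                                     # wait for the last batch to drain

Spreading this across several servers by hand (wrapping the loop in ssh calls) gets messy quickly, which is one reason to reach for a scheduler instead.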

Alternatively, use a cluster manager and job scheduler such as SLURM: https://computing.llnl.gov/linux/slurm/

SLURM: A Highly Scalable Resource Manager

SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

坚持沉默 2024-11-26 14:10:54


Since it is only 1000's of files (and not 1000000000's of files), a full-blown Hadoop setup is probably overkill. GNU Parallel tries to fill the gap between sequential scripts and Hadoop:

ls files | parallel -S server1,server2 your_processing {} '>' out{}

You will probably want to learn about --sshloginfile. Depending on where the files are stored, you may want to learn about --trc, too; a sketch using both follows the video link below.

Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
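
A hedged sketch of how --sshloginfile and --trc could fit together, reusing the hypothetical your_processing command and a local files/ directory from the one-liner above; nodes.txt is just an assumed file name:

    # nodes.txt: one ssh login per line, e.g. "user@server1", or "8/user@server1" to cap that host at 8 jobs
    # --trc {}.out: transfer each input file to the remote host, return {}.out afterwards, and clean up both there
    ls files/* | parallel --sshloginfile nodes.txt --trc {}.out your_processing {} '>' {}.out

The quoted '>' makes the redirection happen on the remote side rather than locally, just as in the one-liner above.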

轮廓§ 2024-11-26 14:10:54


Use a job array in SLURM. No need to submit 1000's of jobs; just one, the array job.

This will kick off the same program on as many nodes/cores as are available with the resources you specify, and eventually it will churn through them all. Your only issue is how to map the array index to a file to process. The simplest way is to prepare a text file with a list of all the paths, one per line; each element of the job array then reads the i-th line of that file and uses it as the path of the file to process.
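
A minimal sketch of such an array job, assuming a path list called filelist.txt and the same hypothetical your_processing command; the resource directives are placeholders to adjust for your cluster:

    #!/bin/bash
    #SBATCH --job-name=process_files
    #SBATCH --array=1-1000        # one array task per line of filelist.txt; 1-1000%50 would cap concurrency at 50
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1

    # filelist.txt (hypothetical) holds one input path per line, e.g. built with:  ls /path/to/files/* > filelist.txt
    FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
    your_processing "$FILE" > "${FILE}.out"

Save it as, say, process_files.sh and submit it once with sbatch process_files.sh; SLURM sets SLURM_ARRAY_TASK_ID to a different value (1 through 1000 here) in each task.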
