Can we use MapReduce and Hadoop to process batch jobs in parallel?
Our organization has hundreds of batch jobs that run overnight. Many of these jobs take 2, 3, or 4 hours to complete; some even take up to 7 hours. Currently, these jobs run in single-threaded mode, so our attempts to increase performance are limited to vertical scaling of the machine with additional CPU and memory.
We are exploring the idea of leveraging parallel-processing techniques, such as MapReduce, to cut down the time required for these jobs to complete. Most of our batch processes pull in large data sets, typically from a database, process the data row by row, and dump the result as a file into another database. In most cases, the processing of individual rows is independent of other rows.
Now we are looking at MapReduce frameworks to break these jobs up into smaller pieces for parallel processing. Our organization has over 400 employee desktop PCs, and we would like to utilize these machines outside business hours as the parallel-processing grid.
What do we need to get this working? Is Hadoop the only component required? Do we also need HBase? We are slightly confused by all the different offerings and need some assistance.
Thanks
There are a couple of questions here -- about MapReduce, and about making use of 400 PCs for the job.
What you're describing is definitely possible, but I think it might be too early to be choosing a particular programming model like MapReduce at this stage.
Let's take the idea of using 400 desktops first. This is, in principle, completely doable. It has its own challenges -- note, for instance, that leaving a bunch of desktop-class machines on overnight will never be as power-efficient as dedicated cluster nodes. And the desktop nodes are not as reliable as cluster nodes: some might be shut off, some might have network problems, and some might have something left running on them that slows down a compute job. But there are frameworks that can deal with this. The one I'm familiar with is Condor, which made its name by exploiting exactly this sort of situation. It runs on Windows and Linux (and does fine in mixed environments), and is very flexible; you can automatically have it make use of idle machines even during the day.
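To give a feel for what that looks like in practice, here is a minimal sketch of a Condor submit description that fans one nightly job out as 100 independent tasks across whatever machines are idle. The executable name, arguments, and file paths are placeholders I made up, not anything from your environment:

```
# Hypothetical HTCondor submit description: fan one batch job out as
# 100 independent tasks, one input chunk per task.
universe   = vanilla
executable = process_chunk.exe
arguments  = --chunk $(Process)

# Per-task stdout/stderr plus a shared job log (placeholder paths).
output     = out/chunk_$(Process).out
error      = out/chunk_$(Process).err
log        = nightly_batch.log

# Queue 100 copies; $(Process) runs from 0 to 99.
queue 100
```

Condor then takes care of matching each task to an available machine and re-running tasks whose desktop gets switched off mid-run.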
There are likely other such "opportunistic computing" systems out there, and maybe others can suggest them. You could use other clustering solutions too and run your jobs through a traditional queueing system (SGE, Rocks, etc.), but most of those assume that the machines are always theirs to use.
As to MapReduce, if most of your computing really is of the form (independent accesses of a DB) → (independent computations) → (put independent rows into a second DB), I think MapReduce might even be overkill for what you want. You could probably script something to partition the job into individual tasks and run them individually, without the overhead of an entire MapReduce system and its associated, very particular filesystem; a rough sketch of that idea is below. But if you want to, you can run MapReduce on top of a scheduling / resource-manager type system like Condor. Hadoop on top of Condor has a long history.
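To make the "script something to partition the job" idea concrete, here is a rough, self-contained sketch. The rows and the per-row logic are invented placeholders; locally it fans the chunks out to worker processes, but each chunk could just as well become a separate job submitted to the grid:

```python
# Sketch: split a row-independent batch job into independent chunks.
# process_row() and the fake input rows are placeholders for your real
# per-row business logic and database extract.
from multiprocessing import Pool

def process_row(row):
    # Stand-in for the real per-row computation.
    key, value = row
    return (key, value * 2)

def process_chunk(rows):
    # Each chunk is fully independent, so it can run on any machine.
    return [process_row(r) for r in rows]

def partition(rows, n_chunks):
    # Simple round-robin partitioning of the input rows.
    chunks = [[] for _ in range(n_chunks)]
    for i, row in enumerate(rows):
        chunks[i % n_chunks].append(row)
    return chunks

if __name__ == "__main__":
    # Fake data standing in for "pull a large data set from a database".
    rows = [(i, i) for i in range(100000)]
    chunks = partition(rows, n_chunks=8)

    # Locally: worker processes. On a grid: one submitted job per chunk.
    with Pool(processes=8) as pool:
        results = pool.map(process_chunk, chunks)

    print(sum(len(c) for c in results), "rows processed")
```

The point is just that once the work is expressed as "here are N independent chunks, process each one", any scheduler (Condor, SGE, or plain scripts over SSH) can spread them across your desktops, with or without Hadoop.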