In practice, how many machines does it take for Hadoop / MapReduce / Mahout to speed up a parallelizable computation?
I need to do some heavy machine learning computations. I have a small number of machines idle on a LAN. How many machines would I need in order for distributing my computations using hadoop / mapreduce / mahout to be significantly faster than running on a single machine without these distributed frameworks? This is a practical question of computational overhead versus gains, as I assume that distributing between just 2 machines would make the overall time worse than not distributing and simply running on a single machine (just because of all the overhead involved in distributing the computations).
Technical note: Some of the heavy computations are very parallelizable. All of them are, as long as each machine has its own copy of the raw data.
4 Answers
A "plain" Java program and a Hadoop-based, MapReduce-based implementation are very different beasts and are hard to compare. It's not like Hadoop parallelizes a little bit of your program; the whole program is rewritten in an entirely different form from top to bottom.
Hadoop has overheads: just the overhead of starting a job, and starting workers like mappers and reducers. It introduces a lot more time spent serializing/deserializing data, writing it locally, and transferring it to HDFS.
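To make the "entirely different form" point concrete, below is the shape of the canonical Hadoop word-count example (essentially what the official tutorial shows): even a trivial computation gets split into a Mapper, a Reducer, and a Job driver, with data serialized between every stage.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Every record is deserialized, tokenized, and re-serialized
      // as (word, 1) pairs -- part of the per-record overhead.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Values for each key arrive only after a full shuffle/sort,
      // which is more machinery paid for before any real work.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

All of that machinery (task startup, (de)serialization, local spills, HDFS writes) is paid before your actual computation does anything useful.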
A Hadoop-based implementation will always consume more resources. So, it's something to avoid unless you can't avoid it. If you can run a non-distributed computation on one machine, the simplest practical advice is to not distribute. Save yourself the trouble.
In the case of Mahout recommenders, I can tell you that, very crudely, a Hadoop job incurs 2-4x more computation than a non-distributed implementation on the same data. Obviously that depends immensely on the algorithm and tuning choices. But to give you a number: I wouldn't bother with a Hadoop cluster of fewer than 4 machines. (At 4x overhead, you need roughly 4 machines' worth of parallelism just to break even with the single non-distributed machine.)
Obviously, if your computation can't fit on one of your machines, you have no choice but to distribute. Then the tradeoff is what kind of wall-clock time you can allow versus how much computing power you can devote. The reference to Amdahl's law is right, though it doesn't consider the significant overhead of Hadoop. For example, to parallelize N ways, you need at least N mappers/reducers, and incur N times the per-mapper/reducer overhead. There's some fixed startup/shutdown time too.
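As a back-of-envelope illustration of that point (a rough model of my own, not anything from the Hadoop docs), you could fold the overhead terms into Amdahl's law like this:

```latex
% Rough model (my own notation, purely illustrative):
%   T_1     : non-distributed single-machine running time
%   p       : parallelizable fraction of the work
%   k       : Hadoop's computation multiplier (crudely 2-4x, as above)
%   T_start : fixed job startup/shutdown time
%   c       : per-mapper/reducer overhead (N workers cost N*c in total
%             compute, though much of it overlaps in wall-clock time)
\[
  T(N) \approx T_{\text{start}} + (1 - p)\,T_1 + \frac{p\,k\,T_1}{N} + c
\]
% Distributing only pays off when T(N) < T_1, which is why a very
% small cluster can easily be slower than one plain machine.
```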
See Amdahl's Law
Without specifics it's difficult to give a more detailed answer.
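For reference, Amdahl's law says that if a fraction p of the work can be parallelized across N machines, the best-case speedup is:

```latex
% Amdahl's law: S(N) is the maximum speedup when a fraction p of
% the program is parallelized across N machines.
\[
  S(N) = \frac{1}{(1 - p) + p/N},
  \qquad
  \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
\]
```

So even with unlimited machines, the serial fraction (1 - p) caps the speedup, before any framework overhead is counted.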
I know this has already been answered, but I'll throw my hat into the ring. I can't give you a general rule of thumb. The performance increase really depends on many factors:
If you have a highly connected algorithm like a Bayes net, neural nets, Markov models, PCA, or EM, then a lot of the Hadoop program's time will be spent getting instances processed, split, and recombined (assuming you have a large number of nodes per instance, more than one machine can handle). If you have a situation like this, network traffic will become more of an issue.
If you have an algorithm such as path finding or simulated annealing, it is easy to separate instances into their own map/reduce jobs (see the sketch below). These types of algorithms can be very quick.
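A hypothetical, self-contained sketch of that pattern (the class names and toy objective are mine, purely for illustration): each input line carries a random seed, each map call runs one independent annealing-style search, and setting the reducer count to zero skips the shuffle entirely.

```java
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelAnnealing {

  public static class AnnealMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Each input line is one independent problem instance: a seed.
      long seed = Long.parseLong(value.toString().trim());
      Random rng = new Random(seed);

      // Toy objective: minimize f(x) = (x - 3)^2 by a random walk
      // with a shrinking step size (a stand-in for real annealing;
      // acceptance here is greedy for brevity).
      double x = rng.nextDouble() * 20 - 10;
      double best = (x - 3) * (x - 3);
      for (double step = 1.0; step > 1e-4; step *= 0.99) {
        double candidate = x + (rng.nextDouble() * 2 - 1) * step;
        double score = (candidate - 3) * (candidate - 3);
        if (score < best) {
          best = score;
          x = candidate;
        }
      }
      context.write(new Text("seed-" + seed), new DoubleWritable(best));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "parallel annealing");
    job.setJarByClass(ParallelAnnealing.class);
    job.setMapperClass(AnnealMapper.class);
    job.setNumReduceTasks(0);  // map-only: no shuffle or reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the instances never exchange data, the only real costs are job startup and reading the input, which is why this class of algorithm distributes so well.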
Another aspect is what bottleneck forces you to use MapReduce in the first place. If your data fits comfortably on a single machine and you are merely after a speed boost, you may prefer a GPU implementation. GPUs are easier to set up and use, even on a single machine, with promising results.