Estimating Hadoop scalability performance on a pseudo-distributed node?
Are there any tools, packages, or methodologies available to estimate or simulate the scalability of Hadoop using only a single machine running in pseudo-distributed mode? Such a system would need to make accurate estimates based on jobs that do not interfere with each other in the simulation (e.g., through blocked I/O).
The way I imagine this working is that I would run all my map/reduce jobs sequentially and use some metric to estimate how well the system scales (e.g., take the longest-running map job and assume it will be the bottleneck for the overall run time).
Additionally, I have multiple map/reduce jobs that are chained together to produce the output.
Comments (1)
I think it largely depends on the nature of your job. Let us take a few examples:
1. Your job has heavy input formatting and mapper processing, with minimal data passed to the reducers. In this case I would estimate that a pseudo-distributed cluster will realistically reflect real cluster performance (per slot), and you can assume that a 5-node cluster will have about 5x the performance. I would suggest using enough data that the job takes at least 5-10 times as long as the job start-up time. This estimate will be better if you have enough splits to ensure data locality during processing.
If you plan to have a lot of relatively small files, put enough of them in your test to simulate the per-task overhead.
2. You are relying heavily on Hadoop's distributed sort capability (shuffling). Its performance on one node and on a real cluster can be quite different, and the factor is hard to estimate.
To summarize: the throughput of the mappers and, to some extent, the reducers, in MB/sec per slot, can be estimated from the above. A real cluster will probably not have better performance per slot.