Tools for optimizing the scalability of a Hadoop application?
I'm working with my team on a small application that takes a lot of input (the logfiles of a single day) and produces useful output after several (currently 4, perhaps 10 in the future) map-reduce steps (Hadoop & Java).
Now I've done a partial POC of this app and have run it on 4 old desktops (my Hadoop test cluster). What I've noticed is that if you do the partitioning "wrong", the horizontal-scaling characteristics are wrecked beyond recognition. I found that a test run on a single node (say 20 minutes) compared with one on all 4 nodes yielded only a 50% speedup (about 10 minutes), where I expected a 75% (or at least >70%) speedup (about 5 or 6 minutes).
The general principle for making map-reduce scale horizontally is to ensure that the partitions are as independent as possible. I found that in my case I did the partitioning of each step "wrong" because I simply used the default hash partitioner; this makes records jump to a different partition in the next map-reduce step.
I expect (I haven't tried it yet) that I can speed the thing up and make it scale much better if I can convince as many records as possible to stay in the same partition (i.e. by building a custom partitioner).
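For what it's worth, here is a minimal sketch of what such a custom partitioner could look like, under the assumption (mine, not anything from the actual app) that each key carries a stable identifier such as a session ID as a prefix; the "sessionId:rest" key layout and the class name are hypothetical:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a custom partitioner that partitions only on the part of the
// key that stays constant across the chained map-reduce steps, so that
// related records keep landing in the same partition. Assumes hypothetical
// "sessionId:rest" keys; adapt the key parsing to your own key layout.
public class StableKeyPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String stablePart = key.toString().split(":", 2)[0];
        // Mask off the sign bit so the modulo result is never negative.
        return (stablePart.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be wired into each job with `job.setPartitionerClass(StableKeyPartitioner.class);` so that every step of the chain uses the same placement rule.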
In the case described above I found this solution by hand: I deduced what went wrong by thinking hard about it during my drive to work.
Now my question to you all:
- What tools are available to detect issues like this?
- Are there any guidelines/checklists to follow?
- How do I go about measuring things like "the number of records that jumped partition"? (One possible approach is sketched below.)
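On the last point, one way this could be measured, sketched under an assumption of my own: have each reducer tag its output values with the partition number it ran in, then in the next step's mapper recompute the partition and bump Hadoop counters when the two differ. The "partition<TAB>payload" value layout and the class name are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: assumes the previous step's reducer prefixed each value with the
// partition number it ran in ("partition\tpayload"). This mapper recomputes
// the partition for the current step and counts jumps via Hadoop counters.
public class PartitionJumpMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        int previousPartition = Integer.parseInt(parts[0]);

        // Must use the same formula as the partitioner of this job.
        int currentPartition = (key.toString().hashCode() & Integer.MAX_VALUE)
                % context.getNumReduceTasks();

        if (previousPartition != currentPartition) {
            context.getCounter("Partitioning", "RECORDS_JUMPED").increment(1);
        } else {
            context.getCounter("Partitioning", "RECORDS_STAYED").increment(1);
        }
        context.write(key, new Text(parts[1]));
    }
}
```

The counter totals would then show up in the job tracker UI and in the job's Counters object after completion.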
Any suggestions (tools, tutorials, books, ...) are greatly appreciated.
Comments (2)
Make sure that you're not running into the small files problem. Hadoop is optimized for throughput rather than latency, so it will process many log-files joined into one large sequence file much more quickly than it will many individual files stored on the HDFS. Using sequence files in this way eliminates the extra time needed to do housekeeping for individual map and reduce tasks and improves data locality. But yes, it's important that your map outputs are reasonably well distributed to the reducers, to ensure that a few reducers are not overloaded with a disproportionate amount of work.
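As a hedged illustration of that joining step (the local directory, HDFS output path, and line-number key are my own placeholder assumptions, not part of the answer), many small log files could be packed into one SequenceFile roughly like this:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: concatenate a directory of small local log files into one large
// SequenceFile on HDFS, keyed by a running line counter. The local directory
// and the HDFS output path are hypothetical placeholders.
public class LogsToSequenceFile {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/input/logs-one-day.seq"); // hypothetical path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, LongWritable.class, Text.class);
        try {
            long lineNo = 0;
            for (File logFile : new File("local-logs").listFiles()) {
                BufferedReader reader =
                        new BufferedReader(new FileReader(logFile));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        writer.append(new LongWritable(lineNo++), new Text(line));
                    }
                } finally {
                    reader.close();
                }
            }
        } finally {
            writer.close();
        }
    }
}
```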
Take a look at the Karmasphere (formerly known as Hadoop Studio) plugin for Netbeans/Eclipse: http://karmasphere.com/Download/download.html. There's a free version that can help with detecting and test-running Hadoop jobs.
I have tested it a little and it looks promising.