Elastic MapReduce and external jars
So, it is easy enough to handle external jars when using Hadoop directly: the -libjars option will do this for you. The question is how you do this with EMR. There must be an easy way. I thought the -cachefile option of the EMR CLI would do it, but I couldn't get it to work. Any ideas, anyone?
Thanks for the help.
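For reference, this is how -libjars is used with plain Hadoop; the jar names, class name, and paths below are placeholders, and -libjars only takes effect when the driver class is run through ToolRunner/GenericOptionsParser:

```
# Plain Hadoop: -libjars ships the listed jars to the distributed cache
# and puts them on the task classpath. All names here are placeholders.
hadoop jar myapp.jar com.example.MyJob \
    -libjars /local/path/dep1.jar,/local/path/dep2.jar \
    /input /output
```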
3 Answers
The best luck I have had with external jar dependencies is to copy them (via bootstrap action) to /home/hadoop/lib throughout the cluster. That path is on the classpath of every host. This technique is the only one that seems to work regardless of where the code lives that accesses external jars (tool, job, or task).
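A minimal sketch of such a bootstrap action, assuming the jars are staged in S3 and the AWS CLI is present on the image (bucket, prefix, and script name are placeholders; on older AMIs, hadoop fs -copyToLocal can perform the same copy, and per the note further down, newer releases use /usr/lib/hadoop-mapreduce instead of /home/hadoop/lib):

```
#!/bin/bash
# copy-libs.sh -- bootstrap action; runs on every node before any job starts.
# Copies dependency jars into /home/hadoop/lib so they land on the
# classpath of every host. Bucket and prefix are placeholders.
aws s3 cp s3://my-bucket/libs/ /home/hadoop/lib/ --recursive
```

The script itself is uploaded to S3 and registered at cluster creation, e.g. with the modern aws CLI (the question predates it and used the older elastic-mapreduce Ruby client):

```
# Register the bootstrap action; other cluster options trimmed for brevity.
aws emr create-cluster --name "cluster-with-libs" \
    --applications Name=Hadoop \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://my-bucket/bootstrap/copy-libs.sh
```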
One option is to have the first step in your jobflow set up the JARs wherever they need to be. Or, if they are dependencies, you can package them in with your application JAR (which is probably in S3).
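A hedged sketch of the first option, adding a setup step that stages the jars before the real job runs (cluster ID, bucket, and paths are placeholders; command-runner.jar is available from EMR release 4.x onward). One caveat: a step executes on the master node only, so for jars that tasks on every host must see, the bootstrap-action approach above is the safer route:

```
# Hypothetical setup step: stage dependency jars before the main job step.
# Note this runs on the master node only.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=StageJars,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[bash,-c,"aws s3 cp s3://my-bucket/libs/ /home/hadoop/lib/ --recursive"]'
```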
FYI, for newer versions of EMR, /home/hadoop/lib is not used anymore; /usr/lib/hadoop-mapreduce should be used instead.