Importing external libraries in a Hadoop MapReduce script
I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. As a result of the main scripts, I get item-item similarities. In an aftercare step, I want to split this output into a separate S3 bucket for each item, so that each item bucket contains a list of the items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of the aftercare step.
- How do I import external (Python) libraries into Hadoop so that they can be used in a reduce step written in Python?
- Is it possible to access S3 in that way inside the Hadoop environment?
Thanks in advance,
Thomas
Comments (1)
When launching a Hadoop process you can specify external files that should be made available to it. This is done with the -files argument:
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
I don't know if the files HAVE to be on the HDFS, but if it's a job that will be running often, it wouldn't be a bad idea to put them there.
From the code you can do something similar to
This is all but copied and pasted directly from working code inside several of our Mappers.
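For the Python streaming case in the question, the same idea might look roughly like the sketch below. It is not the answerer's code. It assumes that files shipped with -files appear in the task's working directory, that the library is a pure-Python package packed into a zip archive (the name libs.zip is hypothetical), and that the reducer input is sorted "item<TAB>similar_item" lines. All function names are made up for illustration.

```python
import os
import sys
from itertools import groupby


def add_shipped_archive(archive_name):
    """Put a zip archive from the task's working directory on sys.path.

    Returns True if the archive was found, False otherwise.
    The archive name is an assumption, not something Hadoop prescribes.
    """
    path = os.path.join(os.getcwd(), archive_name)
    if not os.path.exists(path):
        return False
    if path not in sys.path:
        # Python can import pure-Python modules directly from a zip file.
        sys.path.insert(0, path)
    return True


def parse_line(line):
    """Split one tab-separated streaming record into (key, value)."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def reduce_stream(lines):
    """Group sorted "item<TAB>similar" lines into (item, [similar, ...])."""
    records = (parse_line(line) for line in lines if line.strip())
    for item, group in groupby(records, key=lambda kv: kv[0]):
        yield item, [value for _, value in group]


def main():
    # After this call, modules inside the (hypothetical) libs.zip
    # can be imported with a normal import statement.
    add_shipped_archive("libs.zip")
    for item, similar in reduce_stream(sys.stdin):
        sys.stdout.write("%s\t%s\n" % (item, ",".join(similar)))
```

Calling main() at the bottom of the script would turn this into a streaming reducer. Grouping with itertools.groupby relies on Hadoop handing the reducer its input sorted by key.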
I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)
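As a rough starting point for that second part, here is a sketch of how a reduce step could hand its per-item lists to S3 through boto. It rests on several assumptions: the boto package (version 2) is importable on the task nodes, credentials are available to boto through its usual configuration, and the results go into one object per item under a single bucket, which differs from the bucket-per-item layout in the question. The bucket name, key prefix, and function names are hypothetical.

```python
def store_similar_items(records, put):
    """Write each (item, [similar, ...]) pair through a `put` callable.

    `put(item, text)` is injected so the logic can be tried without S3.
    Returns the number of items written.
    """
    count = 0
    for item, similar in records:
        put(item, "\n".join(similar) + "\n")
        count += 1
    return count


def make_boto_put(bucket_name, prefix="similar/"):
    """Build a `put` callable backed by boto 2 (assumed to be installed)."""
    # Imported lazily so the rest of the module works without boto.
    import boto
    from boto.s3.key import Key

    bucket = boto.connect_s3().get_bucket(bucket_name)

    def put(item, text):
        key = Key(bucket)
        key.key = "%s%s.txt" % (prefix, item)  # hypothetical key layout
        key.set_contents_from_string(text)

    return put
```

In a reducer this could be used as store_similar_items(records, make_boto_put("my-similarity-bucket")), where the bucket name is a placeholder. Passing a plain function instead of make_boto_put(...) lets the grouping and formatting be checked locally before running on the cluster.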
In addition to -files there is -libjars for including additional jars; I have a little information about it here: If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?