Importing external libraries in a Hadoop MapReduce script

Posted on 2024-10-17 03:01:47

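I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. The main script gives me item-to-item similarities. In a follow-up step, I want to split this output into a separate S3 bucket for each item, so that each item's bucket contains the list of items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of that follow-up step.

How do I import external (Python) libraries into Hadoop so that they can be used in a reduce step written in Python?
Is it possible to access S3 in that way from inside the Hadoop environment?

Thanks in advance, Thomas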


鲜肉鲜肉永远不皱 2024-10-24 03:01:47


When launching a Hadoop job, you can specify external files that should be made available to its tasks. This is done with the -files argument:

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat

I don't know whether the files have to be on HDFS, but if the job runs often, putting them there wouldn't be a bad idea.
From the code, you can then do something like this:

// Requires java.io.File, org.apache.hadoop.fs.Path and
// org.apache.hadoop.filecache.DistributedCache.
Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
if (localFiles != null) {
    for (Path localFile : localFiles) {
        // Pick out the cached file by name (case-insensitive).
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            // Local copy of the file that was shipped with -files.
            File file = new File(localFile.toUri().getPath());
        }
    }
}

This is copied almost verbatim from working code inside several of our Mappers.
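Since the question is about a reducer written in Python, here is a rough sketch of how the same -files idea might carry over to a streaming job. It assumes the library has been zipped into an archive called libs.zip (a made-up name) and passed with -files, so that it shows up in the task's working directory. It also assumes the library is pure Python, because Python cannot import compiled extension modules from a zip. The helper name add_archive_to_path is mine, not part of Hadoop or boto.

```python
import os
import sys


def add_archive_to_path(archive_name, search_dir="."):
    """Put a shipped zip archive at the front of sys.path.

    archive_name is the file name given to -files (hypothetical: "libs.zip").
    search_dir defaults to the current working directory, which is where
    a streaming task would normally see files shipped with -files.
    Returns True if the archive was found, False otherwise.
    """
    archive_path = os.path.abspath(os.path.join(search_dir, archive_name))
    if not os.path.isfile(archive_path):
        return False
    if archive_path not in sys.path:
        # Python can import pure-Python modules directly from a zip on sys.path.
        sys.path.insert(0, archive_path)
    return True
```

In the reducer script you would call add_archive_to_path("libs.zip") near the top and only then run import boto, assuming the archive has a top-level boto/ package. If the call returns False, the archive did not reach the task and the import would fail anyway, so it is worth logging that to stderr.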

I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)

In addition to -files, there is -libjars for including additional jars. I have a little more information about it in this question: "If I have a constructor that requires a path to a file, how can I 'fake' that if it is packaged into a jar?"
