Is it possible to use a custom Hadoop version with EMR?

Posted on 2025-02-11 07:46:55

As of today (2022-06-28), AWS EMR latest version is 6.6.0, which uses Hadoop 3.2.1.

I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work. You can either set ReleaseLabel or Hadoop version, but not both.

import boto3

client = boto3.client("emr", region_name="us-west-1")

# Fails: current EMR releases reject a per-application Version
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}]
)
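For comparison, here is a minimal sketch of the request shape the EMR API does accept under 6.x releases: the release label pins all component versions, so `Applications` entries carry only a `Name`. The cluster name below is a placeholder, and a real `run_job_flow` call would also need `Instances` and role parameters:

```python
# Sketch of the accepted request shape: ReleaseLabel pins component
# versions (emr-6.6.0 implies Hadoop 3.2.1), so no Version field appears.
request = {
    "Name": "example-cluster",  # placeholder name
    "ReleaseLabel": "emr-6.6.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
}
print(request["ReleaseLabel"])  # → emr-6.6.0
```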

Another approach that doesn't seem to be an option is loading a specific Hadoop jar with SparkSession.builder.getOrCreate(), like so:

from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
        .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
        .getOrCreate()

Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?


Comments (2)

淡水深流 2025-02-18 07:46:55

I'm afraid not. AWS don't want the support headache of allowing unsupported Hadoop versions, so they're always a little bit behind as they presumably take time to test each new release and its compatibility with other Hadoop tools. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html.

You'd have to build your own cluster from scratch in EC2.

放血 2025-02-18 07:46:55

You just need to add the script to the bootstrap actions section when you spin up your cluster: spark-patch-s3a-fix_emr-6.6.0.sh. Amazon provided this fix only for EMR 6.6.0.
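The wiring the answer describes might look like the following sketch in boto3, where the patch script is passed as a bootstrap action. The S3 bucket path is an assumption (you'd upload the script to your own bucket first), and the dict would be passed as the `BootstrapActions` argument of `run_job_flow`:

```python
# Hypothetical sketch: registering the s3a patch script as an EMR
# bootstrap action. The bucket path below is a placeholder.
bootstrap_actions = [
    {
        "Name": "Apply s3a patch for EMR 6.6.0",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/spark-patch-s3a-fix_emr-6.6.0.sh",  # assumed location
        },
    }
]

# This list would then be supplied alongside the usual cluster parameters:
# client.run_job_flow(ReleaseLabel="emr-6.6.0",
#                     BootstrapActions=bootstrap_actions, ...)
print(bootstrap_actions[0]["Name"])  # → Apply s3a patch for EMR 6.6.0
```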
