As of today (2022-06-28), the latest AWS EMR release is 6.6.0, which uses Hadoop 3.2.1.
I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work: you can set either the ReleaseLabel
or the Hadoop version, but not both.
import boto3

client = boto3.client("emr", region_name="us-west-1")
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    # On EMR 4.x+ the application version is fixed by the release label,
    # so requesting a specific Hadoop version here does not work
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}],
)
Another approach, which also does not seem to be an option, is loading a specific Hadoop jar via SparkSession.builder.getOrCreate(), like so:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .getOrCreate()
Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?
Comments (2)
I'm afraid not. AWS doesn't want the support headache of allowing unsupported Hadoop versions, so they're always a little behind, as they presumably take time to test each new release and its compatibility with the other Hadoop tools. See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html.
You'd have to build your own cluster from scratch in EC2.
You just need to add the script to the Bootstrap section when you spin up your cluster, this one ->
spark-patch-s3a-fix_emr-6.6.0.sh
=) Amazon provided this fix only for EMR 6.6.0.
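In case it helps, here is a minimal sketch of how a bootstrap script like the one above can be attached when launching the cluster with boto3's run_job_flow, matching the question's setup. The bucket path, cluster name, and instance settings below are placeholder assumptions, not values from this thread; the patch script would need to be uploaded to your own S3 bucket first.

```python
# Sketch: passing a bootstrap script to run_job_flow (hypothetical names/paths).
run_job_flow_params = {
    "Name": "my-emr-cluster",  # placeholder cluster name
    "ReleaseLabel": "emr-6.6.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
    "BootstrapActions": [
        {
            "Name": "patch-s3a",
            "ScriptBootstrapAction": {
                # Placeholder bucket -- upload the patch script here yourself
                "Path": "s3://my-bucket/bootstrap/spark-patch-s3a-fix_emr-6.6.0.sh"
            },
        }
    ],
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# As in the question:
# client = boto3.client("emr", region_name="us-west-1")
# response = client.run_job_flow(**run_job_flow_params)
```

EMR runs every bootstrap action on each node before the applications start, which is what lets a script like this patch the installed jars ahead of any Spark job.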