EMR, Spark: proper place for a local shared cache
In our Spark application, we store the local application cache in the /mnt/yarn/app-cache/ directory, which is shared between app containers on the same EC2 instance. /mnt/... is chosen because it is a fast NVMe SSD on r5d instances.
This approach worked well for several years on EMR 5.x: /mnt/yarn belongs to the yarn user, app containers run as yarn, and they can create directories there.
In EMR 6.x things changed: containers now run as the hadoop user, which does not have write access to /mnt/yarn/. The hadoop user can create directories in /mnt/, but yarn cannot, and I want to keep compatibility: the app should be able to run successfully on both EMR 5.x and 6.x.
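A quick way to see the difference is to check which user the container runs as and whether the legacy cache root is writable. Below is a minimal diagnostic sketch using plain JDK calls; the paths are the ones from this question:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheDirCheck {
    public static void main(String[] args) {
        // Prints "yarn" inside an EMR 5.x container, "hadoop" on EMR 6.x.
        System.out.println("container user: " + System.getProperty("user.name"));

        // Writable for the yarn user on EMR 5.x, but not for hadoop on 6.x.
        Path legacyRoot = Paths.get("/mnt/yarn");
        System.out.println("/mnt/yarn writable: " + Files.isWritable(legacyRoot));

        // The bare mount: writable for hadoop on 6.x, but not for yarn on 5.x.
        System.out.println("/mnt writable: " + Files.isWritable(Paths.get("/mnt")));
    }
}
```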
java.io.tmpdir also doesn't work: it is different for each container.
What is the proper place to store the cache on the NVMe SSDs (/mnt, /mnt1) so that it is accessible to all containers and works on both EMR 5.x and 6.x?
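One hedged way to meet both layouts, assuming the candidate roots and the app-cache name carried over from the question: probe a list of roots in preference order, create the cache directory under the first one the current user can write to, and open its permissions so containers running as either yarn or hadoop can share it. A sketch of the idea, not a confirmed EMR convention:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Arrays;
import java.util.List;

public class SharedCacheDir {
    // Preference order: the EMR 5.x location first, then the bare NVMe
    // mounts that the hadoop user can write to on EMR 6.x.
    private static final List<Path> CANDIDATES = Arrays.asList(
            Paths.get("/mnt/yarn/app-cache"),
            Paths.get("/mnt/app-cache"),
            Paths.get("/mnt1/app-cache"));

    public static Path resolve() throws IOException {
        for (Path dir : CANDIDATES) {
            try {
                Files.createDirectories(dir); // no-op if it already exists
            } catch (IOException e) {
                continue; // parent not writable for this user, try the next mount
            }
            try {
                // rwxrwxrwx so a directory created by one user stays usable
                // by the other; the umask may narrow the mode at creation.
                Files.setPosixFilePermissions(dir,
                        PosixFilePermissions.fromString("rwxrwxrwx"));
            } catch (IOException ignored) {
                // Not the owner; acceptable if the directory is already open.
            }
            if (Files.isWritable(dir)) {
                return dir;
            }
        }
        throw new IOException("no writable cache root among " + CANDIDATES);
    }
}
```

Whichever container starts first creates the directory; later containers, whichever user they run as, find it writable and reuse it.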
On your EMR cluster, you can add the yarn user to the superuser group; by default, this group is called supergroup. You can confirm that this is the right group by checking dfs.permissions.superusergroup in the hdfs-site.xml file.

You could also try modifying the following HDFS properties (in the same file): dfs.permissions.enabled or dfs.datanode.data.dir.perm.
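To confirm the group programmatically, you can read the property straight from hdfs-site.xml with the Hadoop Configuration API. A small sketch, assuming the usual EMR config location /etc/hadoop/conf:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class SuperGroupCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // EMR keeps the Hadoop site files under /etc/hadoop/conf.
        conf.addResource(new Path("file:///etc/hadoop/conf/hdfs-site.xml"));
        // Hadoop's built-in default for this property is "supergroup".
        System.out.println(conf.get("dfs.permissions.superusergroup", "supergroup"));
    }
}
```

Once the group name is confirmed, the membership change itself is a one-liner per node, e.g. usermod -a -G supergroup yarn.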