How can I load data from a remote HDFS into Spark?
Our data is stored on a remote Hadoop cluster, but for a PoC I need to run a Spark application locally on my machine. How can I load data from that remote HDFS?
Answers (2)
You can configure Spark to access any Hadoop instance you have access to (ports open, nodes reachable).
Since you want to access HDFS, you need the hdfs-site.xml and core-site.xml from the cluster you are trying to access.
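If it helps, here is a minimal sketch of what a local PoC run could look like once those two files are visible to Spark (for example via HADOOP_CONF_DIR or by adding them to the Hadoop configuration). The namenode host/port and the CSV path below are placeholders, not values from the question:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class LoadFromRemoteHdfs {
        public static void main(String[] args) {
            // Run Spark locally for the PoC while the data stays on the remote cluster.
            SparkSession spark = SparkSession.builder()
                    .appName("remote-hdfs-poc")
                    .master("local[*]")
                    .getOrCreate();

            // Read directly from the remote HDFS using a fully qualified URI.
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .csv("hdfs://namenode-host:8020/data/sample.csv");

            df.show();
            spark.stop();
        }
    }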
For anyone who wants to access a remote HDFS from a Spark Java app, here are the steps.
Firstly, you need to add a --conf key to your run command. The exact key depends on your Spark version:
spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
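For illustration, the run command could look roughly like this (the jar name, main class, and cluster names are placeholders; newer Spark versions use spark.yarn.access.hadoopFileSystems instead of spark.yarn.access.namenodes, so check the docs for your version):

    spark-submit \
      --master yarn \
      --conf spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB \
      --class com.example.MyApp \
      my-app.jar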
Secondly, when you create Spark's Java context, add the remote cluster's configuration files to it, as in the sketch below.
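A minimal sketch of that step (not the author's exact code; the config file paths and the hdfs://clusterB path are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RemoteHdfsContext {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("remote-hdfs-poc").setMaster("local[*]");
            JavaSparkContext jsc = new JavaSparkContext(conf);

            // Load the remote cluster's config files so its nameservice is resolvable.
            jsc.hadoopConfiguration().addResource(new Path("/path/to/core-site.xml"));
            jsc.hadoopConfiguration().addResource(new Path("/path/to/hdfs-site.xml"));

            // Paths on the remote HDFS can now be read directly.
            long lines = jsc.textFile("hdfs://clusterB/data/sample.txt").count();
            System.out.println("line count: " + lines);

            jsc.stop();
        }
    }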
If you face an exception at this step, then try to put the full namenode address of your remote HDFS, including the port (instead of the hdfs/cluster short name), into the --conf in your run command.
More details in my article: https://mchesnavsky.tech/spark-java-access-remote-hdfs.