How do I load data from a remote HDFS into Spark?

Asked 2025-01-23 16:04:42

Our data is stored on a remote Hadoop cluster, but to do a PoC I need to run a Spark application locally on my machine. How can I load data from that remote HDFS?


Answers (2)

╭⌒浅淡时光〆 2025-01-30 16:04:42

You can configure Spark to access any Hadoop instance you have access to (ports open, nodes reachable).

Custom Hadoop/Hive Configuration

If your Spark application is interacting with Hadoop, Hive, or both, there are probably Hadoop/Hive
configuration files in Spark’s classpath.

Multiple running applications might require different Hadoop/Hive
client side configurations. You can copy and modify hdfs-site.xml,
core-site.xml, yarn-site.xml, hive-site.xml in Spark’s classpath for
each application. In a Spark cluster running on YARN, these
configuration files are set cluster-wide, and cannot safely be changed
by the application.

Since you want to access HDFS, you need hdfs-site.xml and core-site.xml from the cluster you are trying to access.
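
A minimal sketch of what this can look like in a Spark application running locally, assuming the two files copied from the remote cluster are either on Spark's classpath or registered explicitly as below; the class name, NameNode host, port, and file path are placeholders, not values from the original question:

import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RemoteHdfsPoc {
    public static void main(String[] args) {
        // Run Spark locally for the PoC
        SparkSession spark = SparkSession.builder()
                .appName("remote-hdfs-poc")
                .master("local[*]")
                .getOrCreate();

        // Merge the client configuration copied from the remote cluster
        // (redundant if core-site.xml / hdfs-site.xml are already on the classpath)
        spark.sparkContext().hadoopConfiguration().addResource(new Path("core-site.xml"));
        spark.sparkContext().hadoopConfiguration().addResource(new Path("hdfs-site.xml"));

        // Placeholder NameNode address and file path on the remote cluster
        Dataset<Row> df = spark.read().text("hdfs://namenode-host:8020/path/to/data.txt");
        df.show();

        spark.stop();
    }
}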

怂人 2025-01-30 16:04:42

For anyone who wants to access a remote HDFS from a Spark Java app, here are the steps.

First, you need to add a --conf key to your run command (see the spark-submit sketch after this list). Depending on the Spark version:

  • (Spark 1.x-2.1)
    spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
  • (Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB
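
For example, a run command for Spark 2.2+ might carry the key like this; the class name, jar, and cluster URIs are placeholders, and the property is a YARN-mode setting, so it is not needed for a plain local[*] run:

spark-submit \
  --class com.example.RemoteHdfsPoc \
  --master yarn \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB \
  my-app.jar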

Second, when you are creating Spark’s Java context, add the following:

javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));

If you are facing this exception:

java.net.UnknownHostException: clusterB

then try putting the full NameNode address of your remote HDFS, including the port, into --conf in your run command instead of the hdfs://cluster short name.
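
For instance, a fully qualified value might look like the following, where the host and port are placeholders for your cluster's actual NameNode RPC address (8020 is a common default):

spark.yarn.access.hadoopFileSystems=hdfs://namenode-b.example.com:8020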

More details in my article: https://mchesnavsky.tech/spark-java-access-remote-hdfs.
