Access Azure ADLS Gen 2 from Hadoop using the CLI
I basically want to list files under an ADLS Gen 2 container using hadoop fs -ls from a standalone on-prem Cloudera cluster. However, I am getting this error:
Command run from bash:
hadoop fs -Dfs.azure.account.key.accountName.dfs.core.windows.net="accessKey" -Dfs.azure.createRemoteFileSystemDuringInitialization=true -ls abfss://containerName@accountName.dfs.core.windows.net/
Error:
WARN fs.FileSystem: Failed to initialize fileystem abfss://containerName@accountName.dfs.core.windows.net/: Invalid configuration value detected for fs.azure.account.key ls: Invalid configuration value detected for fs.azure.account.key
Then, I ran this same fs -ls command from within a Spark program by configuring:
sc._jsc.hadoopConfiguration().set('fs.azure.account.auth.type.accountName.dfs.core.windows.net','SharedKey')
sc._jsc.hadoopConfiguration().set('fs.azure.account.key.accountName.core.windows.net','accessKey')
sc._jsc.hadoopConfiguration().set('fs.abfss.impl','org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem')
The error from the Spark shell:
WARN fs.FileSystem: Failed to initialize fileystem abfss://containerName@accountName.dfs.core.windows.net/: Configuration property accountName.dfs.core.windows.net not found
Note:
- PySpark reads and writes to this ADLS Gen 2 container work as expected after setting up the Spark conf. The issue only occurs when I try this with hadoop fs commands; I eventually want to use distcp as well, along with PySpark.
- I haven't configured anything in core-site.xml. Rather, I want to pass all keys, parameters, and settings independently within the program's or script's context, even on bash. Looking for a solution that meets these criteria.
- Also, I am not using OAuth for this, since I am just running a POC. For now, I am only interested in testing this with SharedKey.
Can someone help me identify the issue here?
1 Answer
As per the article, please note the following limitation:
ADLS is not supported as the default filesystem. Do not set the default filesystem property (fs.defaultFS) to an abfss:// URI. You can use ADLS as a secondary filesystem while HDFS remains the primary filesystem.
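For the SharedKey CLI test itself, note that the ABFS property names must carry the full .dfs.core.windows.net suffix; the Spark error above ("Configuration property accountName.dfs.core.windows.net not found") is what the driver reports when the account key property is set without the .dfs segment. Below is a minimal sketch of a listing that passes everything on the command line, assuming accountName, containerName, and accessKey are placeholders for your real storage account, container, and account key:

# SharedKey auth passed entirely as generic -D options; nothing in core-site.xml.
# Both property names need the full ".dfs.core.windows.net" suffix.
# fs.defaultFS stays pointing at HDFS; ADLS is addressed only via the explicit abfss:// URI.
hadoop fs \
  -Dfs.azure.account.auth.type.accountName.dfs.core.windows.net=SharedKey \
  -Dfs.azure.account.key.accountName.dfs.core.windows.net=accessKey \
  -ls abfss://containerName@accountName.dfs.core.windows.net/

If the "Invalid configuration value detected for fs.azure.account.key" warning still appears with a real key, it usually means the key value itself is being rejected (for example, an empty or truncated value) rather than a misnamed property.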
Please follow the references below; they have detailed information:
Reference:
https://www.youtube.com/watch?v=h3jYrhl4Y4M
https://docs.cloudera.com/runtime/7.2.10/cloud-data-access/topics/cr-cda-hadoop-file-system-commands.html
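Since the question also mentions distcp: the same pattern of passing the account key as a generic -D option applies there as well. A minimal sketch, with hdfs:///tmp/src, accountName, containerName, and accessKey again as assumed placeholders:

# Copy from the primary HDFS filesystem into ADLS Gen 2 as a secondary filesystem.
hadoop distcp \
  -Dfs.azure.account.key.accountName.dfs.core.windows.net=accessKey \
  hdfs:///tmp/src \
  abfss://containerName@accountName.dfs.core.windows.net/dst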