No FileSystem for scheme "s3" when trying to read a list of files from EC2 with Spark
I'm trying to provide a list of files for Spark to read as and when it needs them (which is why I'd rather not use boto or anything else to pre-download all the files onto the instance and only then read them into Spark "locally").
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master local[3] pyspark-shell"
spark = SparkSession.builder.getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['AccessKeyId'])
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['SecretAccessKey'])
spark.read.json(['s3://url/3521.gz', 's3://url/2734.gz'])
No idea what local[3] is about, but without this --master flag I was getting another exception:
Exception: Java gateway process exited before sending the driver its port number.
Now, I'm getting this:
Py4JJavaError: An error occurred while calling o37.json.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
...
Not sure what o37.json refers to here, but it probably doesn't matter.
I saw a bunch of answers to similar questions suggesting adding flags like:
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell"
I tried both prepending it and appending it to the other flag, but it doesn't work.
Neither do the many variations I see in other answers and elsewhere on the internet (with different packages and versions), for example:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] --jars spark-snowflake_2.12-2.8.4-spark_3.0.jar,postgresql-42.2.19.jar,mysql-connector-java-8.0.23.jar,hadoop-aws-3.2.2,aws-java-sdk-bundle-1.11.563.jar'
Comments (1)
A typical example for reading files from S3 is as below -
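A minimal sketch of such a read, assuming the missing piece is the S3A connector (the s3a:// scheme, which comes from the hadoop-aws package). The helper names, the 3.2.2 version and the idea of rewriting s3:// paths to s3a:// are illustrative assumptions, not something taken from the question; the hadoop-aws version has to match the Hadoop version your Spark build ships with.

```python
import os

# Assumption: this must match the Hadoop version bundled with your Spark build;
# 3.2.2 is only an illustrative value.
HADOOP_AWS_VERSION = "3.2.2"


def submit_args(hadoop_aws_version, master="local[3]"):
    """Build PYSPARK_SUBMIT_ARGS; 'pyspark-shell' has to be the last token."""
    package = f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}"
    return f"--master {master} --packages {package} pyspark-shell"


def to_s3a(path):
    """Rewrite an s3:// URI to s3a://, the scheme handled by hadoop-aws."""
    prefix = "s3://"
    return "s3a://" + path[len(prefix):] if path.startswith(prefix) else path


def read_json_from_s3(paths, access_key, secret_key):
    """Hypothetical helper: start Spark with hadoop-aws and read JSON from S3."""
    # Must be set before the first SparkSession is created in this process.
    os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(HADOOP_AWS_VERSION)
    from pyspark.sql import SparkSession  # imported here so the helpers above work without pyspark

    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    conf.set("fs.s3a.access.key", access_key)
    conf.set("fs.s3a.secret.key", secret_key)  # note: a different property than the access key
    return spark.read.json([to_s3a(p) for p in paths])
```

It would be called as read_json_from_s3(['s3://url/3521.gz', 's3://url/2734.gz'], credentials['AccessKeyId'], credentials['SecretAccessKey']). Two things in the question's snippet are worth a second look: it sets fs.s3.access.key twice, so the secret key overwrites the access key, and if the credentials are temporary ones from STS, a session token (fs.s3a.session.token, with org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as fs.s3a.aws.credentials.provider) is probably needed as well. As for local[3], it just runs Spark inside the local process with three worker threads.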
Additionally, you can go through these answers to make sure the minimal setup and the necessary modules are in place -
java.io.IOException: No FileSystem for scheme: s3
Read Parquet - S3
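Since most of the flag variations differ only in package versions, it may help to check which Hadoop version your PySpark install actually bundles before picking a hadoop-aws version. A rough sketch, assuming a pip-installed pyspark that keeps its jars in a jars directory inside the package (with a separate SPARK_HOME the jars live elsewhere); the function names are made up for illustration:

```python
import os
import re

# Matches e.g. hadoop-common-3.2.0.jar or hadoop-client-api-3.3.1.jar
_HADOOP_JAR = re.compile(r"^hadoop-(?:common|client-api|client)-(\d+(?:\.\d+)+)\.jar$")


def hadoop_version_from_jars(filenames):
    """Return the Hadoop version encoded in the jar names, or None if absent."""
    for name in sorted(filenames):
        match = _HADOOP_JAR.match(name)
        if match:
            return match.group(1)
    return None


def bundled_hadoop_version():
    """Hypothetical helper: inspect the jars shipped inside the pyspark package."""
    import pyspark  # imported here so the parser above works without pyspark

    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    return hadoop_version_from_jars(os.listdir(jars_dir))
```

Whatever bundled_hadoop_version() returns is the version to use in org.apache.hadoop:hadoop-aws:<version>; mixing, say, hadoop-aws 2.7.2 with a Hadoop 3.x Spark build is a common reason the --packages variants fail.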