Reading parquet into pandas gives FileNotFoundError
I have the code below and it runs fine. It reads the data as a Spark DataFrame:
April_data = sc.read.parquet('somepath/data.parquet')
type(April_data)
pyspark.sql.dataframe.DataFrame
But when I try to read it as a pandas DataFrame, I get an error:
df_pp = pd.read_parquet('somepath/data.parquet')
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_4244/1910461502.py in <module>
----> 1 df_pp = pd.read_parquet('somepath/data.parquet')
/usr/local/anaconda//parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
498 storage_options=storage_options,
499 use_nullable_dtypes=use_nullable_dtypes,
--> 500 **kwargs,
501 )
/usr/local/anaconda//io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
234 kwargs.pop("filesystem", None),
235 storage_options=storage_options,
--> 236 mode="rb",
237 )
238 try:
/usr/local/anaconda/parquet.py in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
100 # this branch is used for example when reading from non-fsspec URLs
101 handles = get_handle(
--> 102 path_or_handle, mode, is_text=False, storage_options=storage_options
103 )
104 fs = None
/usr/local/anaconda/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
709 else:
710 # Binary mode
--> 711 handle = open(handle, ioargs.mode)
712 handles.append(handle)
713
FileNotFoundError: [Errno 2] No such file or directory: 'somepath/data.parquet'
I have installed the fastparquet package as below:
!pip install fastparquet
Successfully installed cramjam-2.5.0 fastparquet-0.8.1
# Update 1
The file is located in HDFS, and I can see it when I run:
hdfs_location = 'somepath/'
!hdfs dfs -ls $hdfs_location
I am running all this code in the same file
Per the docs, pandas.read_parquet, like its sibling IO modules, does not support reading from HDFS locations. While there is read_hdf, it does not read parquet or other known formats.

For string values passed to read_parquet, only local file paths, online schemes (http, ftp), and two specific storage schemes (Amazon S3 buckets, i.e. s3, and Google Cloud Storage, i.e. gs) are currently supported. However, you can pass file-like objects, so consider opening the needed parquet file and passing its contents. Below are examples using HDFS packages:
Also, fastparquet supports conversion to a pandas DataFrame: