ArrowInvalid: GetFileInfo() yielded a path outside the base dir when reading Parquet
I have a Parquet dataset with multiple partition files stored in my S3 bucket. I want to read it into a pandas DataFrame, but I am now getting an ArrowInvalid error that I did not get before.
From time to time, this data has been overwritten with an earlier snapshot of pandas data, like this:
import os

import pandas as pd # version 1.3.4
# pyarrow version 5.0

# df is the DataFrame snapshot being written
df.to_parquet(
    f's3a://{bucket_and_prefix}',
    storage_options={
        "key": os.getenv("AWS_ACCESS_KEY_ID"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify': os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    },
    index=False
)
But when reading it with:
df = pd.read_parquet(
    f"s3a://{bucket_and_prefix}",
    storage_options={
        "key": os.getenv("AWS_ACCESS_KEY_ID"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {
            'verify': os.getenv('AWS_CA_BUNDLE'),
            'endpoint_url': 'https://prd-data.company.com/'
        }
    }
)
It fails with this error:
ArrowInvalid: GetFileInfo() yielded path 'bucket/folder/data.parquet/year=2021/month=2/abcde.parquet', which is outside base dir 's3://bucket/folder/data.parquet'
Any idea why this ArrowInvalid error happens and how I can read the parquet data into pandas?
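One thing visible in the message itself: the path that GetFileInfo() yielded has no scheme, while the base dir still carries the `s3://` prefix. A minimal sketch of one way to check whether that prefix mismatch is involved is to build the filesystem explicitly and hand pyarrow a scheme-less path. This is an assumption to test, not a confirmed diagnosis. `strip_scheme` and `read_partitioned_parquet` are hypothetical helper names introduced here; the credentials and endpoint are the ones from the snippets above, and `s3fs` is assumed to be installed.

```python
import os


def strip_scheme(uri: str) -> str:
    """Drop a leading 's3://' or 's3a://' style scheme and surrounding slashes."""
    _, sep, rest = uri.partition("://")
    return (rest if sep else uri).strip("/")


def read_partitioned_parquet(uri: str):
    """Read a partitioned Parquet dataset through an explicit s3fs filesystem.

    Hypothetical helper: it only changes how the path and filesystem are
    passed to pyarrow; whether that avoids the error is an assumption.
    """
    # Imported lazily so strip_scheme can be used without these packages.
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem(
        key=os.getenv("AWS_ACCESS_KEY_ID"),
        secret=os.getenv("AWS_SECRET_ACCESS_KEY"),
        client_kwargs={
            "verify": os.getenv("AWS_CA_BUNDLE"),
            "endpoint_url": "https://prd-data.company.com/",
        },
    )
    # Pass 'bucket/prefix' without the scheme, plus the filesystem object.
    table = pq.read_table(strip_scheme(uri), filesystem=fs)
    return table.to_pandas()
```

Calling `read_partitioned_parquet(f"s3a://{bucket_and_prefix}")` would return a DataFrame if the prefix handling was the problem. If the same error appears, the cause probably lies elsewhere, for example in the installed pyarrow, fsspec, or s3fs versions.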