Unable to open a Parquet file created on Windows in Ubuntu
I created a Parquet file on my Windows 10 computer using the following lines:
# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0
import pandas as pd
df = pd.DataFrame(dict(x=[0, 1, 2], y=[3, 4, 5]))
df.to_parquet('some/path/to/my/windows_parquet_file.parquet')
Now I'm creating a pipeline in Azure Pipelines in which I want to load that same file by executing a Python script. The agent that executes the script runs Ubuntu 20.04.4. The contents of the script:
# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0
import pandas as pd
parquet_file_path = 'some/path/to/my/windows_parquet_file.parquet'
df = pd.read_parquet(parquet_file_path)
However, the last line raises an error:
Traceback (most recent call last):
  File "/home/vsts/work/_temp/ec5ac2c3-4983-41d5-abe4-cd532dafb5af.py", line 4, in <module>
    df = pd.read_parquet(parquet_file_path)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1960, in read_table
    dataset = _ParquetDatasetV2(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1766, in __init__
    [fragment], schema=fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 797, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Does anyone know why this error is raised and how to solve it?
I've searched the internet but couldn't find anything about differences in writing or reading Parquet files across operating systems.
The Python version is 3.9 both on my PC and on the agent VM.
Comments (1)
After spending several more hours on this problem, I found the cause of the error and the solution.
The issue has nothing to do with different operating systems or package versions.
The file I referred to is tracked by Git LFS. The file on the agent was therefore no longer a Parquet file but a small Git LFS pointer file that links to the real one.
The solution was to make sure that any relevant files are actually downloaded before trying to access them.
In my specific case, using Azure Pipelines, I found the solution here:
How to use Git LFS with Azure Repos and Pipelines
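For anyone hitting the same error, a quick check on the raw bytes may help to tell an un-downloaded Git LFS pointer from a real Parquet file before calling `pd.read_parquet`. The sketch below is only an illustration: the function name `classify_file` and the return labels are made up for this example. It relies on two things: Git LFS pointer files are text files that begin with `version https://git-lfs.github.com/spec/v1`, and Parquet files begin and end with the 4-byte magic `PAR1`.

```python
import os

# Unencrypted Parquet files start and end with these four bytes.
PARQUET_MAGIC = b"PAR1"
# First line of a Git LFS pointer file (spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_file(path):
    """Return 'lfs-pointer', 'parquet', or 'unknown' (hypothetical labels)."""
    with open(path, "rb") as f:
        # Read just enough of the start to recognise either format.
        head = f.read(len(LFS_POINTER_PREFIX))
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= 4:
            # Read the last four bytes, where the Parquet footer magic lives.
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)

    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    # Smallest plausible layout: magic + 4-byte footer length + magic.
    if size >= 12 and head[:4] == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    return "unknown"
```

In the pipeline script, calling `classify_file(parquet_file_path)` before `pd.read_parquet(parquet_file_path)` and failing early on `'lfs-pointer'` would give a clearer message than the `ArrowInvalid` above. It only diagnoses the problem; the LFS content still has to be fetched during checkout.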