Unable to open a Parquet file in Ubuntu that was created on Windows

Posted on 2025-02-01 21:04:10

I created a Parquet file on my Windows 10 computer using the following lines:

# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0

import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 2], y=[3, 4, 5]))
df.to_parquet('some/path/to/my/windows_parquet_file.parquet')

Now I'm creating a pipeline in Azure Pipelines where I want to load that same file by executing a Python script. The agent executing the script runs Ubuntu 20.04.4. The contents of that script:

# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0

import pandas as pd

parquet_file_path = 'some/path/to/my/windows_parquet_file.parquet'
df = pd.read_parquet(parquet_file_path)

However, the last line raises an error:

Traceback (most recent call last):
  File "/home/vsts/work/_temp/ec5ac2c3-4983-41d5-abe4-cd532dafb5af.py", line 4, in <module>
    df = pd.read_parquet(parquet_file_path)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1960, in read_table
    dataset = _ParquetDatasetV2(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1766, in __init__
    [fragment], schema=fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 797, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Does anyone know why this error is raised and how to solve it?
I've been searching the internet, but I couldn't find anything pointing to differences in writing or reading Parquet files on different operating systems.

The Python version is 3.9 both on my PC and on the agent VM.

Comments (1)

随波逐流 2025-02-08 21:04:10

After spending quite a few more hours on this problem, I found the cause of the error and the solution.
The issue has nothing to do with different operating systems or package versions.

The file I referred to is tracked by Git LFS. The file in the checkout was therefore not a Parquet file but a small text pointer to the real file.
The solution is to make sure that any relevant files are downloaded before trying to access them.
In my specific case, using Azure Pipelines, I found the solution here:
How to use Git LFS with Azure Repos and Pipelines
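
For anyone who hits the same error, a quick way to confirm this diagnosis is to look at the first and last bytes of the file before handing it to pandas. The sketch below is not part of the original fix; the function name `classify_file` and its return labels are made up for illustration. It relies on two facts: a plain Parquet file begins and ends with the 4 bytes `PAR1`, and a Git LFS pointer file is a short text file whose first line starts with `version https://git-lfs.github.com/spec/v1`.

```python
import os
import sys

# A plain (unencrypted) Parquet file starts and ends with these 4 bytes.
PARQUET_MAGIC = b"PAR1"
# The first line of a Git LFS pointer file starts with this text.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_file(path):
    """Hypothetical helper: return 'lfs-pointer', 'parquet', or 'unknown'."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Read just enough bytes to recognise either format.
        head = f.read(len(LFS_POINTER_PREFIX))
        if head.startswith(LFS_POINTER_PREFIX):
            return "lfs-pointer"
        # Loose size check: there must be room for both magic markers.
        if size >= 8 and head[:4] == PARQUET_MAGIC:
            f.seek(-4, os.SEEK_END)
            if f.read(4) == PARQUET_MAGIC:
                return "parquet"
    return "unknown"


if __name__ == "__main__":
    # Usage: python check_file.py some/path/to/my/windows_parquet_file.parquet
    for file_path in sys.argv[1:]:
        print(file_path, "->", classify_file(file_path))
```

If this reports `lfs-pointer`, the real content has not been downloaded yet. Running `git lfs pull` in the checkout fetches it, and in Azure Pipelines the checkout step has an `lfs` option that, as far as I know, does the same during checkout.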
