无法确定Excel文件格式,必须手动指定引擎

发布于 2025-01-10 15:49:31 字数 923 浏览 0 评论 0原文

我不确定为什么会收到此错误,尽管有时我的代码工作正常!

无法确定Excel文件格式,您必须手动指定引擎。

以下是我的代码,包含步骤:

1-客户ID列列表:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2-查找文件夹中所有xlsx文件的代码并阅读它们:

l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"] # to have only one column once concat
    l.append(df)
all_data  = pd.concat(l, ignore_index=True) # concat all data

我添加了引擎 openpyxl

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

现在我得到了一个不同的错误:

BadZipFile: File is not a zip file

pandas 版本:1.3.0 python版本:python3.9 os:MacOS

有没有更好的方法从文件夹中读取所有 xlsx 文件?

I am not sure why I am getting this error although sometimes my code works fine!

Excel file format cannot be determined, you must specify an engine manually.

Here below is my code with steps:

1- list of columns of customers Id:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2- The code to find all xlsx files in a folder and read them:

l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"] # to have only one column once concat
    l.append(df)
all_data  = pd.concat(l, ignore_index=True) # concat all data

I added the engine openpyxl

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

Now I got a different error:

BadZipFile: File is not a zip file

pandas version: 1.3.0
python version: python3.9
os: MacOS

is there a better way to read all xlsx files from a folder ?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

蓝颜夕 2025-01-17 15:49:31

找到了。例如,当 MS Excel 打开 Excel 文件时,会在同一目录中创建隐藏的临时文件:

~$datasheet.xlsx

因此,当我运行代码从文件夹中读取所有文件时,会出现错误:

Excel file format cannot be determined, you must specify an engine manually.

当所有文件都关闭且 <强>没有隐藏的临时文件 ~$filename.xlsx 在同一目录中,代码完美运行。

Found it. When an excel file is opened for example by MS excel a hidden temporary file is created in the same directory:

~$datasheet.xlsx

So, when I run the code to read all the files from the folder it gives me the error:

Excel file format cannot be determined, you must specify an engine manually.

When all files are closed and no hidden temporary files ~$filename.xlsx in the same directory the code works perfectly.

稀香 2025-01-17 15:49:31

另请确保您使用正确的 pd.read_* 方法。我在尝试使用 read_excel() 而不是 read_csv() 打开 .csv 文件时遇到了此错误。我发现了这个方便的片段 此处可根据 Excel 文件类型自动选择正确的方法。

if file_extension == 'xlsx':
    df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file.read())
elif file_extension == 'csv':
    df = pd.read_csv(file.read())

Also make sure you're using the correct pd.read_* method. I ran into this error when attempting to open a .csv file with read_excel() instead of read_csv(). I found this handy snippet here to automatically select the correct method by Excel file type.

if file_extension == 'xlsx':
    df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file.read())
elif file_extension == 'csv':
    df = pd.read_csv(file.read())
笑看君怀她人 2025-01-17 15:49:31

就我而言,我使用了xlrd。因此,在终端中:

pip install xlrd

如果未安装 pandas,请安装它:

pip install pandas

现在以这种方式读取 excel 文件:

import pandas as pd

df = pd.read_excel("filesFolder/excelFile.xls", engine='xlrd')

In my case, I usedxlrd. So in terminal:

pip install xlrd

If pandas is not installed, install it:

pip install pandas

Now read the excel file this way:

import pandas as pd

df = pd.read_excel("filesFolder/excelFile.xls", engine='xlrd')
哑剧 2025-01-17 15:49:31

您可以通过检查文件是否以“~”开头来过滤掉不需要的临时文件。

import os

for file in os.listdir(folder path):
    if not file.startswith("~") and file.endswith(".xlsx"):
        print(file)

You can filter out the unwanted temp files by checking if file starts with "~".

import os

for file in os.listdir(folder path):
    if not file.startswith("~") and file.endswith(".xlsx"):
        print(file)
何必那么矫情 2025-01-17 15:49:31

https://stackoverflow.com/a/32241271/17411729

链接到有关如何删除隐藏文件的答案

Mac = go到文件夹按 cmd + shift + 。
将显示隐藏文件,将其删除,然后再运行。

https://stackoverflow.com/a/32241271/17411729

link to an answer on how to remove hidden files

Mac = go to folder press cmd + shift + .
will show the hidden file, delete it, run it back.

念﹏祤嫣 2025-01-17 15:49:31

在 macOS 中,每个文件夹中都会自动生成一个名为“.DS_Store”的“不可见文件”。对我来说,这就是问题的根源。我用 if 语句解决了这个问题,以绕过“不可见文件”(这不是 xlsx,因此会触发错误)

for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)

In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the "invisible file" (which is not an xlsx, so thus would trigger the error)

for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)
樱娆 2025-01-17 15:49:31

当我手动将“CSV”后缀更改为“XLS”时,我还收到“Excel 文件格式...”错误。我所要做的就是打开 Excel 并将其保存为我想要的格式。

I also got an 'Excel file format...' error when I manually changed the 'CSV' suffix to 'XLS'. All I had to do was open excel and save it to the format I wanted.

泛泛之交 2025-01-17 15:49:31

看起来这个问题很容易解决。转到您的 Excel 文件,无论是 xls、xlsx 还是任何其他扩展名,然后从文件图标“另存为”。当提示选项时。另存为 CSV UTF-8(逗号分隔)(*.csv)

Looks like an easy fix for this one. Go to your excel file, whether it is xls or xlsx or any other extension, and do "save as" from file icon. When prompted with options. Save it as CSV UTF-8(Comma delimited)(*.csv)

梦里人 2025-01-17 15:49:31

对我来说,没有像 ~$datasheet.xlsx 这样的临时 Excel 文件。
但是...当我不小心在 Excel 中打开其中一个文件时,不知何故,Excel 使该文件无法使用。
当我再次尝试用Excel打开它时,我意识到了这一点,Excel也无法打开它。
所以我不得不再次获取文件,这解决了问题。

For me, there were no temporary Excel files like ~$datasheet.xlsx.
but... When I accidentally opened one of the files in Excel, somehow, Excel made the file unuseable.
I realised this when I tried to open it with Excel again, and Excel could not open it, either.
So I had to get the file again, which solved the issue.

坚持沉默 2025-01-17 15:49:31

如果您在尝试使用 URL 从 Google 表格中读取表格时遇到上述错误,请尝试 read_html(请注意,read_html 返回一个列表,因此我们需要索引获取数据帧):

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"

df = pd.read_excel(url)           # <--- ValueError: Excel file format...
df = pd.read_html(url)[0]         # <--- OK

另一种可行的方法是导出到 csv(将 URL 修改为 /export 而不是 /edit 并使用 read_csv):

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"
export_url = url.split("/edit", 1)[0] + "/export?format=csv"
df = pd.read_csv(export_url)

If you got the above error when you tried to read a table from Google Sheets using its URL, then try read_html (note that read_html returns a list, so we need to index it to get the dataframe):

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"

df = pd.read_excel(url)           # <--- ValueError: Excel file format...
df = pd.read_html(url)[0]         # <--- OK

Another way that also works is to export to csv (modify the URL to /export instead of /edit and use read_csv):

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"
export_url = url.split("/edit", 1)[0] + "/export?format=csv"
df = pd.read_csv(export_url)
岛歌少女 2025-01-17 15:49:31

我看到的相同错误是由于 JupyterHub 中的最小笔记本 Docker 映像缺少 Git LFS(大文件存储)支持。

由于默认情况下该映像不包含 Git LFS,因此我的 Git 存储库中使用 Git LFS 提交的任何文件(例如 .xlsx 文件)仅部分克隆为指针文件。 pandas 无法读取这些损坏的文件。

文件大小可用于检查 xlsx 是否只是指针文件。我的修复需要一个带有 git lfs 的自定义笔记本图像,以允许 jupyter 与我们现有的 git 存储库很好地工作。

The same error I saw was due to missing Git LFS (Large File Storage) support in the minimal notebook Docker image in JupyterHub.

Since the image doesn't include Git LFS by default, any files in my Git repository that were committed with Git LFS such as .xlsx files are only partially cloned as pointer files. These corrupted files are not readable by pandas.

File size can be used to check if the xlsx is only the pointer file. My fix would need a custom notebook image with git lfs to allow jupyter work nicely with our existing git repos.

荆棘i 2025-01-17 15:49:31

当我正在读取的文件打开时,我在 mac 中遇到了此错误消息。当我关闭文件并再次运行代码时,问题得到解决。

I faced this error message in mac when the file I was reading was open. When I closed the file and ran the code again, the issue was resolved.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文