无法确定Excel文件格式，必须手动指定引擎

发布于 2025-01-10 15:49:31 字数 923 浏览 0 评论 0原文

我不确定为什么会收到此错误，尽管有时我的代码工作正常！

无法确定Excel文件格式，您必须手动指定引擎。

以下是我的代码，包含步骤：

1-客户ID列列表：

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2-查找文件夹中所有xlsx文件的代码并阅读它们：

l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"] # to have only one column once concat
    l.append(df)
all_data  = pd.concat(l, ignore_index=True) # concat all data

我添加了引擎 openpyxl

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

现在我得到了一个不同的错误：

BadZipFile: File is not a zip file

pandas 版本：1.3.0 python版本：python3.9 os：MacOS

有没有更好的方法从文件夹中读取所有 xlsx 文件？

原文

I am not sure why I am getting this error although sometimes my code works fine!

Excel file format cannot be determined, you must specify an engine manually.

Here below is my code with steps:

1- list of columns of customers Id:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2- The code to find all xlsx files in a folder and read them:

l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"] # to have only one column once concat
    l.append(df)
all_data  = pd.concat(l, ignore_index=True) # concat all data

I added the engine openpyxl

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

Now I got a different error:

BadZipFile: File is not a zip file

pandas version: 1.3.0
python version: python3.9
os: MacOS

is there a better way to read all xlsx files from a folder ?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

蓝颜夕 2025-01-17 15:49:31

找到了。例如，当 MS Excel 打开 Excel 文件时，会在同一目录中创建隐藏的临时文件：

~$datasheet.xlsx

因此，当我运行代码从文件夹中读取所有文件时，会出现错误：

Excel file format cannot be determined, you must specify an engine manually.

当所有文件都关闭且 <强>没有隐藏的临时文件 ~$filename.xlsx 在同一目录中，代码完美运行。

Found it. When an excel file is opened for example by MS excel a hidden temporary file is created in the same directory:

~$datasheet.xlsx

So, when I run the code to read all the files from the folder it gives me the error:

Excel file format cannot be determined, you must specify an engine manually.

When all files are closed and no hidden temporary files ~$filename.xlsx in the same directory the code works perfectly.

回复收藏 0 原文

稀香 2025-01-17 15:49:31

另请确保您使用正确的 pd.read_* 方法。我在尝试使用 read_excel() 而不是 read_csv() 打开 .csv 文件时遇到了此错误。我发现了这个方便的片段此处可根据 Excel 文件类型自动选择正确的方法。

if file_extension == 'xlsx':
    df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file.read())
elif file_extension == 'csv':
    df = pd.read_csv(file.read())

Also make sure you're using the correct pd.read_* method. I ran into this error when attempting to open a .csv file with read_excel() instead of read_csv(). I found this handy snippet here to automatically select the correct method by Excel file type.

if file_extension == 'xlsx':
    df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file.read())
elif file_extension == 'csv':
    df = pd.read_csv(file.read())

回复收藏 0 原文

笑看君怀她人 2025-01-17 15:49:31

就我而言，我使用了xlrd。因此，在终端中：

pip install xlrd

如果未安装 pandas，请安装它：

pip install pandas

现在以这种方式读取 excel 文件：

import pandas as pd

df = pd.read_excel("filesFolder/excelFile.xls", engine='xlrd')

In my case, I usedxlrd. So in terminal:

pip install xlrd

If pandas is not installed, install it:

pip install pandas

Now read the excel file this way:

import pandas as pd

df = pd.read_excel("filesFolder/excelFile.xls", engine='xlrd')

回复收藏 0 原文

哑剧 2025-01-17 15:49:31

您可以通过检查文件是否以“~”开头来过滤掉不需要的临时文件。

import os

for file in os.listdir(folder path):
    if not file.startswith("~") and file.endswith(".xlsx"):
        print(file)

You can filter out the unwanted temp files by checking if file starts with "~".

import os

for file in os.listdir(folder path):
    if not file.startswith("~") and file.endswith(".xlsx"):
        print(file)

回复收藏 0 原文

何必那么矫情 2025-01-17 15:49:31

https://stackoverflow.com/a/32241271/17411729

链接到有关如何删除隐藏文件的答案

Mac = go到文件夹按 cmd + shift + 。
将显示隐藏文件，将其删除，然后再运行。

回复收藏 0 原文

念﹏祤嫣 2025-01-17 15:49:31

在 macOS 中，每个文件夹中都会自动生成一个名为“.DS_Store”的“不可见文件”。对我来说，这就是问题的根源。我用 if 语句解决了这个问题，以绕过“不可见文件”（这不是 xlsx，因此会触发错误）

for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)

In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the "invisible file" (which is not an xlsx, so thus would trigger the error)

for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)

回复收藏 0 原文

樱娆 2025-01-17 15:49:31

当我手动将“CSV”后缀更改为“XLS”时，我还收到“Excel 文件格式...”错误。我所要做的就是打开 Excel 并将其保存为我想要的格式。

回复收藏 0 原文

泛泛之交 2025-01-17 15:49:31

看起来这个问题很容易解决。转到您的 Excel 文件，无论是 xls、xlsx 还是任何其他扩展名，然后从文件图标“另存为”。当提示选项时。另存为 CSV UTF-8（逗号分隔）(*.csv)

回复收藏 0 原文

梦里人 2025-01-17 15:49:31

对我来说，没有像 ~$datasheet.xlsx 这样的临时 Excel 文件。
但是...当我不小心在 Excel 中打开其中一个文件时，不知何故，Excel 使该文件无法使用。
当我再次尝试用Excel打开它时，我意识到了这一点，Excel也无法打开它。
所以我不得不再次获取文件，这解决了问题。

回复收藏 0 原文

坚持沉默 2025-01-17 15:49:31

如果您在尝试使用 URL 从 Google 表格中读取表格时遇到上述错误，请尝试 read_html（请注意，read_html 返回一个列表，因此我们需要索引获取数据帧）：

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"

df = pd.read_excel(url)           # <--- ValueError: Excel file format...
df = pd.read_html(url)[0]         # <--- OK

另一种可行的方法是导出到 csv（将 URL 修改为 /export 而不是 /edit 并使用 read_csv）：

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"
export_url = url.split("/edit", 1)[0] + "/export?format=csv"
df = pd.read_csv(export_url)

If you got the above error when you tried to read a table from Google Sheets using its URL, then try read_html (note that read_html returns a list, so we need to index it to get the dataframe):

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"

df = pd.read_excel(url)           # <--- ValueError: Excel file format...
df = pd.read_html(url)[0]         # <--- OK

Another way that also works is to export to csv (modify the URL to /export instead of /edit and use read_csv):

url = "https://docs.google.com/spreadsheets/d/{long-spreadsheet-key}/edit"
export_url = url.split("/edit", 1)[0] + "/export?format=csv"
df = pd.read_csv(export_url)

回复收藏 0 原文