无法确定Excel文件格式,必须手动指定引擎
我不确定为什么会收到此错误,尽管有时我的代码工作正常!
无法确定Excel文件格式,您必须手动指定引擎。
以下是我的代码,包含步骤:
1-客户ID列列表:
customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]
2-查找文件夹中所有xlsx文件的代码并阅读它们:
l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
df.columns = ["ID"] # to have only one column once concat
l.append(df)
all_data = pd.concat(l, ignore_index=True) # concat all data
我添加了引擎 openpyxl
df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)
现在我得到了一个不同的错误:
BadZipFile: File is not a zip file
pandas 版本:1.3.0 python版本:python3.9 os:MacOS
有没有更好的方法从文件夹中读取所有 xlsx 文件?
I am not sure why I am getting this error although sometimes my code works fine!
Excel file format cannot be determined, you must specify an engine manually.
Here below is my code with steps:
1- list of columns of customers Id:
customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]
2- The code to find all xlsx files in a folder and read them:
l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
df.columns = ["ID"] # to have only one column once concat
l.append(df)
all_data = pd.concat(l, ignore_index=True) # concat all data
I added the engine openpyxl
df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)
Now I got a different error:
BadZipFile: File is not a zip file
pandas version: 1.3.0
python version: python3.9
os: MacOS
is there a better way to read all xlsx files from a folder ?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(12)
找到了。例如,当 MS Excel 打开 Excel 文件时,会在同一目录中创建隐藏的临时文件:
因此,当我运行代码从文件夹中读取所有文件时,会出现错误:
当所有文件都关闭且 <强>没有隐藏的临时文件
~$filename.xlsx
在同一目录中,代码完美运行。Found it. When an excel file is opened for example by MS excel a hidden temporary file is created in the same directory:
So, when I run the code to read all the files from the folder it gives me the error:
When all files are closed and no hidden temporary files
~$filename.xlsx
in the same directory the code works perfectly.另请确保您使用正确的
pd.read_*
方法。我在尝试使用read_excel()
而不是read_csv()
打开.csv
文件时遇到了此错误。我发现了这个方便的片段 此处可根据 Excel 文件类型自动选择正确的方法。Also make sure you're using the correct
pd.read_*
method. I ran into this error when attempting to open a.csv
file withread_excel()
instead ofread_csv()
. I found this handy snippet here to automatically select the correct method by Excel file type.就我而言,我使用了
xlrd
。因此,在终端中:如果未安装 pandas,请安装它:
现在以这种方式读取 excel 文件:
In my case, I used
xlrd
. So in terminal:If pandas is not installed, install it:
Now read the excel file this way:
您可以通过检查文件是否以“~”开头来过滤掉不需要的临时文件。
You can filter out the unwanted temp files by checking if file starts with "~".
https://stackoverflow.com/a/32241271/17411729
链接到有关如何删除隐藏文件的答案
Mac = go到文件夹按 cmd + shift + 。
将显示隐藏文件,将其删除,然后再运行。
https://stackoverflow.com/a/32241271/17411729
link to an answer on how to remove hidden files
Mac = go to folder press cmd + shift + .
will show the hidden file, delete it, run it back.
在 macOS 中,每个文件夹中都会自动生成一个名为“.DS_Store”的“不可见文件”。对我来说,这就是问题的根源。我用 if 语句解决了这个问题,以绕过“不可见文件”(这不是 xlsx,因此会触发错误)
In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the "invisible file" (which is not an xlsx, so thus would trigger the error)
当我手动将“CSV”后缀更改为“XLS”时,我还收到“Excel 文件格式...”错误。我所要做的就是打开 Excel 并将其保存为我想要的格式。
I also got an 'Excel file format...' error when I manually changed the 'CSV' suffix to 'XLS'. All I had to do was open excel and save it to the format I wanted.
看起来这个问题很容易解决。转到您的 Excel 文件,无论是 xls、xlsx 还是任何其他扩展名,然后从文件图标“另存为”。当提示选项时。另存为 CSV UTF-8(逗号分隔)(*.csv)
Looks like an easy fix for this one. Go to your excel file, whether it is xls or xlsx or any other extension, and do "save as" from file icon. When prompted with options. Save it as CSV UTF-8(Comma delimited)(*.csv)
对我来说,没有像 ~$datasheet.xlsx 这样的临时 Excel 文件。
但是...当我不小心在 Excel 中打开其中一个文件时,不知何故,Excel 使该文件无法使用。
当我再次尝试用Excel打开它时,我意识到了这一点,Excel也无法打开它。
所以我不得不再次获取文件,这解决了问题。
For me, there were no temporary Excel files like ~$datasheet.xlsx.
but... When I accidentally opened one of the files in Excel, somehow, Excel made the file unuseable.
I realised this when I tried to open it with Excel again, and Excel could not open it, either.
So I had to get the file again, which solved the issue.
如果您在尝试使用 URL 从 Google 表格中读取表格时遇到上述错误,请尝试
read_html
(请注意,read_html
返回一个列表,因此我们需要索引获取数据帧):另一种可行的方法是导出到 csv(将 URL 修改为 /export 而不是 /edit 并使用
read_csv
):If you got the above error when you tried to read a table from Google Sheets using its URL, then try
read_html
(note thatread_html
returns a list, so we need to index it to get the dataframe):Another way that also works is to export to csv (modify the URL to /export instead of /edit and use
read_csv
):我看到的相同错误是由于 JupyterHub 中的最小笔记本 Docker 映像缺少 Git LFS(大文件存储)支持。
由于默认情况下该映像不包含 Git LFS,因此我的 Git 存储库中使用 Git LFS 提交的任何文件(例如
.xlsx
文件)仅部分克隆为指针文件。 pandas 无法读取这些损坏的文件。文件大小可用于检查 xlsx 是否只是指针文件。我的修复需要一个带有 git lfs 的自定义笔记本图像,以允许 jupyter 与我们现有的 git 存储库很好地工作。
The same error I saw was due to missing Git LFS (Large File Storage) support in the minimal notebook Docker image in JupyterHub.
Since the image doesn't include Git LFS by default, any files in my Git repository that were committed with Git LFS such as
.xlsx
files are only partially cloned as pointer files. These corrupted files are not readable by pandas.File size can be used to check if the
xlsx
is only the pointer file. My fix would need a custom notebook image with git lfs to allow jupyter work nicely with our existing git repos.当我正在读取的文件打开时,我在 mac 中遇到了此错误消息。当我关闭文件并再次运行代码时,问题得到解决。
I faced this error message in mac when the file I was reading was open. When I closed the file and ran the code again, the issue was resolved.