Extracting files with invalid characters in their filenames using Python

Posted on 2024-08-12 05:27:39

I use Python's zipfile module to extract a .zip archive (take the file at http://img.dafont.com/dl/?f=akvaleir as an example).

import zipfile

# Python 2 code: print is a statement and filenames are byte strings.
f = zipfile.ZipFile('akvaleir.zip', 'r')
for fileinfo in f.infolist():
    print fileinfo.filename
    f.extract(fileinfo, '.')

Its output:

Akval�ir_Normal_v2007.ttf
Akval�ir, La police - The Font - Fr - En.pdf

Both files are inaccessible after extraction because their filenames contain invalidly encoded characters. The problem is that the zipfile module's extract method has no option for specifying the output filename.

However, "unzip akvaleir.zip" handles the filenames well:

root@host:~# unzip akvaleir.zip 
Archive:  akvaleir.zip
  inflating: AkvalВir_Normal_v2007.ttf  
  inflating: AkvalВir, La police - The Font - Fr - En.pdf  

I tried capturing the output of "unzip -l akvaleir.zip" in my Python program, and the two filenames come out as:

Akval\xd0\x92ir_Normal_v2007.ttf
Akval\xd0\x92ir, La police - The Font - Fr - En.pdf

How can I get the correct filenames, the way the unzip command does, without capturing the output of "unzip -l akvaleir.zip"?

Comments (3)

还在原地等你 2024-08-19 05:27:39

It took some time but I think I found the answer.

I assumed the word was supposed to be Akvaléir. I found a page description about that, in French. When I used your code snippet I had a string like

>>> fileinfo.filename
'Akval\x82ir, La police - The Font - Fr - En.pdf'
>>> 

That didn't decode correctly as UTF-8, Latin-1, CP-1251 or CP-1252. I then found that CP863 is a French Canadian code page, so perhaps the archive came from French Canada.

>>> print unicode(fileinfo.filename, "cp863").encode("utf8")
Akvaléir, La police - The Font - Fr - En.pdf
>>> 
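
The trial and error above can be reproduced in a few lines. This is a minimal Python 3 sketch, not part of the original answer; the helper name `try_encodings` and the list of candidate code pages are chosen here purely for illustration. It decodes the raw bytes of the member name under each candidate. Byte 0x82 is 'é' in CP437 and CP863 but the Cyrillic 'В' in CP866, which would be consistent with the name that unzip printed in the question if unzip assumed CP866.

```python
def try_encodings(raw, candidates):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Raw member name as stored in the archive (byte 0x82 is the problem character).
raw_name = b"Akval\x82ir_Normal_v2007.ttf"

decoded = try_encodings(raw_name, ["utf-8", "cp437", "cp863", "cp866"])
for encoding, text in decoded.items():
    print(encoding, repr(text))
```

Single-byte code pages rarely fail to decode, so a clean result only means the bytes are valid in that code page, not that the guess is right; the name still has to be checked by eye.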

However, I then read the Zip file format specification which says

The ZIP format has historically
supported only the original IBM PC
character encoding set, commonly
referred to as IBM Code Page 437.

...

If general purpose bit 11 is set, the
filename and comment must support The
Unicode Standard, Version 4.1.0 or
greater using the character encoding
form defined by the UTF-8 storage
specification.
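
The rule quoted above can be written as a small function. This is a hedged Python 3 sketch under the assumption that you have the raw name bytes and the general purpose flags of an entry; `decode_member_name` and `UTF8_FLAG` are names invented here, not part of the zipfile module.

```python
UTF8_FLAG = 0x800  # general purpose bit 11 in the ZIP header


def decode_member_name(raw, flag_bits):
    """Decode a raw ZIP member name according to the rule quoted above."""
    if flag_bits & UTF8_FLAG:
        # Bit 11 set: the name is stored as UTF-8.
        return raw.decode("utf-8")
    # Bit 11 clear: fall back to the historical IBM Code Page 437.
    return raw.decode("cp437")


print(decode_member_name(b"Akval\x82ir", 0))
```

In Python 3 the zipfile module already decodes names along these lines, and each ZipInfo object exposes the flags as `flag_bits`, so the function is mainly useful for understanding what happens or for working with raw bytes.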

Testing that out gives me the same answer as the Canadian code page:

>>> print unicode(fileinfo.filename, "cp437").encode("utf8")
Akvaléir, La police - The Font - Fr - En.pdf
>>>

I don't have a Unicode encoded zip file and I'm not going to create one to find out, so I'll just assume that all zip files have the cp437 encoding.

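import shutil
import zipfile

# Python 2: ZipInfo.filename is a byte string here, so decode it explicitly.
f = zipfile.ZipFile('akvaleir.zip', 'r')
for fileinfo in f.infolist():
    filename = unicode(fileinfo.filename, "cp437")
    # Write each member under the decoded name instead of calling extract().
    with open(filename, "wb") as outputfile:
        shutil.copyfileobj(f.open(fileinfo.filename), outputfile)
f.close()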

On my Mac that gives

 109936 Nov 27 01:46 Akvale??ir_Normal_v2007.ttf
  25244 Nov 27 01:46 Akvale??ir, La police - The Font - Fr - En.pdf

which tab-completes to

ls Akvale\314\201ir

and shows up with a nice 'é' in my file browser.

就是爱搞怪 2024-08-19 05:27:39

Instead of the extract method, use the open method and save the resulting pseudofile to disk under whatever name you wish, for example with shutil.copyfileobj.
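
A minimal Python 3 sketch of this approach follows. It is an illustration under stated assumptions, not the answerer's code: the function name `extract_with_encoding` and its `legacy_encoding` parameter are hypothetical, and it assumes that Python 3 has decoded any name without the UTF-8 flag as CP437, so that encoding back to CP437 recovers the original bytes.

```python
import os
import shutil
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def extract_with_encoding(zip_path, dest_dir, legacy_encoding="cp437"):
    """Extract every member, re-decoding names that are not flagged as UTF-8."""
    root = os.path.realpath(dest_dir)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = info.filename
            if not info.flag_bits & UTF8_FLAG:
                # Assumption: recover the raw bytes, then decode them as requested.
                name = name.encode("cp437").decode(legacy_encoding)
            target = os.path.realpath(os.path.join(root, name))
            # Refuse names that would land outside the destination directory.
            if os.path.commonpath([root, target]) != root:
                raise ValueError("unsafe member name: %r" % name)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # open() gives a file-like object; copy it under the chosen name.
            with archive.open(info) as source, open(target, "wb") as output:
                shutil.copyfileobj(source, output)
            written.append(target)
    return written
```

Calling `extract_with_encoding('akvaleir.zip', '.')` would keep the CP437 reading of the names, while passing a different `legacy_encoding` lets you try another code page without touching the archive.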

鼻尖触碰 2024-08-19 05:27:39

I ran into a similar issue while running my application in Docker. Adding these lines to the Dockerfile fixed everything for me:

RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

So I guess that even if you're not using Docker, it's worth making sure locales are properly generated and set.
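
One way to check what the locale settings amount to from inside Python is to print the encodings the interpreter has picked up. This is a small Python 3 sketch; `describe_encodings` is a name made up for this example, and the values it reports depend entirely on the environment it runs in.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python uses for filenames, text and stdout."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for label, value in describe_encodings().items():
    print(label, value)
```

If these report an ASCII-only encoding inside a container, non-ASCII filenames are likely to cause trouble until the locale is configured.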
