Convert UTF-8 with BOM to UTF-8 with no BOM in Python
Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't really see any good examples of its usage. Would this be the best way to handle this?
source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I have seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output UTF-8 without BOM?
Edit 1: proposed solution from below (thanks!)
fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)
This gives me the following error:
IOError: [Errno 9] Bad file descriptor
Newsflash
I'm being told in comments that the mistake is that I opened the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
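For readers who hit the same error, here is a minimal sketch of what the corrected snippet might look like. It is written for Python 3 (the question uses Python 2), and the function name strip_bom_in_place is only an illustrative choice, not something from the thread. Besides switching to 'r+b', it seeks back to the start before writing and truncates afterwards, since the rewritten content is shorter than the original.

```python
def strip_bom_in_place(path):
    """Rewrite a small file in place as UTF-8 without a BOM (hypothetical helper)."""
    # 'r+b' opens an existing file for both reading and writing in binary mode.
    with open(path, "r+b") as fp:
        raw = fp.read()
        # 'utf-8-sig' drops a leading BOM if one is present.
        text = raw.decode("utf-8-sig")
        data = text.encode("utf-8")
        fp.seek(0)      # go back to the start before writing
        fp.write(data)
        fp.truncate()   # cut off the leftover bytes at the end
    return data
```

This reads the whole file into memory, so it is only suitable for files that comfortably fit there.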
This answer is for Python 2.
Simply use the "utf-8-sig" codec to decode the bytes read from the file. That gives you a unicode string without the BOM. You can then encode it as "utf-8" to get a normal UTF-8 encoded string back.

If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can strip them out by rewriting the file in place: open the file, read a chunk, and write it back 3 bytes earlier than where it was read, repeating until the end of the file and then truncating the last 3 bytes. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
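The in-place approach for large files can be sketched roughly as follows. This is an illustration written for Python 3, not the answer's original code; the name remove_bom_chunked and the 4096-byte buffer size are assumptions made for the example.

```python
import codecs
import os

BUFSIZE = 4096                   # chunk size; an arbitrary choice
BOMLEN = len(codecs.BOM_UTF8)    # the UTF-8 BOM is 3 bytes long

def remove_bom_chunked(path, bufsize=BUFSIZE):
    """Shift the file contents 3 bytes toward the start, chunk by chunk."""
    with open(path, "r+b") as fp:
        chunk = fp.read(bufsize)
        if not chunk.startswith(codecs.BOM_UTF8):
            return False         # no BOM found, leave the file untouched
        chunk = chunk[BOMLEN:]
        write_pos = 0
        while chunk:
            fp.seek(write_pos)
            fp.write(chunk)
            write_pos += len(chunk)
            # The next unread byte sits BOMLEN bytes past the write position.
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(bufsize)
        # Everything has been shifted; drop the 3 stale bytes at the end.
        fp.seek(write_pos)
        fp.truncate()
        return True
```

Only one chunk is held in memory at a time. If the process is interrupted halfway, the file is left partially shifted, which is the trade-off against writing to a new file.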
As for guessing the encoding, you can loop through candidate encodings from most to least specific and keep the first one that decodes without an error.

A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we fall back to Latin-1, which will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
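A minimal Python 3 sketch of that loop is shown below. The function name decode_guess and the choice to return the encoding name alongside the text (with None marking the Latin-1 fallback) are assumptions for illustration, not part of the answer.

```python
def decode_guess(data):
    """Return (text, encoding); encoding is None when Latin-1 was the fallback."""
    # Try the stricter encodings first; 'utf-8-sig' also accepts BOM-less UTF-8.
    for encoding in ("utf-8-sig", "utf-16"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so this cannot fail.
    return data.decode("latin-1"), None
```

One caveat: the UTF-16 step accepts many even-length byte strings that are not really UTF-16, so a None result is not the only case worth double-checking.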
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding.
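A short sketch of that read-and-rewrite step, assuming the file is read with the utf-8-sig codec so the BOM is dropped on the way in. The function name rewrite_without_bom and the newline="" arguments (used here to leave line endings untouched) are choices made for this example.

```python
def rewrite_without_bom(path):
    """Read a text file and write it back as UTF-8 without a BOM."""
    # Decoding with 'utf-8-sig' drops a leading BOM if there is one.
    with open(path, mode="r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    # Encoding with plain 'utf-8' never writes a BOM.
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Reading with plain utf-8 instead would keep the BOM as a U+FEFF character at the start of the text, and it would be written straight back out.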
I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header. For those who are looking for a way to deal with the header so that ConfigParser can open the config file instead of reporting the error "File contains no section headers", read the file with a BOM-aware codec such as utf-8-sig. This could save you tons of effort by making it unnecessary to remove the BOM header from the file.
(I know this sounds unrelated, but hopefully this could help people struggling like me.)
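As a sketch of how this might look in Python 3: ConfigParser.read() accepts an encoding argument, and passing "utf-8-sig" lets the parser skip the BOM. The helper name read_config_with_bom is hypothetical.

```python
import configparser

def read_config_with_bom(path):
    """Parse an INI file that may start with a UTF-8 BOM."""
    parser = configparser.ConfigParser()
    # 'utf-8-sig' skips a leading BOM, so the first section header is recognized.
    parser.read(path, encoding="utf-8-sig")
    return parser
```

The same call also works for files without a BOM, so it can be used unconditionally.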
My implementation converts a file in any kind of encoding to UTF-8 without BOM and replaces Windows line endings with the universal format.
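A rough Python 3 sketch of such a converter follows. It is not the answer's code: BOM sniffing with a UTF-8 attempt and a Latin-1 fallback stands in for whatever detection the original used, so it cannot recognize arbitrary encodings. The names convert_to_utf8, universal_newlines and fallback are assumptions.

```python
import codecs

# BOM signatures; UTF-32 is checked first because its little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def convert_to_utf8(path, universal_newlines=True, fallback="latin-1"):
    """Rewrite a file as UTF-8 without BOM, optionally normalizing CRLF to LF."""
    with open(path, "rb") as f:
        raw = f.read()
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM while decoding.
            text = raw.decode(encoding)
            break
    else:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = raw.decode(fallback)   # assumption: Latin-1 as last resort
    if universal_newlines:
        text = text.replace("\r\n", "\n")
    data = text.encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)
    return data
```

For inputs without a BOM that are neither UTF-8 nor Latin-1, a dedicated detection library would be needed; this sketch will silently mis-decode them.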
You can use codecs.
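The answer does not say how, so the following is only one possible reading: a Python 3 sketch that uses an incremental decoder from the codecs module to copy a file to a new path without the BOM, one chunk at a time. The function name copy_without_bom and the buffer size are assumptions.

```python
import codecs

def copy_without_bom(src, dst, bufsize=64 * 1024):
    """Stream src to dst, dropping a leading UTF-8 BOM if present."""
    # The incremental decoder copes with a BOM or a multi-byte character
    # that happens to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8-sig")()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(bufsize)
            # final=True on the last (empty) read flushes any buffered bytes.
            text = decoder.decode(chunk, final=not chunk)
            fout.write(text.encode("utf-8"))
            if not chunk:
                break
```

To get an in-place effect, the new file can afterwards be moved over the original with os.replace(dst, src), at the cost of temporarily using twice the disk space.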
In Python 3 you just need to add encoding='utf-8-sig'; that's it.
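The answer does not show which call takes the argument; the sketch below assumes it is the built-in open() used for reading. The helper name read_text_without_bom is hypothetical.

```python
def read_text_without_bom(path):
    """Return the file's text with any leading UTF-8 BOM removed."""
    # 'utf-8-sig' skips a BOM when reading and is harmless when none is present.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()
```

Keep in mind that the same codec prepends a BOM when used for writing, so the output side should use plain 'utf-8' if the goal is a file without one.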