Python中将带BOM的UTF-8转换为无BOM的UTF-8

发布于 2024-12-27 12:04:32 字数 801 浏览 2 评论 0原文

这里有两个问题。我有一组文件,通常是带有 BOM 的 UTF-8。我想将它们(最好就地)转换为没有 BOM 的 UTF-8。看起来codecs.StreamRecoder(stream,encode,decode,Reader,Writer,errors)会处理这个问题。但我真的没有看到任何好的用法示例。这是处理这个问题的最佳方法吗?

source files:
Tue Jan 17$ file brh-m-157.json 
brh-m-157.json: UTF-8 Unicode (with BOM) text

另外,如果我们能够在不明确知道的情况下处理不同的输入编码(参见 ASCII 和 UTF-16),那就太理想了。看来这一切应该都是可行的。有没有一种解决方案可以将任何已知的Python编码并输出为UTF-8而不带BOM?

编辑1从下面提出的解决方案(谢谢!)

fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding  
fp.write(s)

这给了我以下错误:

IOError: [Errno 9] Bad file descriptor

Newsflash

我在评论中被告知错误是我使用模式“rw”而不是“打开文件” r+'/'r+b',所以我最终应该重新编辑我的问题并删除已解决的部分。

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I don't really see any good examples on usage. Would this be the best way to handle this?

source files:
Tue Jan 17$ file brh-m-157.json 
brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encoding wihtout explicitly knowing (seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?

edit 1 proposed sol'n from below (thanks!)

fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding  
fp.write(s)

This gives me the following error:

IOError: [Errno 9] Bad file descriptor

Newsflash

I'm being told in comments that the mistake is I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

千秋岁 2025-01-03 12:04:32

这个答案适用于Python 2

只需使用“utf-8-sig”编解码器

fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")

这为您提供了一个不带 BOM 的 unicode 字符串。然后,您可以使用

s = u.encode("utf-8")

s 获取正常的 UTF-8 编码字符串。如果您的文件很大,那么您应该避免将它们全部读入内存。 BOM 只是文件开头的三个字节,因此您可以使用以下代码将它们从文件中删除:

import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()

它打开文件,读取一个块,然后将其写到比读取位置早 3 个字节的文件中它。该文件被就地重写。更简单的解决方案是将较短的文件写入新文件,例如 newtover 的答案。这会更简单,但会在短时间内使用两倍的磁盘空间。

至于猜测编码,那么您可以从最具体到最不具体循环编码:

def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1") # will always work

UTF-16 编码的文件不会解码为 UTF-8,因此我们首先尝试使用 UTF-8。如果失败,那么我们尝试使用 UTF-16。最后,我们使用 Latin-1 ——这将始终有效,因为所有 256 个字节都是 Latin-1 中的合法值。在这种情况下,您可能希望返回 None ,因为这实际上是一个后备,并且您的代码可能希望更仔细地处理这个问题(如果可以的话)。

This answer is for Python 2

Simply use the "utf-8-sig" codec:

fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()

It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in-place. As easier solution is to write the shorter file to a new file like newtover's answer. That would be simpler, but use twice the disk space for a short period.

As for guessing the encoding, then you can just loop through the encoding from most to least specific:

def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1") # will always work

An UTF-16 encoded file wont decode as UTF-8, so we try with UTF-8 first. If that fails, then we try with UTF-16. Finally, we use Latin-1 — this will always work since all 256 bytes are legal values in Latin-1. You may want to return None instead in this case since it's really a fallback and your code might want to handle this more carefully (if it can).

夏见 2025-01-03 12:04:32

在 Python 3 中,这非常简单:读取文件并使用 utf-8 编码重写它:

s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)

In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:

s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
嘦怹 2025-01-03 12:04:32
import codecs
import shutil
import sys

s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)

shutil.copyfileobj(sys.stdin, sys.stdout)
import codecs
import shutil
import sys

s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)

shutil.copyfileobj(sys.stdin, sys.stdout)
咿呀咿呀哟 2025-01-03 12:04:32

我发现这个问题是因为在打开带有 UTF8 BOM 标头的文件时遇到 configparser.ConfigParser().read(fp) 问题。

对于那些正在寻找解决方案来删除标头以便 ConfigPhaser 可以打开配置文件而不是报告以下错误的人:
文件不包含节标头,请按如下方式打开文件:

configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")

这样无需删除文件的 BOM 标头,可以为您节省大量精力。

(我知道这听起来无关,但希望这可以帮助像我一样苦苦挣扎的人。)

I found this question because having trouble with configparser.ConfigParser().read(fp) when opening files with UTF8 BOM header.

For those who are looking for a solution to remove the header so that ConfigPhaser could open the config file instead of reporting an error of:
File contains no section headers, please open the file like the following:

configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")

This could save you tons of effort by making the remove of the BOM header of the file unnecessary.

(I know this sounds unrelated, but hopefully this could help people struggling like me.)

美人迟暮 2025-01-03 12:04:32

这是我的实现,将任何类型的编码转换为没有 BOM 的 UTF-8 并用通用格式替换 windows enlines:

def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
                        by default convert endlines to universal format.
    '''

    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))

    # Read from file
    file_open = open(file_path)
    raw = file_open.read()
    file_open.close()

    # Decode
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove windows end line
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, '', 1)

    # Write to file
    file_open = open(file_path, 'w')
    file_open.write(raw)
    file_open.close()
    return 0

This is my implementation to convert any kind of encoding to UTF-8 without BOM and replacing windows enlines by universal format:

def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
                        by default convert endlines to universal format.
    '''

    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))

    # Read from file
    file_open = open(file_path)
    raw = file_open.read()
    file_open.close()

    # Decode
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove windows end line
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, '', 1)

    # Write to file
    file_open = open(file_path, 'w')
    file_open.write(raw)
    file_open.close()
    return 0
心奴独伤 2025-01-03 12:04:32

您可以使用编解码器。

import codecs
with open("test.txt",'r') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print content.decode("utf-8")

You can use codecs.

import codecs
with open("test.txt",'r') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print content.decode("utf-8")
離殇 2025-01-03 12:04:32

在 python3 中,您应该添加 encoding='utf-8-sig'

with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)

就是这样。

In python3 you should add encoding='utf-8-sig':

with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)

that's it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文