带有 BOM 的 UTF-8 HTML 和 CSS 文件(以及如何使用 Python 删除 BOM)

发布于 2024-08-25 17:35:54 字数 617 浏览 10 评论 0原文

首先,一些背景知识:我正在使用 Python 开发一个 Web 应用程序。我的所有(文本)文件当前都以带有 BOM 的 UTF-8 格式存储。这包括我所有的 HTML 模板和 CSS 文件。这些资源作为二进制数据(BOM 和所有)存储在我的数据库中。

当我从数据库检索模板时,我使用 template.decode('utf-8') 对其进行解码。当 HTML 到达浏览器时,BOM 出现在 HTTP 响应正文的开头。这会在 Chrome 中生成一个非常有趣的错误:

Extra ;遭遇。将属性迁移回原始的 元素并忽略标签。

Chrome 似乎在看到 BOM 并将其误认为内容时自动生成 标签,从而使真正的 < /code> 标记错误。

那么,使用 Python,从我的 UTF-8 编码模板中删除 BOM 的最佳方法是什么(如果存在 - 我不能保证将来会这样做)?

对于 CSS 等其他基于文本的文件,主流浏览器能否正确解释(或忽略)BOM?它们以纯二进制数据形式发送,无需 .decode('utf-8')

注意:我使用的是Python 2.5。

谢谢!

First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.

When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:

Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.

Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, making the real <html> tag an error.

So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?

For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').

Note: I am using Python 2.5.

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

素食主义者 2024-09-01 17:35:55

解码后检查第一个字符是否为BOM:

if u.startswith(u'\ufeff'):
  u = u[1:]

Check the first character after decoding to see if it's the BOM:

if u.startswith(u'\ufeff'):
  u = u[1:]
淡水深流 2024-09-01 17:35:55

之前接受的答案是错误的。

u'\ufffe' 不是一个字符。如果你把它放在一个 unicode 字符串中,那么有人已经把它塞满了。

BOM(又名零宽度无中断空间)是 u'\ufeff'

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>

阅读 (Ctrl-F 搜索 BOM)和 (Ctrl-F 搜索 BOM)。

这是一个正确且防错字/脑力的答案:

将您的输入解码为 unicode_str。然后这样做:

# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]

额外的好处:使用命名常量可以让读者比一组看似任意的六角形文字更多地了解正在发生的事情。

更新 不幸的是,标准Python库中似乎没有合适的命名常量。

唉,编解码器模块只提供了“陷阱和错觉”:

>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'),   #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>

更新 2 如果您尚未对输入进行解码,并希望检查其 BOM,则需要检查 对于 UTF-16,有两种不同的 BOM;对于 UTF-32,至少有两种不同的 BOM。如果每种方法只有一种,那么您就不需要 BOM,不是吗?

这里逐字未修饰的我自己的代码是我的解决方案:

def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF',     3, 'UTF-8'),
        ('\xFF\xFE',         2, 'UTF-16LE'),
        ('\xFE\xFF',         2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

输入 s 应该至少是输入的前 4 个字节。它返回可用于解码输入的 BOM 后部分的编码,以及 BOM 的长度(如果有)。

如果你偏执,你可以允许另外 2 个(非标准)UTF-32 排序,但 Python 不提供它们的编码,而且我从未听说过实际发生的情况,所以我不打扰。

The previously-accepted answer is WRONG.

u'\ufffe' is not a character. If you get it in a unicode string somebody has stuffed up mightily.

The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff'

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>

Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).

Here's a correct and typo/braino-resistant answer:

Decode your input into unicode_str. Then do this:

# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]

Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.

Update Unfortunately there seems to be no suitable named constant in the standard Python library.

Alas, the codecs module provides only "a snare and a delusion":

>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'),   #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>

Update 2 If you have not yet decoded your input, and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there was only one way each, then you wouldn't need a BOM, would you?

Here verbatim unprettified from my own code is my solution to this:

def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF',     3, 'UTF-8'),
        ('\xFF\xFE',         2, 'UTF-16LE'),
        ('\xFE\xFF',         2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).

If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.

云巢 2024-09-01 17:35:55

您可以使用类似的方法来删除 BOM:

import os, codecs
def remove_bom_from_file(filename, newfilename):
    if os.path.isfile(filename):
        # open file
        f = open(filename,'rb')

        # read first 4 bytes
        header = f.read(4)

        # check if we have BOM...
        bom_len = 0
        encodings = [ ( codecs.BOM_UTF32, 4 ),
            ( codecs.BOM_UTF16, 2 ),
            ( codecs.BOM_UTF8, 3 ) ]

        # ... and remove appropriate number of bytes    
        for h, l in encodings:
            if header.startswith(h):
                bom_len = l
                break
        f.seek(0)
        f.read(bom_len)

        # copy the rest of file
        contents = f.read() 
        nf = open(newfilename)
        nf.write(contents)
        nf.close()

You can use something similar to remove BOM:

import os, codecs
def remove_bom_from_file(filename, newfilename):
    if os.path.isfile(filename):
        # open file
        f = open(filename,'rb')

        # read first 4 bytes
        header = f.read(4)

        # check if we have BOM...
        bom_len = 0
        encodings = [ ( codecs.BOM_UTF32, 4 ),
            ( codecs.BOM_UTF16, 2 ),
            ( codecs.BOM_UTF8, 3 ) ]

        # ... and remove appropriate number of bytes    
        for h, l in encodings:
            if header.startswith(h):
                bom_len = l
                break
        f.seek(0)
        f.read(bom_len)

        # copy the rest of file
        contents = f.read() 
        nf = open(newfilename)
        nf.write(contents)
        nf.close()
∞琼窗梦回ˉ 2024-09-01 17:35:54

既然你说:

我的所有(文本)文件目前都在
以 UTF-8 格式存储,带 BOM

,然后使用“utf-8-sig”编解码器对其进行解码:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

它会自动删除预期的 BOM,并且如果 BOM 不存在也可以正常工作。

Since you state:

All of my (text) files are currently
stored in UTF-8 with the BOM

then use the 'utf-8-sig' codec to decode them:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

It automatically removes the expected BOM, and works correctly if the BOM is not present as well.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文