Character encoding violated

Posted on 2024-09-24 12:35:41

I am trying to parse a file encoded in UTF-8. Everything works except writing to the file (or at least I think so). A minimal working example follows:

from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)
tree.write('aaaaaaaaaaaaaaaaa.html')

example.txt:

<html>
    <body>
        <invalid html here/>
        <interesting attrib1="yes">
            <group>
                <line>
                    δεδομένα1
                </line>
            </group>
            <group>
                <line>
                    δεδομένα2
                </line>
            </group>
            <group>
                <line>
                    δεδομένα3
                </line>
            </group>
        </interesting>
    </body>
</html>

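I am aware of a similar previous question, but I could not solve the problem, whether I left the output encoding unspecified or set it to utf8 or iso-8859-7.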

I have concluded that the file is in UTF-8, since it displays correctly in Chrome when that encoding is selected. My editor (Kate) agrees.
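As a check that does not depend on a browser or editor, the bytes can be decoded strictly as UTF-8. This is only a minimal sketch: the helper name is_valid_utf8 and the inline sample bytes are illustrative assumptions, not part of the original example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Sample bytes standing in for the file contents (assumption).
utf8_sample = "δεδομένα1".encode("utf-8")
greek_sample = "δεδομένα1".encode("iso-8859-7")

print(is_valid_utf8(utf8_sample))   # True
print(is_valid_utf8(greek_sample))  # False

# For the real file, something like the following would apply:
# with open("example.txt", "rb") as f:
#     print(is_valid_utf8(f.read()))
```

A clean strict decode is strong evidence, though not proof, that the file really is UTF-8.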

I get no runtime error, but the output is not as desired. Example output with tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
        <invalid html="" here=""/><interesting attrib1="yes"><group><line>
                    Î´ÎµÎ´Î¿Î¼ÎÎ½Î±1
                </line></group><group><line>
                    Î´ÎµÎ´Î¿Î¼ÎÎ½Î±2
                </line></group><group><line>
                    Î´ÎµÎ´Î¿Î¼ÎÎ½Î±3
                </line></group></interesting></body></html>

兲鉂ぱ嘚淚 2024-10-01 12:35:41

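The obvious problem is that HTMLParser does not assume UTF-8 by default: it treats the input as a single-byte ("ANSI") encoding, so each byte of a multi-byte UTF-8 sequence is misinterpreted as a separate 8-bit character code. You can fix this by passing the encoding explicitly: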

parser = etree.HTMLParser(encoding = "utf-8")

To see what I mean by the misinterpretation, have Python print repr(tree.xpath("//line")[0].text) with and without HTMLParser's encoding parameter.
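The comparison can be sketched as follows. To keep it self-contained, it parses an inline byte string instead of example.txt; the sample markup and the helper first_line_text are assumptions made for illustration. What the default parser returns may vary with the lxml/libxml2 version, so only the explicit-encoding result is annotated.

```python
from lxml import etree

# Inline stand-in for example.txt: UTF-8 bytes with no charset declaration (assumption).
html_bytes = "<html><body><line>δεδομένα1</line></body></html>".encode("utf-8")


def first_line_text(parser):
    """Parse the sample bytes with the given parser and return the first <line> text."""
    root = etree.fromstring(html_bytes, parser)
    return root.xpath("//line")[0].text


# Parser left to guess the encoding on its own.
default_text = first_line_text(etree.HTMLParser())

# Parser told explicitly that the input is UTF-8.
utf8_text = first_line_text(etree.HTMLParser(encoding="utf-8"))

print(repr(default_text))
print(repr(utf8_text))  # 'δεδομένα1'
```

If the first repr shows characters such as Î´Îµ, the bytes were decoded one at a time, which matches the garbled output in the question.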

