Character encoding is violated
I am trying to parse a file encoded in UTF-8. No operation has a problem apart from writing to the file (or at least I think so). A minimal working example follows:
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)
tree.write('aaaaaaaaaaaaaaaaa.html')
example.txt:
<html>
<body>
<invalid html here/>
<interesting attrib1="yes">
<group>
<line>
δεδομένα1
</line>
</group>
<group>
<line>
δεδομένα2
</line>
</group>
<group>
<line>
δεδομένα3
</line>
</group>
</interesting>
</body>
</html>
I am already aware of a similar previous question, but I could not solve the problem either without specifying the output encoding or by using utf8 or iso-8859-7.
I have concluded that the file is in UTF-8, since it displays correctly in Chrome when this encoding is chosen. My editor (Kate) agrees.
I get no runtime error, but the output is not as desired.
Example output with tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<invalid html="" here=""/><interesting attrib1="yes"><group><line>
δεδομÎνα1
</line></group><group><line>
δεδομÎνα2
</line></group><group><line>
δεδομÎνα3
</line></group></interesting></body></html>
Comments (1)
The obvious problem is that HTMLParser treats the input file as ANSI by default, i.e. the UTF-8 bytes are misinterpreted as 8-bit character codes. You can fix this by passing the encoding to the parser, e.g. etree.HTMLParser(encoding='utf-8').
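A slightly fuller sketch of that fix, assuming the input really is UTF-8 as the question states. The in-memory `source` bytes and the `BytesIO` output buffer are hypothetical stand-ins for example.txt and the output file, used only to keep the example self-contained:

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for example.txt: UTF-8 bytes with no charset declaration.
source = "<html><body><group><line>δεδομένα1</line></group></body></html>".encode("utf-8")

# Declare the input encoding explicitly instead of letting the parser guess.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(BytesIO(source), parser)

# The element text is now decoded as Greek rather than as 8-bit character codes.
line_text = tree.xpath("//line")[0].text

# Serialize as UTF-8; the buffer stands in for the output file.
out = BytesIO()
tree.write(out, encoding="utf-8")
written = out.getvalue().decode("utf-8")
```

With a real file, the same parser can be passed to etree.parse('example.txt', parser), and tree.write can take the output file name as in the question.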
If you want to check what I meant by the misinterpretation, let Python print repr(tree.xpath("//line")[0].text) with and without HTMLParser's encoding parameter.
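The misinterpretation itself can be reproduced without lxml. The sketch below assumes Latin-1 as the 8-bit encoding, as a stand-in for what the answer calls ANSI; the exact fallback encoding the parser uses may differ:

```python
original = "δεδομένα1"

# What is actually stored in the file: each Greek letter takes two bytes in UTF-8.
utf8_bytes = original.encode("utf-8")

# Misreading: every byte is treated as one 8-bit character (Latin-1 assumed here).
misread = utf8_bytes.decode("latin-1")

# Reversing the wrong decoding step recovers the original text.
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(original), len(original))
print(repr(misread), len(misread))
```

The misread string has one character per byte, so it is longer than the original, which is why each Greek letter shows up as two unrelated characters in the output.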