Unicode string is not displayed correctly after parsing a file with lxml, but is displayed fine after a simple file read
I'm attempting to use the lxml module to parse HTML files, but am struggling to get it to work with some UTF-8 encoded data. I'm using Python 2.7 on Windows. For example, consider a UTF-8 encoded file without a byte order mark that contains nothing but the text string Québec. If I just read the contents of the file using a regular file handle and decode the resulting string object, I get a length 6 unicode string that looks good when written back to a file. But if I parse the file with lxml, I get a length 7 unicode string that looks odd when written back to a file. Can someone explain what lxml is doing differently and how to get the original, pretty string?
For example:
import lxml.html as html
from lxml import etree
f = open("output.txt", "w")
text = open("input.txt").read().decode("utf-8")
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))
root = html.parse("input.txt")
text = root.xpath(".//p")[0].text.strip()
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))
Produces the following output in output.txt:
String of type '<type 'unicode'>' with length 6: Québec
String of type '<type 'unicode'>' with length 7: QuÃ©bec
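The length 7 result is consistent with the two UTF-8 bytes of "é" each being decoded as a separate character, which is what happens when UTF-8 data is read as Latin-1 (ISO-8859-1). The sketch below reproduces that effect with plain byte strings and no lxml at all. That lxml actually fell back to Latin-1 here is an assumption based on the symptom, not something the output above proves.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The bytes a UTF-8 file containing "Québec" holds (no byte order mark).
raw = u"Qu\u00e9bec".encode("utf-8")

# Decoding with the correct codec yields 6 characters.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as Latin-1 turns the 2-byte "é" into 2 characters.
as_latin1 = raw.decode("latin-1")

# A string damaged this way can be repaired by reversing the wrong decode.
repaired = as_latin1.encode("latin-1").decode("utf-8")

print(len(raw), len(as_utf8), len(as_latin1))  # 7 6 7
print(repaired == as_utf8)                     # True
```

The round trip through Latin-1 is only a repair for text that has already been decoded incorrectly; telling the parser the right encoding up front avoids the problem.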
EDIT
A partial workaround here seems to be to parse the file using:
etree.parse("input.txt", etree.HTMLParser(encoding="utf-8"))
or
html.parse("input.txt", etree.HTMLParser(encoding="utf-8"))
However, as far as I know, the base etree library lacks some convenience classes for things like selectors, so a solution that lets me use lxml.html without etree.HTMLParser() would still be useful.
Comments (1)
The function lxml.html.parse already uses an instance of lxml.html.HTMLParser, so you shouldn't really be averse to using it to handle the UTF-8 data.
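As a minimal sketch of that suggestion, the example below passes an lxml.html.HTMLParser with an explicit encoding to lxml.html.parse, so the resulting elements are still lxml.html elements. The temporary file is a hypothetical stand-in for input.txt, and the `.//p` lookup assumes, as the question's own code does, that the parser wraps the bare text in a paragraph element.

```python
from __future__ import print_function

import os
import tempfile

import lxml.html

# Hypothetical stand-in for input.txt: UTF-8 bytes for "Québec", no BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(u"Qu\u00e9bec".encode("utf-8"))

try:
    # State the encoding explicitly instead of letting the parser guess.
    parser = lxml.html.HTMLParser(encoding="utf-8")
    tree = lxml.html.parse(path, parser=parser)

    # Same lookup as in the question.
    text = tree.xpath(".//p")[0].text.strip()
finally:
    os.remove(path)

print(type(text), len(text))
```

Because the parser is the lxml.html variant rather than etree.HTMLParser, the convenience methods of lxml.html elements should remain available on the parsed tree.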