lxml returns a garbled unicode string after parsing a file, but reading the file directly works fine

Posted 2025-01-05 19:36:54 · 1290 words · 1 view · 0 comments

I'm trying to use the lxml module to parse HTML files, but I'm struggling to get it to work with some UTF-8 encoded data. I'm using Python 2.7 on Windows. For example, consider a UTF-8 encoded file without a byte order mark that contains nothing but the text string Québec. If I read the contents of the file using a regular file handle and decode the resulting str object as UTF-8, I get a length-6 unicode string that looks good when written back to a file. But if I parse the file with lxml, I get a length-7 unicode string that looks odd when written back to a file. Can someone explain what lxml is doing differently and how to get the original, correct string?

For example:

import lxml.html as html
from lxml import etree

f = open("output.txt", "w")

# Read the raw bytes and decode them explicitly as UTF-8.
text = open("input.txt").read().decode("utf-8")
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

# Let lxml parse the same file with its default HTML parser.
root = html.parse("input.txt")
text = root.xpath(".//p")[0].text.strip()
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

f.close()

Produces output in output.txt of:

String of type '<type 'unicode'>' with length 6: Québec
String of type '<type 'unicode'>' with length 7: QuÃ©bec
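
The length-7 result is what one would expect if the two UTF-8 bytes of "é" were each decoded as a separate character by a single-byte encoding. A plausible explanation, not confirmed anywhere in this thread, is that the parser sees no BOM and no charset declaration and falls back to something like Latin-1. The sketch below reproduces that effect with plain string methods only; it does not call lxml, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Illustration only: reproduces the 6-vs-7 length difference without lxml.
# Assumption: the wrong decoding is Latin-1 (the thread does not confirm this).

original = u"Qu\u00e9bec"             # the 6-character string stored in input.txt
raw = original.encode("utf-8")        # 7 bytes: e-acute becomes 0xC3 0xA9

decoded_ok = raw.decode("utf-8")      # the two bytes are read as one character
decoded_bad = raw.decode("latin-1")   # each byte is read as its own character

print(len(decoded_ok))                # 6
print(len(decoded_bad))               # 7

# Because Latin-1 maps every byte to exactly one character, the damage
# can be reversed by encoding back to bytes and decoding as UTF-8.
repaired = decoded_bad.encode("latin-1").decode("utf-8")
print(repaired == original)           # True
```

The round trip in the last step only works when the wrong decoding was lossless, so it is better treated as a diagnostic than as a fix; telling the parser the real encoding avoids the problem in the first place.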

EDIT

A partial workaround here seems to be to parse the file using:

etree.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

or

html.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

However, as far as I know, the base etree library lacks some convenience classes for things like selectors, so a solution that lets me use lxml.html without etree.HTMLParser() would still be useful.

Comments (1)

梦幻之岛 2025-01-12 19:36:54

The function lxml.html.parse already uses an instance of lxml.html.HTMLParser, so you shouldn't really be averse to using

html.parse("input.txt", html.HTMLParser(encoding="utf-8"))

to handle the UTF-8 data.
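
A minimal sketch of this suggestion follows, under a few assumptions that are not in the original post: the file is replaced by an in-memory io.BytesIO object so the example is self-contained, and the text is wrapped in an explicit <p> element. Because the parser is an lxml.html.HTMLParser rather than an etree.HTMLParser, the resulting elements should still be lxml.html element objects, so helpers such as text_content() remain available.

```python
# -*- coding: utf-8 -*-
import io

import lxml.html

# Hypothetical stand-in for input.txt: UTF-8 bytes, no BOM, no charset declaration.
data = u"<p>Qu\u00e9bec</p>".encode("utf-8")

# Tell the parser the encoding explicitly instead of letting it guess.
parser = lxml.html.HTMLParser(encoding="utf-8")
tree = lxml.html.parse(io.BytesIO(data), parser)

p = tree.xpath(".//p")[0]
text = p.text.strip()

print(len(text))                                # 6
print(isinstance(p, lxml.html.HtmlElement))     # True
print(p.text_content() == text)                 # True
```

With a real file, the first argument would simply be the file name, as in the call shown in the answer.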
