urllib2/lxml encoding problem

Posted on 2024-10-07 02:04:32


I'm new to Python, and I'm trying to use urllib2/lxml to fetch and parse a page. Everything seems to work fine, except that the parsed page, when opened in my browser, seems to have strange characters embedded in it. I'm guessing this is a unicode/lxml parsing problem. When I get the text content of an element using .text_content() and print it, I get stuff like "sometext \342\200\223 moretext"; in the original page, this shows as "sometext – moretext".

Could anyone tell me:
1. what's going on?
2. how do I fix it?
3. where can I read up on encoding issues like these?

Thanks!


Comments (2)

别闹i 2024-10-14 02:04:32


What is going on is that the website is using an "en dash", which is a slightly longer dash (and the one you should use in ranges, like 40–56, really. Yeah, dashes are a whole science unto themselves).

In Unicode, the en dash has codepoint U+2013. The numbers you get, \342\200\223, are the octal representation of the UTF-8 encoding of that codepoint. Why you get octal I don't know; I get hex, so on my computer it looks like '\xe2\x80\x93'. But that makes no difference, that's just the representation. The numbers are the same.
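
A quick way to confirm this (a Python 2 shell is assumed here, since urllib2 is being used): the en dash U+2013 encodes to three UTF-8 bytes, and hex and octal are just two ways of printing the same values.

dash = u'\u2013'.encode('utf-8')
print repr(dash)                    # '\xe2\x80\x93'
print [oct(ord(b)) for b in dash]   # ['0342', '0200', '0223']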

What you probably should do is decode the HTML string you get to unicode as early as possible. The headers you get back when you fetch the page should tell you what encoding it uses (although it's apparently UTF-8 here). It's fairly easy to extract that data from the headers; you'll see it when you print them out.

You then decode the HTML data:

htmldata = htmldata.decode(encoding)   # 'encoding' is whatever the headers said, e.g. 'utf-8'
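
Putting it together, a minimal sketch (Python 2 is assumed since urllib2 is used; the URL is a placeholder, and falling back to 'utf-8' is an assumption):

import urllib2

response = urllib2.urlopen('http://example.com/')
# the charset parameter of the Content-Type header, if the server sent one
encoding = response.info().getparam('charset') or 'utf-8'
htmldata = response.read().decode(encoding)   # now a unicode string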
萌无敌 2024-10-14 02:04:32


You'll mainly need to be mindful of unicode issues at two points in the process:

  1. Get the response into a unicode string (this is explained nicely in other answers on SO)
  2. Specify a suitable encoding when outputting strings

--

from lxml import etree

# from an lxml etree: serialize to UTF-8 encoded bytes
etree.tostring(root, encoding='utf-8', xml_declaration=False)

# from a unicode string: encode explicitly
x.encode('utf-8')
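
An end-to-end sketch covering both points (Python 2 with urllib2/lxml is assumed, as above; the URL, the output filename, and the UTF-8 fallback are assumptions):

import urllib2
from lxml import html

response = urllib2.urlopen('http://example.com/')
encoding = response.info().getparam('charset') or 'utf-8'
doc = html.fromstring(response.read().decode(encoding))   # 1. unicode in

text = doc.text_content()               # unicode string; the en dash stays U+2013
with open('out.txt', 'wb') as f:
    f.write(text.encode('utf-8'))       # 2. pick an encoding explicitly on output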