urllib2/lxml encoding problem

Posted on 2024-10-07 02:04:32

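I am a Python newbie trying to use urllib2/lxml to fetch and parse a page. Everything seems to work fine, except that the parsed page, when opened in my browser, seems to have strange characters embedded in it. I'm guessing this is a unicode/lxml parsing issue. When I grab the text content of an element using .text_content() and print it, I get things like "sometext \342\200\223 moretext"; in the original page this shows up as "sometext - moretext".

Can someone tell me:
1. What's going on?
2. How do I fix it?
3. Where can I read up on these sorts of encoding issues?

Thanks!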


别闹i 2024-10-14 02:04:32

What is going on is that the website is using an en dash, which is a slightly longer dash (and the one you really should use in ranges, like 40–56. Yes, dashes are a whole science unto themselves).

In Unicode, the en dash has code point U+2013. The numbers you get, \342\200\223, are the octal representation of the UTF-8 encoding of that code point. Why you get octal I don't know; I get hex, so on my computer it looks like '\xe2\x80\x93'. But that makes no difference, that's just the representation. The bytes are the same.
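To make the octal/hex point concrete, here is a minimal sketch. It is written for Python 3 (an assumption on my part; the question uses Python 2's urllib2), where bytes and text are separate types, and the variable names are only illustrative.

```python
import unicodedata

# The same three bytes, written once with octal escapes and once with hex escapes.
octal_form = b"\342\200\223"
hex_form = b"\xe2\x80\x93"
print(octal_form == hex_form)  # the two spellings describe identical bytes

# Decoding those bytes as UTF-8 gives a single character.
dash = octal_form.decode("utf-8")
print(len(octal_form), len(dash))  # byte count versus character count
print(hex(ord(dash)))              # the code point as a hex number
print(unicodedata.name(dash))      # the official Unicode name of the character
```

So the three escaped numbers are one character stored as three UTF-8 bytes; only the notation used to display the bytes differs.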

What you probably should do is decode the HTML string you get to unicode as early as possible. The headers you get back when you fetch the page should tell you what encoding it uses (although it's apparently UTF-8 here). It's fairly easy to extract that from the headers; you'll see it when you print them out.

You then decode the html data:

htmldata = htmldata.decode(<the encoding you found in the headers>)
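As a rough sketch of that step, the snippet below pulls the charset out of a Content-Type header value and uses it to decode the body. It is written for Python 3 and avoids any network access; `decode_body`, its `fallback` parameter, and the sample header and bytes are all made up for illustration and are not part of urllib2 or lxml.

```python
from email.message import Message


def decode_body(raw_bytes, content_type, fallback="utf-8"):
    """Decode response bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `fallback` when no charset is given.
    """
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter, or None if absent.
    charset = header.get_content_charset() or fallback
    return raw_bytes.decode(charset)


# Sample data standing in for a fetched page and its Content-Type header.
htmldata = b"<p>sometext \xe2\x80\x93 moretext</p>"
text = decode_body(htmldata, "text/html; charset=UTF-8")
print(repr(text))
```

Falling back to UTF-8 when the header names no charset is a guess, not a rule; a page may also declare its encoding in a meta tag, which this sketch ignores.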


萌无敌 2024-10-14 02:04:32

You'll mainly need to be mindful of unicode issues at two points in the process:

Getting the response into a unicode string (nicely explained elsewhere on SO)
Specifying a suitable encoding when outputting strings

#  from an lxml etree
etree.tostring(root, encoding='utf-8', xml_declaration=False)

# from a unicode string
x.encode('utf-8')
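Putting the two output calls together, a small round trip might look like the sketch below. It is written for Python 3 and assumes a tiny made-up document; if lxml is not installed it falls back to the standard library's xml.etree.ElementTree (3.8 or later) as a stand-in, since its `tostring` accepts the same `encoding` and `xml_declaration` arguments for this simple case.

```python
try:
    from lxml import etree  # the library this thread is about
except ImportError:
    # Stand-in so the sketch runs without lxml (an assumption, not the answer's setup).
    import xml.etree.ElementTree as etree

# Text that has already been decoded to unicode; \u2013 is the en dash.
text = "<p>sometext \u2013 moretext</p>"
root = etree.fromstring(text)

# From a tree: serialize to UTF-8 bytes without an XML declaration.
serialized = etree.tostring(root, encoding="utf-8", xml_declaration=False)
print(repr(serialized))

# From a unicode string: encode explicitly when writing it out.
encoded = root.text.encode("utf-8")
print(repr(encoded))
```

Both results are byte strings, so they can be written to a file opened in binary mode without any implicit encoding step.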


About the author: 非要怀念
