urllib2/lxml encoding problem
I'm new to Python and am trying to use urllib2/lxml to fetch and parse a page. Everything seems to work fine, except that the parsed page, when opened in my browser, seems to have strange characters embedded in it. I'm guessing this is a Unicode/lxml parsing problem. When I get the text content of an element using .text_content() and print it, I get output like "sometext \342\200\223 moretext". In the original page, this shows as "sometext - moretext".
Could anyone tell me:
1. what's going on?
2. how do I fix it?
3. where can I read up on encoding issues like these?
Thanks!
Comments (2)
What is going on is that the website is using an en dash, which is a slightly longer dash (and really the one you should use in ranges, like 40–56. Yes, dashes are a whole science unto themselves).
In Unicode, the en dash has code point U+2013. The numbers you get, \342\200\223, are the octal representation of the UTF-8 encoding of that code point. I don't know why you get octal; I get hex, so on my computer it looks like '\xe2\x80\x93'. That makes no difference, though: it is just the representation, and the byte values are the same.
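To make the "same bytes, different notation" point concrete, here is a small sketch (not part of the original answer) that compares the octal and hex spellings and round-trips the en dash through UTF-8. The variable names are only illustrative.

```python
# The same three bytes, written with octal escapes and with hex escapes.
octal_form = b"\342\200\223"
hex_form = b"\xe2\x80\x93"
assert octal_form == hex_form  # only the notation differs

# U+2013 (en dash) encoded as UTF-8 yields exactly those bytes.
en_dash = u"\u2013"
encoded = en_dash.encode("utf-8")
assert encoded == hex_form

# Decoding the bytes gives back the single en dash character.
decoded = hex_form.decode("utf-8")
assert decoded == en_dash
```

So the escaped text in the printed output is just the UTF-8 bytes of one character that has not yet been decoded.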
What you probably should do is decode the HTML string you get to unicode as early as possible. The headers you get back when you fetch the page should tell you what encoding it uses (although it's apparently UTF-8 here). It's fairly easy to extract that from the headers; you'll see it when you print them out.
You then decode the HTML data:
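The original snippet did not survive here, so the following is only a minimal sketch of the idea, not the answerer's code. It assumes you already have the raw response bytes and the value of the Content-Type header as a string. The helper name `decode_html` and the UTF-8 fallback are hypothetical choices made for illustration.

```python
from email.message import Message


def decode_html(raw_bytes, content_type, default="utf-8"):
    """Decode fetched bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter, or `default`.
    charset = msg.get_content_charset(default)
    return raw_bytes.decode(charset)


# Stand-in for the bytes a fetch would return (no network access needed).
raw = b"<p>sometext \xe2\x80\x93 moretext</p>"
html = decode_html(raw, "text/html; charset=UTF-8")
print(repr(html))
```

Once the data is a unicode string, it can be handed to the parser, and the en dash travels through as a single character rather than three escaped bytes.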
You'll mainly need to be mindful of Unicode issues at two points in the process:
--