Encoding error when parsing RSS with lxml
I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.
import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)
But I get an error:
tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
Comments (3)
I ran into a similar problem, and it turns out this has NOTHING to do with encodings. lxml is raising a totally unrelated error: the .parse function expects a filename, a URL, or a file-like object, not a string containing the document itself. When lxml then tries to build the message for that error, it chokes on the non-ASCII characters and shows this completely confusing UnicodeDecodeError instead. It is highly unfortunate, and other people have commented on the issue here:
https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html
Luckily, yours is a very easy fix. Just replace .parse with .fromstring and you should be totally good to go:
Just tested this on my machine and it worked fine. Hope it helps!
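The replacement described above might look like the following. This is only a sketch, assuming Python 3 and a small inline feed standing in for the bytes returned by response.read(); the SAMPLE_RSS name and its contents are made up for illustration, so nothing is downloaded.

```python
from lxml import etree

# Hypothetical stand-in for the bytes returned by response.read().
SAMPLE_RSS = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<rss version="2.0"><channel>'
    '<title>Wiadomości</title>'
    '<item><title>Żółć</title></item>'
    '</channel></rss>'
).encode("utf-8")

# fromstring() parses document content that is already in memory,
# whereas parse() expects a filename, URL, or file-like object.
root = etree.fromstring(SAMPLE_RSS)

# Collect every <title> element in document order.
titles = [t.text for t in root.iter("title")]
print(titles)
```

With the real feed, the equivalent call would be etree.fromstring(response, parser), where response holds the downloaded bytes.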
It's often easier to get the string loaded and sorted out for the lxml library first, and then call fromstring on it, rather than rely on the lxml.etree.parse() function and its difficult-to-manage encoding options.
This particular rss file begins with the encoding declaration, so everything should just work:
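To illustrate the point about the declaration, here is a small sketch, assuming Python 3 and a made-up inline feed (feed_bytes) that is declared and encoded as ISO-8859-2. It is not the real download, and the encoding was picked only to make the effect visible.

```python
from lxml import etree

# Hypothetical sample: a feed declared and encoded as ISO-8859-2,
# a legacy encoding that covers Polish letters.
declared = "iso-8859-2"
feed_bytes = (
    '<?xml version="1.0" encoding="iso-8859-2"?>'
    '<rss version="2.0"><channel>'
    '<title>Wiadomości z kraju</title>'
    '</channel></rss>'
).encode(declared)

# No encoding argument: the parser reads the declaration in the prolog
# and decodes the bytes accordingly.
root = etree.fromstring(feed_bytes)
title = root.findtext("channel/title")
print(title)
```

Note that the input is bytes. lxml refuses a Python 3 str that still carries an encoding declaration, so the raw download should be passed through undecoded.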
The following code shows some of the variations you can apply to make etree parse different encodings. You can also ask it to write out a different encoding, which will then appear in the XML declaration at the top of the output.
Code can be tried here:
http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
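A few such variations, as a sketch rather than the code behind the link: it assumes Python 3, and the sample text and variable names are invented for illustration.

```python
from lxml import etree

# Hypothetical sample document with no encoding declaration at all.
text = "<rss><channel><title>Żółw</title></channel></rss>"

# 1. Bytes without a declaration are treated as UTF-8 by default.
root = etree.fromstring(text.encode("utf-8"))

# 2. Bytes in another encoding and no declaration:
#    tell the parser explicitly which encoding to use.
latin2_parser = etree.XMLParser(encoding="iso-8859-2")
root_latin2 = etree.fromstring(text.encode("iso-8859-2"), latin2_parser)

# 3. Serialising: ask for a specific output encoding; it is named
#    in the XML declaration of the result.
out = etree.tostring(root, encoding="iso-8859-2", xml_declaration=True)
print(out.splitlines()[0])
```

Passing encoding to XMLParser overrides whatever the document says about itself, so it is best kept for input whose declaration is missing or known to be wrong.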
You should probably only define the character encoding yourself as a last resort, since the encoding is already clear from the XML prolog (if not from the HTTP headers). In any case, it's unnecessary to pass the encoding to etree.XMLParser unless you want to override it, so get rid of the encoding parameter and it should work.
Edit: okay, the problem actually seems to be with lxml. The following works, for whatever reason:
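One way to act on this advice, as a sketch rather than the answer's own code: drop the encoding argument and hand parse() a file-like object instead of the bytes. It assumes Python 3, and fake_response is a made-up in-memory stand-in for the object returned by urlopen().

```python
import io

from lxml import etree

# Hypothetical stand-in for the HTTP response: any object with a
# read() method that yields the raw bytes of the feed.
fake_response = io.BytesIO(
    (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<rss version="2.0"><channel>'
        '<title>Wiadomości</title>'
        '</channel></rss>'
    ).encode("utf-8")
)

# No encoding override: the parser takes the encoding from the prolog.
parser = etree.XMLParser(ns_clean=True, recover=True)

# parse() accepts a file-like object, so the response is passed
# directly instead of the result of response.read().
tree = etree.parse(fake_response, parser)
title = tree.getroot().findtext("channel/title")
print(title)
```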