How to parse a GB2312-encoded RSS feed in Python
I have an RSS feed which is encoded in GB2312.
When I try to parse it using the following code:
    for item in XML.ElementFromURL(feed).xpath('//item'):
        title = item.find('title').text
it is not able to parse the feed.
Any idea how to parse a GB2312-encoded RSS feed?
After passing the encoding explicitly, as shown here, the Plex Media Server error log is as follows:
    for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
        title = item.find('title').text
***Error Log:***
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Zaobao.bundle\Contents\Code\__init__.py", line 24, in GetDetails
    for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 81, in ElementFromURL
    return self.ElementFromString(self._core.networking.http_request(url, values, headers, cacheTime, autoUpdate, encoding, errors, immediate=True, sleep=sleep, opener=self._opener, txn_id=self._txn_id).content, isHTML=isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 76, in ElementFromString
    return self._core.data.xml.from_string(string, isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\components\data.py", line 134, in from_string
    return etree.fromstring(markup)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
XMLSyntaxError: switching encoding: encoder error, line 1, column 36
2011-09-28 09:34:33,453 (9d0) : DEBUG (core) - Response: 404
Comments (2)
Your error message is
    XMLSyntaxError: switching encoding: encoder error, line 1, column 36
You asked for ideas. Here's a novel idea: tell us what is in the first 50 or so bytes of "line 1". Then somebody may be able to come up with a remedy.

Update: The encoding declaration is incorrect. The data is NOT encoded in gb2312. It's at least GBK, aka cp936. GB2312-80 (that's 80 as in the year 1980) is a limited character set. Chinese websites that are not using UTF-8 would be using at least the superset GBK (in use for well over 10 years) and moving to the super-superset GB18030 (which is itself a UTF).

I suggest that you try
    XML.ElementFromURL(feed, encoding='gbk')
If that works, you may wish to bullet-proof your code against this not-uncommon problem by reading the data with urllib, checking the encoding declaration for gb2312, and, if you find it, using gb18030 instead.

Update 2: In case anyone mentions chardet: because GBK uses the many unused slots in GB2312, and chardet neither looks at which slots are actually used nor verifies its answer with a trial decode, chardet guesses GB2312.

I assume you are using the Plex XML API. The documentation states that you can call
    XML.ElementFromURL(feed, encoding='gb2312')
if you know that this is really the encoding being used.

If the XML really is encoded with GB2312, then the declaration must be
    <?xml version="1.0" encoding="gb2312"?>
(or begin with a byte order mark, for UTF-16), otherwise the XML is invalid. If there is no encoding in the XML declaration, and no byte order mark, parsers must assume UTF-8 encoding by default, and therefore it is invalid to use any other character encoding for XML without an encoding in the declaration. Since not specifying the encoding produces an error for you, I think it is possible that the RSS feed is not valid XML.
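
The bullet-proofing idea from the first comment (read the raw bytes, look for gb2312 in the XML declaration, and decode as gb18030 instead) could be sketched roughly as follows. This is a minimal Python 3 sketch under stated assumptions, not Plex framework code: parse_mislabelled_feed and parse_feed_url are hypothetical helper names, the standard-library xml.etree.ElementTree stands in for the lxml parser that Plex uses, and the sample feed is made up for illustration. It uses the character 镕, which GBK can encode but GB2312 cannot.

```python
import re
from urllib.request import urlopen
from xml.etree import ElementTree as ET

# Captures the encoding name in an XML declaration at the very start of the data.
_DECL_RE = re.compile(rb'''^(<\?xml[^>]*?encoding=["'])([A-Za-z0-9._-]+)(["'])''')


def parse_mislabelled_feed(raw):
    """Parse feed bytes, treating a declared gb2312 as gb18030 (hypothetical helper)."""
    match = _DECL_RE.match(raw)
    if match and match.group(2).lower() == b'gb2312':
        # Decode with the superset codec, re-encode as UTF-8, and rewrite the
        # declaration so that it matches the bytes handed to the parser.
        raw = raw.decode('gb18030').encode('utf-8')
        raw = _DECL_RE.sub(rb'\g<1>utf-8\g<3>', raw, count=1)
    return ET.fromstring(raw)


def parse_feed_url(url):
    """Fetch a feed and parse it with the workaround above (hypothetical helper)."""
    with urlopen(url) as response:
        return parse_mislabelled_feed(response.read())


# Made-up sample: declared as gb2312 but actually encoded as GBK.
sample = (
    '<?xml version="1.0" encoding="gb2312"?>'
    '<rss><channel><item><title>朱镕基</title></item></channel></rss>'
).encode('gbk')

root = parse_mislabelled_feed(sample)
titles = [item.findtext('title') for item in root.iter('item')]
```

With a real feed, parse_feed_url(feed) would take the place of XML.ElementFromURL(feed), assuming urllib is usable in that environment. Feeds that declare any other encoding are passed to the parser untouched, so the workaround only affects the mislabelled case.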