How to parse a GB2312-encoded RSS feed in Python

Posted on 2024-12-06 14:51:20


I have an RSS feed which is encoded in GB2312.

When I try to parse it with the following code:

for item in XML.ElementFromURL(feed).xpath('//item'):
    title = item.find('title').text

It fails to parse the feed.

Any idea how to parse a GB2312-encoded RSS feed?

After passing the encoding explicitly, as shown below, Plex Media Server logs the following error:

for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
    title = item.find('title').text

***Error Log:***
>  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Zaobao.bundle\Contents\Code\__init__.py", line 24, in GetDetails
    for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 81, in ElementFromURL
    return self.ElementFromString(self._core.networking.http_request(url, values, headers, cacheTime, autoUpdate, encoding, errors, immediate=True, sleep=sleep, opener=self._opener, txn_id=self._txn_id).content, isHTML=isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 76, in ElementFromString
    return self._core.data.xml.from_string(string, isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\components\data.py", line 134, in from_string
    return etree.fromstring(markup)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
XMLSyntaxError: switching encoding: encoder error, line 1, column 36

2011-09-28 09:34:33,453 (9d0) :  DEBUG (core) - Response: 404


你在我安 2024-12-13 14:51:20

Your error message is XMLSyntaxError: switching encoding: encoder error, line 1, column 36. You asked for ideas. Here's a novel idea: Tell us what is in the first 50 or so bytes of "line 1". Then somebody may be able to come up with a remedy.

Update: The encoding declaration is incorrect. The data is NOT encoded in gb2312. It's at least GBK aka cp936. GB2312-80 (that's 80 as in the year 1980) is a limited character set. Chinese websites that are not using UTF-8 would be using at least the superset GBK (been in use for well over 10 years) and moving to the supersuperset GB18030 (which is itself a UTF). See below:

[Python 2.7.1]
>>> import urllib
>>> url = "http://www.zaobao.com/sp/sp.xml"
>>> data = urllib.urlopen(url).read()
>>> len(data)
10071
>>> data[:100]
'<?xml version="1.0" encoding="GB2312"?>\n\n<rss version="2.0"\n>\n\n<channel>\n<title>\xc1\xaa\xba\xcf\xd4\xe7\xb1\xa8\xcd\xf8 zaobao.co'
>>> x = data.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 1771-1772: illegal multibyte sequence
>>> data[1771:1773]
'\x95N'
>>> x = data.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 80: invalid start byte
>>> x = data.decode('gbk')
>>> y = data.decode('cp936')
>>> x == y
True

I suggest that you try XML.ElementFromURL(feed, encoding='gbk').

If that works, you may wish to bullet-proof your code against this not-uncommon problem by reading the data with urllib, checking the declared encoding, and decoding with gb18030 whenever the declaration says gb2312.
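A minimal sketch of that bullet-proofing idea, under a few assumptions: it is written for Python 3 (the session above is Python 2.7), it works on bytes you have already fetched, and the helper name `to_utf8_xml` and the `_GB_FAMILY` set are made up for illustration. It reads the encoding named in the XML declaration, decodes with gb18030 whenever a GB-family label is declared, and re-encodes as UTF-8 with a matching declaration so that an ordinary XML parser accepts the result.

```python
import re
import xml.etree.ElementTree as ET

# Captures the encoding name from an XML declaration at the very start of the data.
_DECL_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')

# Hypothetical set of labels that are often attached to data that is really GBK/GB18030.
_GB_FAMILY = {"gb2312", "gbk", "cp936", "gb18030"}


def to_utf8_xml(data):
    """Return the feed re-encoded as UTF-8 bytes with a matching declaration."""
    match = _DECL_RE.match(data)
    # With no encoding declaration (and no byte order mark), XML defaults to UTF-8.
    declared = match.group(1).decode("ascii").lower() if match else "utf-8"
    # Decode with the superset codec whenever a GB-family label is declared.
    codec = "gb18030" if declared in _GB_FAMILY else declared
    text = data.decode(codec)
    if match:
        # Rewrite the declaration so it agrees with the new byte encoding.
        text = re.sub(r'encoding=["\'][^"\']+["\']', 'encoding="UTF-8"', text, count=1)
    return text.encode("utf-8")


# In-memory sample: declared as GB2312 but containing the GBK-only bytes '\x95N'.
sample = (b'<?xml version="1.0" encoding="GB2312"?>\n'
          b'<rss version="2.0"><channel><item>'
          b'<title>\xc1\xaa\xba\xcf\xd4\xe7\xb1\xa8 \x95N</title>'
          b'</item></channel></rss>')

root = ET.fromstring(to_utf8_xml(sample))
titles = [item.findtext("title") for item in root.iter("item")]
```

In a Plex plug-in the converted bytes could presumably be handed to XML.ElementFromString, which appears in the traceback in the question, but I have not tested that, and the Plex framework of that era runs Python 2, so the sketch would need adapting.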

Update 2: In case anyone mentions chardet: GBK fills many of the slots that GB2312 leaves unused, and chardet neither checks which slots are actually used nor verifies its answer with a trial decode, so chardet guesses GB2312.
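The trial decode mentioned here can be sketched in a few lines. This is only an illustration for Python 3; the function name `first_working_codec` and the candidate list are assumptions, not part of chardet or any other library.

```python
def first_working_codec(data, candidates):
    """Return the first codec name that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the bytes, so try the next candidate.
            continue
        return name
    return None


# Bytes taken from the feed above: GB2312 text followed by the GBK-only pair '\x95N'.
guess = first_working_codec(b"\xc1\xaa\xba\xcf \x95N",
                            ["utf-8", "gb2312", "gbk", "gb18030"])
```

The order of the candidates matters: a permissive codec such as gbk accepts many byte strings that were never meant as GBK, so stricter candidates should come first, and a successful decode is evidence rather than proof.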


夏见 2024-12-13 14:51:20

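I assume you are using the Plex XML API. Its documentation states that you can call XML.ElementFromURL(feed, encoding='gb2312') if you know that this really is the encoding being used.

If the XML really is encoded in GB2312, the XML declaration must name that encoding, for example <?xml version="1.0" encoding="GB2312"?> (or the document must begin with a byte order mark, in the case of UTF-16); otherwise the XML is not valid. If the XML declaration has no encoding and there is no byte order mark, parsers must assume UTF-8, so using any other character encoding without an encoding in the declaration is invalid. Since not specifying the encoding produces an error, I think the RSS feed may not be valid XML.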