蟒蛇 +外籍人士：� 上出现错误实体

发布于 2024-09-06 06:22:17 字数 2025 浏览 11 评论 0原文

我编写了一个小函数，它使用 ElementTree 和 xpath 提取 xml 文件中某些元素的文本内容：

#!/usr/bin/env python2.5

import doctest
from xml.etree import ElementTree
from StringIO import StringIO

def parse_xml_etree(sin, xpath):
  """
Takes as input a stream containing XML and an XPath expression.
Applies the XPath expression to the XML and returns a generator
yielding the text contents of each element returned.

>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem1').next()
'one'
>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem2').next()
'two'
>>> parse_xml_etree(
...   StringIO('<test><null>&#0;</null><elem3>three</elem3></test>'),
...   '//elem2').next()
'three'
"""

  tree = ElementTree.parse(sin)
  for element in tree.findall(xpath):
    yield element.text  

if __name__ == '__main__':
  doctest.testmod(verbose=True)

第三次测试失败，但出现以下异常：

ExpatError: 引用无效字符数：第 1 行，第 13 列

是 < code>� 实体非法 XML？不管是否，我想要解析的文件都包含它，我需要某种方法来解析它们。对于除 Expat 之外的其他解析器或 Expat 设置有什么建议，可以让我做到这一点吗？

更新：我刚刚发现 BeautifulSoup ，这是一个标签汤解析器，如下面答案评论中所述，为了好玩，我回到了这个问题，并尝试在 ElementTree 之前将其用作 XML 清理器，但它尽职尽责地将  转换为无效的空字节。 :-)

cleaned_s = StringIO(
  BeautifulStoneSoup('<test><null>&#0;</null><elem3>three</elem3></test>',
                     convertEntities=BeautifulStoneSoup.XML_ENTITIES
  ).renderContents()
)
tree = ElementTree.parse(cleaned_s)

... 产量

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12

在我的特殊情况下，我并不真正需要 XPath 解析，我可以使用 BeautifulSoup 本身及其相当简单的节点寻址风格 parsed_tree.test.elem1.contents[ 0]。

原文

I have written a small function, which uses ElementTree and xpath to extract the text contents of certain elements in an xml file:

#!/usr/bin/env python2.5

import doctest
from xml.etree import ElementTree
from StringIO import StringIO

def parse_xml_etree(sin, xpath):
  """
Takes as input a stream containing XML and an XPath expression.
Applies the XPath expression to the XML and returns a generator
yielding the text contents of each element returned.

>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem1').next()
'one'
>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem2').next()
'two'
>>> parse_xml_etree(
...   StringIO('<test><null>�</null><elem3>three</elem3></test>'),
...   '//elem2').next()
'three'
"""

  tree = ElementTree.parse(sin)
  for element in tree.findall(xpath):
    yield element.text  

if __name__ == '__main__':
  doctest.testmod(verbose=True)

The third test fails with the following exception:

ExpatError: reference to invalid character number: line 1, column 13

Is the � entity illegal XML? Regardless whether it is or not, the files I want to parse contain it, and I need some way to parse them. Any suggestions for another parser than Expat, or settings for Expat, that would allow me to do that?

Update: I discovered BeautifulSoup just now, a tag soup parser as noted below in the answer comment, and for fun I went back to this problem and tried to use it as an XML-cleaner in front of ElementTree, but it dutifully converted the � into a just-as-invalid null byte. :-)

cleaned_s = StringIO(
  BeautifulStoneSoup('<test><null>�</null><elem3>three</elem3></test>',
                     convertEntities=BeautifulStoneSoup.XML_ENTITIES
  ).renderContents()
)
tree = ElementTree.parse(cleaned_s)

... yields

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12

In my particular case though, I didn't really need the XPath parsing as such, I could have gone with BeautifulSoup itself and its quite simple node adressing style parsed_tree.test.elem1.contents[0].

分享到QQ

分享到微博