Python + Expat: Error on &#0; entities

Posted on 2024-09-06 06:22:17

I have written a small function that uses ElementTree and XPath to extract the text contents of certain elements in an XML file:

#!/usr/bin/env python2.5

import doctest
from xml.etree import ElementTree
from StringIO import StringIO

def parse_xml_etree(sin, xpath):
  """
Takes as input a stream containing XML and an XPath expression.
Applies the XPath expression to the XML and returns a generator
yielding the text contents of each element returned.

>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem1').next()
'one'
>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem2').next()
'two'
>>> parse_xml_etree(
...   StringIO('<test><null>&#0;</null><elem3>three</elem3></test>'),
...   '//elem3').next()
'three'
"""

  tree = ElementTree.parse(sin)
  for element in tree.findall(xpath):
    yield element.text  

if __name__ == '__main__':
  doctest.testmod(verbose=True)

The third test fails with the following exception:

ExpatError: reference to invalid character number: line 1, column 13

Is the &#0; entity illegal XML? Regardless of whether it is or not, the files I want to parse contain it, and I need some way to parse them. Any suggestions for a parser other than Expat, or for Expat settings, that would allow me to do that?


Update: I just discovered BeautifulSoup, a tag soup parser, as noted in the comment on the answer below. For fun, I went back to this problem and tried to use it as an XML cleaner in front of ElementTree, but it dutifully converted the &#0; into a just-as-invalid null byte. :-)

from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3.x

cleaned_s = StringIO(
  BeautifulStoneSoup('<test><null>&#0;</null><elem3>three</elem3></test>',
                     convertEntities=BeautifulStoneSoup.XML_ENTITIES
  ).renderContents()
)
tree = ElementTree.parse(cleaned_s)

... yields

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12
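
Both failures can be reproduced with the standard library alone. The sketch below assumes Python 3 rather than the Python 2.5 used in the question, and `parse_error` is a hypothetical helper name introduced only for illustration. It suggests that Expat rejects the character reference and the literal null byte alike, so converting one into the other cannot help.

```python
# Assumption: Python 3, standard library only.
import xml.etree.ElementTree as ET

def parse_error(data):
    """Return the parser's error message for data, or None if it parses."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return str(exc)
    return None

# The character reference from the question.
print(parse_error(b"<test><null>&#0;</null></test>"))
# The literal null byte that the cleaning step produced.
print(parse_error(b"<test><null>\x00</null></test>"))
# Well-formed input for comparison.
print(parse_error(b"<test><null>ok</null></test>"))
```") is not None
assert parse_error(b"<test><null>\x00</null></test>") is not None
assert parse_error(b"<test><null>ok</null></test>") is None
</test>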

In my particular case, though, I didn't really need XPath parsing as such; I could have used BeautifulSoup itself and its quite simple node addressing style, parsed_tree.test.elem1.contents[0].

Comments (2)

醉南桥 2024-09-13 06:22:17

&#0; is not in the legal character range defined by the XML spec. Alas, my Python skills are pretty rudimentary, so I'm not much help there.
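
For reference, the legal range in XML 1.0 is given by the Char production: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. A minimal Python sketch of that check follows; the function name `is_xml10_char` is made up for this example. (XML 1.1 permits more control characters, but as far as I know it still excludes code point 0.)

```python
def is_xml10_char(codepoint):
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        codepoint in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )

print(is_xml10_char(0))         # the code point behind &#0;
print(is_xml10_char(ord("A")))  # an ordinary letter
```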

蓝梦月影 2024-09-13 06:22:17

&#0; is not a valid XML character. Ideally, you'd be able to get the creator of the file to change their process so that the file was not invalid like this.

If you must accept these files, you could pre-process them to turn &#0; into something else. For example, pick @ as an escape character, turn "@" into "@@", and "&#0;" into "@0".

Then, as you get the text data from the parser, you can reverse the mapping. This is just an example; you can invent any escaping syntax you like.
