XML parsing in Python with expat: handling character data
I am attempting to parse an XML file using python expat. I have the following line in my XML file:
<Action>&lt;fail/&gt;</Action>
expat identifies the start and end tags but converts the &lt; to the less than character and the same for the greater than character and thus parses it like this:
outcome:
START 'Action'
DATA '<'
DATA 'fail/'
DATA '>'
END 'Action'
instead of the desired:
START 'Action'
DATA '<fail/>'
END 'Action'
I would like to have the desired outcome, how do I prevent expat from messing up?
Comments (2)
expat does not mess up: &lt; is simply the XML encoding for the character <. Quite to the contrary, if expat returned the literal &lt;, this would be a bug with respect to the XML spec. That being said, you can of course get the escaped version back by using xml.sax.saxutils.escape.
The expat parser is also free to report character data in whatever chunks it sees fit, so you have to concatenate them yourself.
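As a minimal sketch of that suggestion: the chunk list below is copied by hand from the output shown in the question (it is not produced by a real parse), and the variable names are only illustrative. It joins the chunks and then re-escapes the result with xml.sax.saxutils.escape.

```python
from xml.sax.saxutils import escape

# Chunks as reported to the character data handler in the question.
chunks = ["<", "fail/", ">"]

# Concatenate the pieces to recover the decoded text of the element.
text = "".join(chunks)

# Re-escape the XML special characters (&, < and >) to get the
# form that appeared in the source file.
escaped = escape(text)

print(text)     # <fail/>
print(escaped)  # &lt;fail/&gt;
```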
Both SAX and StAX parsers are free to break up the strings in whatever way is convenient for them (although StAX has a COALESCE mode for forcing it to assemble the pieces for you).
The reason is that in many cases software can process the text as a stream, and then it should not have to pay the overhead of reassembling the string fragments.
Usually I accumulate text in a variable, and use the contents when I see the next StartElement or EndElement event. At that point, I also reset the accumulated-text variable to empty.
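A rough sketch of that accumulate-and-flush approach with Python's xml.parsers.expat might look like the following. The function name parse_events and the (kind, value) event tuples are made up for illustration; only ParserCreate, the three handler attributes and Parse come from the expat module itself.

```python
import xml.parsers.expat


def parse_events(xml_text):
    """Return (kind, value) events with adjacent text chunks merged."""
    events = []
    pieces = []  # text chunks seen since the last start/end tag

    def flush():
        # Emit the accumulated text as one DATA event, then reset.
        if pieces:
            events.append(("DATA", "".join(pieces)))
            pieces.clear()

    def start(name, attrs):
        flush()
        events.append(("START", name))

    def end(name):
        flush()
        events.append(("END", name))

    def data(chunk):
        # expat may call this several times for one run of text.
        pieces.append(chunk)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = data
    parser.Parse(xml_text, True)  # True marks the final chunk of input
    return events


events = parse_events("<Action>&lt;fail/&gt;</Action>")
for kind, value in events:
    print(kind, repr(value))
# START 'Action'
# DATA '<fail/>'
# END 'Action'
```

Because the text is joined only when a tag boundary is reached, the result does not depend on how the parser happens to split the character data. The Python expat parser object also has a buffer_text attribute that asks it to do this buffering itself, which may be enough for simple cases.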