Python: reporting the original line/column of XML nodes
I'm currently using xml.dom.minidom to parse some XML in Python. After parsing, I do some reporting on the content, and I would like to report the line (and column) where each tag started in the source XML document, but I don't see how that's possible.
I'd like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin info, I can do that -- ideal in that case would be using SAX to track node location, but still end up with a DOM for my post-processing.
Any suggestions on how to do this? Hopefully I'm just overlooking something in the docs and this is extremely easy.
Comments (2)
By monkeypatching the minidom content handler I was able to record line and column number for each node (as the 'parse_position' attribute). It's a little dirty, but I couldn't see any "officially sanctioned" way of doing it :) Here's my test script:
It outputs the following:
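A minimal sketch of the monkeypatching idea described above, not necessarily the original test script. The helper name parse_with_positions and the sample document are made up for illustration. It assumes CPython's expat-based SAX reader and leans on two internals that are not public API: the elementStack list on minidom's content handler and the parser's private _parser attribute (the pyexpat object, which exposes CurrentLineNumber and CurrentColumnNumber).

```python
import xml.sax
from xml.dom import minidom


def parse_with_positions(xml_text):
    """Parse xml_text with minidom and tag each element with parse_position.

    Hypothetical helper: relies on minidom/expat internals (see comments).
    """
    parser = xml.sax.make_parser()
    original_set_content_handler = parser.setContentHandler

    def set_content_handler(dom_handler):
        # Register minidom's own handler as usual, then wrap its start callback.
        original_set_content_handler(dom_handler)
        original_start = dom_handler.startElementNS

        def start_element_ns(name, tag_name, attrs):
            original_start(name, tag_name, attrs)
            # Internal detail: the handler pushes the new element on elementStack.
            element = dom_handler.elementStack[-1]
            # Internal detail: _parser is the underlying pyexpat parser object.
            expat = parser._parser
            element.parse_position = (
                expat.CurrentLineNumber,    # 1-based line
                expat.CurrentColumnNumber,  # 0-based column
            )

        dom_handler.startElementNS = start_element_ns

    # Shadow the method on this parser instance only.
    parser.setContentHandler = set_content_handler
    return minidom.parseString(xml_text, parser)


if __name__ == "__main__":
    doc = "<File>\n  <name>Name</name>\n  <pos>./</pos>\n</File>\n"
    root = parse_with_positions(doc).documentElement
    elements = [root] + [c for c in root.childNodes
                         if c.nodeType == c.ELEMENT_NODE]
    for element in elements:
        line, column = element.parse_position
        print("'{0}' at {1}:{2}".format(element.tagName, line, column))
```

Because the patch is applied to a single parser instance, other users of xml.sax in the same process are unaffected. Only elements get a position here; text nodes and comments would need their own wrapped callbacks.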
A different way to hack around the problem is by patching line number information into the document before parsing it. Here's the idea:
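A rough sketch of this pre-patching idea, assuming the whole document is available as a string. The function names, the value chosen for LINE_DUMMY_ATTR, and the regular expression are illustrative choices, not taken from the original answer. The pattern only matches "<" followed by a name character, so end tags, comments, and processing instructions are left alone, but it can still hit tag-like text inside comments or CDATA sections.

```python
import re
from xml.dom import minidom

# Illustrative value; it must not clash with any real attribute in the input.
LINE_DUMMY_ATTR = "_DUMMY_LINE"

# "<" followed by a tag name: start tags, but not "</", "<!" or "<?".
_START_TAG = re.compile(r"<([A-Za-z_][\w.:-]*)")


def parse_with_line_numbers(xml_text):
    """Insert a line-number attribute into every start tag, then parse."""
    patched = []
    for number, line in enumerate(xml_text.split("\n"), start=1):
        replacement = rf'<\1 {LINE_DUMMY_ATTR}="{number}"'
        patched.append(_START_TAG.sub(replacement, line))
    return minidom.parseString("\n".join(patched))


def line_of(element):
    """Read back the line number stored on an element by the patching step."""
    return int(element.getAttribute(LINE_DUMMY_ATTR))


if __name__ == "__main__":
    doc = "<File>\n  <name>Name</name>\n  <pos>./</pos>\n</File>\n"
    root = parse_with_line_numbers(doc).documentElement
    for element in root.getElementsByTagName("*"):
        print(element.tagName, line_of(element))
```

The attribute is inserted directly after the tag name, so a start tag that spans several lines is recorded with the line on which it begins.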
Then you can retrieve the line number of an element by reading the attribute named by LINE_DUMMY_ATTR back from it and converting the value to an integer.
Quite clearly, this approach has its own set of drawbacks, and if you really need column numbers too, patching those in will be somewhat more involved. Also, if you want to extract text nodes or comments, or use Node.toxml(), you'll have to make sure to strip LINE_DUMMY_ATTR out of any accidental matches there.
The one advantage of this solution over aknuds1's answer is that it does not require messing with minidom internals.