Python: reporting the original line/column of an XML node

Posted 2024-10-13 19:41:16

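I'm currently using xml.dom.minidom to parse some XML in Python. After parsing, I do some reporting on the content and would like to report the line (and column) where each tag starts in the source XML document, but I don't see how that's possible.

I'd like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin information, I can do that. Ideally, in that case, I'd use SAX to track node positions but still end up with a DOM for my post-processing.

Any suggestions on how to do this? Hopefully I'm just overlooking something in the docs and this is extremely easy.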

枉心 2024-10-20 19:41:16

By monkeypatching the minidom content handler I was able to record line and column number for each node (as the 'parse_position' attribute). It's a little dirty, but I couldn't see any "officially sanctioned" way of doing it :) Here's my test script:

from xml.dom import minidom
import xml.sax

doc = """\
<File>
  <name>Name</name>
  <pos>./</pos>
</File>
"""


def set_content_handler(dom_handler):
    # Wrap the DOM builder's startElementNS so that each new element is
    # tagged with the parser position at the moment its start tag is seen.
    def startElementNS(name, tagName, attrs):
        orig_start_cb(name, tagName, attrs)
        # The builder has just pushed the new element onto its stack.
        cur_elem = dom_handler.elementStack[-1]
        # parser._parser is the underlying expat parser (a private attribute).
        cur_elem.parse_position = (
            parser._parser.CurrentLineNumber,
            parser._parser.CurrentColumnNumber
        )

    orig_start_cb = dom_handler.startElementNS
    dom_handler.startElementNS = startElementNS
    orig_set_content_handler(dom_handler)

# Intercept setContentHandler so the handler minidom installs gets patched.
parser = xml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler

dom = minidom.parseString(doc, parser)
pos = dom.firstChild.parse_position
print("Parent: '{0}' at {1}:{2}".format(
    dom.firstChild.localName, pos[0], pos[1]))
for child in dom.firstChild.childNodes:
    # Text nodes (whitespace between tags) have no localName; skip them.
    if child.localName is None:
        continue
    pos = child.parse_position
    print("Child: '{0}' at {1}:{2}".format(child.localName, pos[0], pos[1]))

It outputs the following:

Parent: 'File' at 1:0
Child: 'name' at 2:2
Child: 'pos' at 3:2
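
For reuse, the same trick can be wrapped in a function so the patched parser does not live at module level. This is only a sketch of the script above repackaged: the names parse_with_positions and element_positions are hypothetical, and it still depends on the same private details (parser._parser and the builder's elementStack), so it may break with a different SAX driver or a future Python release.

```python
import xml.sax
from xml.dom import minidom


def parse_with_positions(xml_text):
    """Parse xml_text with minidom; each element gets a parse_position tuple.

    Hypothetical helper: it relies on private attributes of the expat SAX
    driver and of the pulldom builder, exactly like the script above.
    """
    parser = xml.sax.make_parser()
    orig_set_content_handler = parser.setContentHandler

    def set_content_handler(dom_handler):
        orig_start_cb = dom_handler.startElementNS

        def start_element_ns(name, tag_name, attrs):
            orig_start_cb(name, tag_name, attrs)
            expat = parser._parser  # underlying expat parser (private)
            # The builder has just pushed the new element onto its stack.
            dom_handler.elementStack[-1].parse_position = (
                expat.CurrentLineNumber,
                expat.CurrentColumnNumber,
            )

        dom_handler.startElementNS = start_element_ns
        orig_set_content_handler(dom_handler)

    # minidom calls parser.setContentHandler(); intercept that call.
    parser.setContentHandler = set_content_handler
    return minidom.parseString(xml_text, parser)


def element_positions(dom):
    """Return (tag name, line, column) for every element, in document order."""
    return [
        (el.tagName,) + el.parse_position
        for el in dom.getElementsByTagName("*")
    ]
```

Calling element_positions(parse_with_positions(doc)) on the test document should list the same three positions as the output above, with 1-based lines and 0-based columns.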

墟烟 2024-10-20 19:41:16

A different way to hack around the problem is by patching line number information into the document before parsing it. Here's the idea:

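import re
from xml.dom import minidom

LINE_DUMMY_ATTR = '_DUMMY_LINE'  # Make sure this string is unique!

def parseXml(filename):
    content = []
    with open(filename, 'r') as f:
        # Count lines from 1, as editors do.
        for lineno, line in enumerate(f, start=1):
            # Add the line number as an attribute to every opening tag on
            # this line. Note: \w+ does not cover prefixed names (ns:tag).
            content.append(re.sub(
                r'<(\w+)',
                r'<\1 {0}="{1}"'.format(LINE_DUMMY_ATTR, lineno),
                line))

    return minidom.parseString("".join(content))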

Then you can retrieve the line number of an element with

int(element.getAttribute(LINE_DUMMY_ATTR))

Quite clearly, this approach has its own drawbacks, and if you also need column numbers, patching those in will be somewhat more involved. Also, if you want to extract text nodes or comments, or use Node.toxml(), you'll have to strip LINE_DUMMY_ATTR out of any accidental matches there.
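
One way to keep the dummy attribute from leaking into serialized output is to copy the number onto the node and drop the attribute right after parsing. A minimal sketch, assuming the same regex injection as above: parse_string_with_lines and the source_line attribute are hypothetical names, it takes a string rather than a filename, and it does nothing about accidental matches inside comments or CDATA sections.

```python
import re
from xml.dom import minidom

LINE_DUMMY_ATTR = "_DUMMY_LINE"  # same marker as above; must not clash with real attributes


def parse_string_with_lines(xml_text):
    """Inject line numbers as attributes, parse, then move them off the DOM.

    Hypothetical helper: after it returns, each element carries a plain
    Python attribute source_line and no longer has the dummy XML attribute.
    """
    patched = []
    for lineno, line in enumerate(xml_text.splitlines(keepends=True), start=1):
        patched.append(re.sub(
            r"<(\w+)",
            r'<\1 {0}="{1}"'.format(LINE_DUMMY_ATTR, lineno),
            line))
    dom = minidom.parseString("".join(patched))

    for el in dom.getElementsByTagName("*"):
        if el.hasAttribute(LINE_DUMMY_ATTR):
            el.source_line = int(el.getAttribute(LINE_DUMMY_ATTR))
            el.removeAttribute(LINE_DUMMY_ATTR)
    return dom
```

With this variant, dom.toxml() no longer contains the marker on elements, while el.source_line stays available for reporting.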

The one advantage of this solution over aknuds1's answer (the monkeypatching approach above) is that it does not require messing with minidom internals.
