Is there a way to get a line number from an ElementTree Element

Posted on 2024-11-28 03:40:46

So I'm parsing some XML files using Python 3.2.1's cElementTree, and during the parsing I noticed that some of the tags were missing attribute information. I was wondering if there is any easy way of getting the line numbers of those Elements in the xml file.

Comments (5)

洋洋洒洒 2024-12-05 03:40:46

Took a while for me to work out how to do this using Python 3.x (using 3.3.2 here) so thought I would summarize:

import sys

# Force the pure-Python XML parser instead of the faster C accelerator,
# because we can't hook the C implementation. This must run before
# xml.etree.ElementTree is imported for the first time.
sys.modules['_elementtree'] = None
import xml.etree.ElementTree as ET

class LineNumberingParser(ET.XMLParser):
    # _start_list and _end are private hooks (tested on 3.3.2). Newer Python
    # versions (reportedly 3.8+) call _start instead of _start_list.
    def _start_list(self, *args, **kwargs):
        # Here we assume the default XML parser, which is expat,
        # and copy its element position attributes into output Elements
        element = super()._start_list(*args, **kwargs)
        element._start_line_number = self.parser.CurrentLineNumber
        element._start_column_number = self.parser.CurrentColumnNumber
        element._start_byte_index = self.parser.CurrentByteIndex
        return element

    def _end(self, *args, **kwargs):
        element = super()._end(*args, **kwargs)
        element._end_line_number = self.parser.CurrentLineNumber
        element._end_column_number = self.parser.CurrentColumnNumber
        element._end_byte_index = self.parser.CurrentByteIndex
        return element

# filename is the path of the XML file to parse
tree = ET.parse(filename, parser=LineNumberingParser())

陌上芳菲 2024-12-05 03:40:46

Looking at the docs, I see no way to do this with cElementTree.

However, I've had luck with lxml's version of the XML implementation.
It's supposed to be almost a drop-in replacement, built on libxml2, and its elements have a sourceline attribute (as well as giving you a lot of other XML features).

The only caveat is that I've only used it in Python 2.x - not sure how/if it works under 3.x - but it might be worth a look.

Addendum:
Their front page says:

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2
and libxslt. It is unique in that it combines the speed and XML
feature completeness of these libraries with the simplicity of a
native Python API, mostly compatible but superior to the well-known
ElementTree API. The latest release works with all CPython versions
from 2.3 to 3.2. See the introduction for more information about
background and goals of the lxml project. Some common questions are
answered in the FAQ.

So it looks like python 3.x is OK.
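
For illustration, here is a minimal sketch of the sourceline approach. It assumes lxml is installed (it is a third-party package); the sample document and the missing_id_lines name are made up for the example. It lists the source lines of item elements that lack an id attribute, which is roughly what the question asks for.

```python
from lxml import etree  # third-party package, assumed to be installed

# Hypothetical sample document: the second <item> has no "id" attribute.
xml_text = b"""<root>
  <item id="1"/>
  <item/>
</root>"""

root = etree.fromstring(xml_text)

# sourceline holds the 1-based line on which the element's start tag appeared.
missing_id_lines = [
    element.sourceline
    for element in root.iter("item")
    if "id" not in element.attrib
]
print(missing_id_lines)  # [3]
```

The same attribute is available on elements obtained from etree.parse() on a file.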

ゝ偶尔ゞ 2024-12-05 03:40:46

I've done this in ElementTree by subclassing ElementTree.XMLTreeBuilder. Wherever I have access to self._parser (the Expat parser), it exposes the properties _parser.CurrentLineNumber and _parser.CurrentColumnNumber.

http://docs.python.org/py3k/library/pyexpat.html?highlight=xml.parser#xmlparser-objects has details about these attributes

During parsing you could print out info, or put these values into the output XML element attributes.
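
A rough sketch of the "put these values into the output element attributes" idea follows. Note that it is a substitution, not the subclassing approach described above: it drives xml.parsers.expat directly and feeds an ElementTree.TreeBuilder, which avoids private hooks. The function name parse_with_line_numbers and the attribute name source_line are hypothetical, and no namespace processing is done (prefixed names are kept as written).

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def parse_with_line_numbers(xml_text, attr="source_line"):
    """Parse xml_text and record each element's start line in attribute `attr`."""
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()

    def on_start(tag, attrs):
        element = builder.start(tag, attrs)
        # Inside a handler, CurrentLineNumber (1-based) refers to the event
        # being handled, i.e. the line where this start tag begins.
        element.set(attr, str(parser.CurrentLineNumber))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return builder.close()


# Hypothetical sample document.
sample = "<root>\n  <a/>\n  <b>\n    <c/>\n  </b>\n</root>"
root = parse_with_line_numbers(sample)
for element in root.iter():
    print(element.tag, element.get("source_line"))
```

Storing the line as an ordinary XML attribute means it also works with the C-accelerated Element class, which does not accept arbitrary Python attributes; the trade-off is that the attribute shows up if the tree is serialized again.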

If your XML file includes additional XML files, you have to do some stuff that I don't remember and was not well documented to keep track of the current XML file.

谁与争疯 2024-12-05 03:40:46

One (hackish) way of doing this is by inserting a dummy-attribute holding the line number into each element, before parsing. Here's how I did this with minidom:

python reporting line/column of origin of XML node

This can be trivially adjusted to cElementTree (or in fact any other python XML parser).
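
As a rough idea of what such a preprocessing step could look like with ElementTree, here is a naive sketch. It is not the linked minidom code: the regex, the add_line_attributes helper and the _line attribute name are assumptions made for this example, and the regex will also rewrite anything that looks like a start tag inside comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Matches the "<name" that opens a start tag; "</", "<?" and "<!" do not match.
_START_TAG = re.compile(r"<([A-Za-z_][\w.:-]*)")


def add_line_attributes(xml_text, attr="_line"):
    """Return xml_text with attr="<line number>" added to every start tag."""
    numbered = []
    for lineno, line in enumerate(xml_text.split("\n"), start=1):
        tagged = _START_TAG.sub(
            lambda m, n=lineno: f'{m.group(0)} {attr}="{n}"', line
        )
        numbered.append(tagged)
    return "\n".join(numbered)


# Hypothetical sample document: the second <item> has no "id" attribute.
doc = "<root>\n  <item id='1'/>\n  <item/>\n</root>\n"
root = ET.fromstring(add_line_attributes(doc))
missing = [e.get("_line") for e in root.iter("item") if "id" not in e.attrib]
print(missing)  # ['3']
```

Because the attribute is inserted right after the tag name, start tags that span several lines still get the line on which they begin.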

記柔刀 2024-12-05 03:40:46

Another way to do this is to keep track of the lines as they are parsed, and use the ElementTree.iterparse method. The below code only returns one line at a time to the XML parser, and the listener can get the current line number. It doesn't help with the column, but given the OG question is about the line number, this works. You could also set the ending line number by listening for the "end" event and setting a different attribute, etc.

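from xml.etree import ElementTree

class XmlLineReader:
    """Iterates over an XML file line-by-line, keeping track of the current line."""
    def __init__(self, xml_file) -> None:
        self._iter = iter(xml_file)
        self._current_line = -1  # 0-based; -1 means nothing has been read yet

    @property
    def line(self):
        return self._current_line

    def read(self, *_):
        # iterparse calls read(size); ignore the size and hand over one line
        try:
            line = next(self._iter)
        except StopIteration:
            return ""  # an empty result tells iterparse the input is exhausted
        self._current_line += 1
        return line

# xml_file is an already opened file object for the XML document
source = XmlLineReader(xml_file)
# events must be a sequence of event names, hence the one-element tuple
events = ElementTree.iterparse(source, events=("start",))
for _, elem in events:
    elem.set("xml_lineno", str(source.line))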