如何在不解析整个文件的情况下获取树的根?

发布于 2024-10-19 13:50:02 字数 506 浏览 6 评论 0原文

我正在制作一个 xml 解析器来解析来自不同工具的 xml 报告,并且每个工具都会生成带有不同标签的不同报告。

例如:

Arachni 生成一个 xml 报告,其中 作为树根标记。

nmap 生成一个 xml 报告,其中 作为树根标记。

我试图不解析整个文件,除非它是来自我想要的任何工具的有效报告。

我首先想到使用的是ElementTree,解析整个xml文件(假设它包含有效的xml),然后根据树根检查报告是否属于Arachni或nmap。

我目前正在使用 cElementTree,据我所知 getroot() 在这里不是一个选项,但我的目标是使这个解析器仅处理已识别的文件,而不解析不必要的文件。

顺便说一下,我还在学习 xml 解析,提前致谢。

I'm making an xml parser to parse xml reports from different tools, and each tool generates different reports with different tags.

For example:

Arachni generates an xml report with <arachni_report></arachni_report> as tree root tag.

nmap generates an xml report with <nmaprun></nmaprun> as tree root tag.

I'm trying not to parse the entire file unless it's a valid report from any of the tools I want.

First thing I thought to use was ElementTree, parse the entire xml file (supposing it contains valid xml), and then check based on the tree root if the report belongs to Arachni or nmap.

I'm currently using cElementTree, and as far as I know getroot() is not an option here, but my goal is to make this parser to operate with recognized files only, without parsing unnecessary files.

By the way, I'm Still learning about xml parsing, thanks in advance.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

夕色琉璃 2024-10-26 13:50:02

“简单的字符串方法”是万恶之源(双关语)——请参见下面的示例。

更新 2 代码和输出现在表明建议的正则表达式也不能很好地工作。

使用元素树。您正在寻找的函数是iterparse。启用“开始”事件。第一次迭代就出局。

代码:

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re

xml_text_1 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
"""

xml_text_2 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
<!--
That's all, folks! 
-->
"""

xml_text_3 = '''<?xml version="1.0" ?>
<!-- <mole1> -->
<root><foo /></root>
<!-- </mole2> -->'''

xml_text_4 = '''<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'''

for xml_text in (xml_text_1, xml_text_2, xml_text_3, xml_text_4):
    print
    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x:]
    print "*** eyquem 1:", repr(lastline.strip())

    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = xml_text[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print "*** eyquem 2:", repr(chrstr[x+1:])
    else:
        print "*** eyquem 2:", repr(lastline)

    m = None
    for m in re.finditer('^</[^>]+>', xml_text, re.MULTILINE):
        pass
    if m: print "*** eyquem 3:", repr(m.group())
    else: print "*** eyquem 3:", "FAIL"

    m = None
    for m in re.finditer('</[^>]+>', xml_text):
        pass
    if m: print "*** eyquem 4:", repr(m.group())
    else: print "*** eyquem 4:", "FAIL"

    m = re.search('^<(?![?!])[^>]+>', xml_text, re.MULTILINE)
    if m: print "*** eyquem 5:", repr(m.group())
    else: print "*** eyquem 5:", "FAIL"

    m = re.search('<(?![?!])[^>]+>', xml_text)
    if m: print "*** eyquem 6:", repr(m.group())
    else: print "*** eyquem 6:", "FAIL"

    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "*** parse:", tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "*** iterparse:", elem.tag
        break

以上 ElementTree 相关代码适用于 Python 2.5 至 2.7。可与 Python 2.2 至 2.4 配合使用;您只需要从 effbot.org 获取 ElementTree 和 cElementTree 并进行一些条件导入。应该适用于任何 lxml 版本。

输出:

*** eyquem 1: '>'
*** eyquem 2: '>'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '-->'
*** eyquem 2: '-->'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '<!-- </mole2> -->'
*** eyquem 2: '<root><foo /></root>'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: '<root>'
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

*** eyquem 1: '>'
*** eyquem 2: '<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: FAIL
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

更新 1 以上是演示代码。下面更像是实现代码...只需添加异常处理。使用 Python 2.7 和 2.2 进行测试。

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import cElementTree as ET

def get_root_tag_from_xml_file(xml_file_path):
    result = f = None
    try:
        f = open(xml_file_path, 'rb')
        for event, elem in ET.iterparse(f, ('start', )):
            result = elem.tag
            break
    finally:
        if f: f.close()
    return result

if __name__ == "__main__":
    import sys, glob
    for pattern in sys.argv[1:]:
        for filename in glob.glob(pattern):
            print filename, get_root_tag_from_xml_file(filename)

"simple string methods" are the root [pun intended] of all evil -- see examples below.

Update 2 Code and output now show that proposed regexes also don't work very well.

Use ElementTree. The function that you are looking for is iterparse. Enable "start" events. Bale out on the first iteration.

Code:

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re

xml_text_1 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
"""

xml_text_2 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
<!--
That's all, folks! 
-->
"""

xml_text_3 = '''<?xml version="1.0" ?>
<!-- <mole1> -->
<root><foo /></root>
<!-- </mole2> -->'''

xml_text_4 = '''<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'''

for xml_text in (xml_text_1, xml_text_2, xml_text_3, xml_text_4):
    print
    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x:]
    print "*** eyquem 1:", repr(lastline.strip())

    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = xml_text[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print "*** eyquem 2:", repr(chrstr[x+1:])
    else:
        print "*** eyquem 2:", repr(lastline)

    m = None
    for m in re.finditer('^</[^>]+>', xml_text, re.MULTILINE):
        pass
    if m: print "*** eyquem 3:", repr(m.group())
    else: print "*** eyquem 3:", "FAIL"

    m = None
    for m in re.finditer('</[^>]+>', xml_text):
        pass
    if m: print "*** eyquem 4:", repr(m.group())
    else: print "*** eyquem 4:", "FAIL"

    m = re.search('^<(?![?!])[^>]+>', xml_text, re.MULTILINE)
    if m: print "*** eyquem 5:", repr(m.group())
    else: print "*** eyquem 5:", "FAIL"

    m = re.search('<(?![?!])[^>]+>', xml_text)
    if m: print "*** eyquem 6:", repr(m.group())
    else: print "*** eyquem 6:", "FAIL"

    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "*** parse:", tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "*** iterparse:", elem.tag
        break

Above ElementTree-related code works with Python 2.5 to 2.7. Will work with Python 2.2 to 2.4; you just need to get ElementTree and cElementTree from effbot.org and do some conditional importing. Should work with any lxml version.

Output:

*** eyquem 1: '>'
*** eyquem 2: '>'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '-->'
*** eyquem 2: '-->'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '<!-- </mole2> -->'
*** eyquem 2: '<root><foo /></root>'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: '<root>'
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

*** eyquem 1: '>'
*** eyquem 2: '<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: FAIL
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

Update 1 The above was demonstration code. Below is more like implementation code... just add exception handling. Tested with Python 2.7 and 2.2.

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import cElementTree as ET

def get_root_tag_from_xml_file(xml_file_path):
    result = f = None
    try:
        f = open(xml_file_path, 'rb')
        for event, elem in ET.iterparse(f, ('start', )):
            result = elem.tag
            break
    finally:
        if f: f.close()
    return result

if __name__ == "__main__":
    import sys, glob
    for pattern in sys.argv[1:]:
        for filename in glob.glob(pattern):
            print filename, get_root_tag_from_xml_file(filename)
浅紫色的梦幻 2024-10-26 13:50:02

我对你的问题的理解是这样的:你想要检查一个文件以确定它是否是你可识别的格式之一,并且只有在你知道它是可识别的格式之一时才将其解析为 XML。 @eyquem 是对的:您应该使用简单的字符串方法。

最简单的做法是从文件开头读取一些少量内容,看看它是否有您认识的根元素:

f = open(the_file)
head = f.read(200)
f.close()
if "<arachni_report" in head:
    #.. re-open and parse as arachni ..
elif "<nmaprun" in head:
    #.. re-open and parse as nmaprun ..

此方法的优点是在确定文件是否有趣之前只读取文件的少量内容文件与否。

My understanding of your problem is this: You want to examine a file to determine if it is one of your recognized formats, and only parse it as XML if you know that it is one of the recognized formats. @eyquem is right: you should use simple string methods.

The simplest thing to do is to read some small amount from the beginning of the file, and see if it has a root element you recognize:

f = open(the_file)
head = f.read(200)
f.close()
if "<arachni_report" in head:
    #.. re-open and parse as arachni ..
elif "<nmaprun" in head:
    #.. re-open and parse as nmaprun ..

This method has the advantage that only a small amount of the file is read before determining whether it's an interesting file or not.

酷炫老祖宗 2024-10-26 13:50:02

对于 XML 行家来说,这看起来有趣吗? :

ch = """\
<?xml version="1.0" encoding="ISO-8859-1" ?> 
<!--  Edited by XMLSpy® --> 
<CATALOG>
 <CD>
  <TITLE>Empire Burlesque</TITLE> 
  <ARTIST>Bob Dylan</ARTIST> 
  <COUNTRY>USA</COUNTRY> 
  <COMPANY>Columbia</COMPANY> 
  <PRICE>10.90</PRICE> 
  <YEAR>1985</YEAR> 
 </CD>
 <CD>
  <TITLE>Hide your heart</TITLE> 
  <ARTIST>Bonnie Tyler</ARTIST> 
  <COUNTRY>UK</COUNTRY> 
  <COMPANY>CBS Records</COMPANY> 
  <PRICE>9.90</PRICE> 
  <YEAR>1988</YEAR> 
 </CD>
</CATALOG>
<!-- This is the end of arachni report --> 

"""

chrstr = ch.strip()
x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
lastline = chrstr[x+1:]
if lastline[0:5]=='<!-- ':
    chrstr = ch[0:x].rstrip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    print chrstr[x+1:]
else:
    print lastline

结果,仍然

</CATALOG>

如果有必要,可以添加一个验证,以确保树根的开始标记也在文件的开头附近

如果文件很大,为了加快处理速度,我们可以将文件指针移到文件末尾附近(比如文件末尾前 200 或 600 个字符),以仅读取和搜索长度为 200 或 600 个字符的字符串( XL 树根的结束标记没有更长的长度,不是吗?)

from os.path import getsize

with open('I:\\uuu.txt') as f:

    L = getsize('I:\\uuu.txt')
    print 'L==',L

    f.seek( -min(600,L) , 2)
    ch = f.read()
    if '\r' not in ch and '\n' not in ch:
        f.seek(0,0)
        ch = f.read()        

    chrstr = ch.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = ch[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print chrstr[x+1:]
    else:
        print lastline

Does this seem interesting to a connoisseur of XML ? :

ch = """\
<?xml version="1.0" encoding="ISO-8859-1" ?> 
<!--  Edited by XMLSpy® --> 
<CATALOG>
 <CD>
  <TITLE>Empire Burlesque</TITLE> 
  <ARTIST>Bob Dylan</ARTIST> 
  <COUNTRY>USA</COUNTRY> 
  <COMPANY>Columbia</COMPANY> 
  <PRICE>10.90</PRICE> 
  <YEAR>1985</YEAR> 
 </CD>
 <CD>
  <TITLE>Hide your heart</TITLE> 
  <ARTIST>Bonnie Tyler</ARTIST> 
  <COUNTRY>UK</COUNTRY> 
  <COMPANY>CBS Records</COMPANY> 
  <PRICE>9.90</PRICE> 
  <YEAR>1988</YEAR> 
 </CD>
</CATALOG>
<!-- This is the end of arachni report --> 

"""

chrstr = ch.strip()
x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
lastline = chrstr[x+1:]
if lastline[0:5]=='<!-- ':
    chrstr = ch[0:x].rstrip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    print chrstr[x+1:]
else:
    print lastline

result, still

</CATALOG>

If necessary, one could add a verification that the start-tag of tree root is also around the beginning in the file

.

If the file is big, to speed up the treatment, we can move the pointeur of the file near the file's end (say 200 or 600 characters ante the end) to read and search in only a string of 200 or 600 characters long (the end-tag of the tree root of an XL doesn't have a greater length, does it ?)

from os.path import getsize

with open('I:\\uuu.txt') as f:

    L = getsize('I:\\uuu.txt')
    print 'L==',L

    f.seek( -min(600,L) , 2)
    ch = f.read()
    if '\r' not in ch and '\n' not in ch:
        f.seek(0,0)
        ch = f.read()        

    chrstr = ch.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = ch[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print chrstr[x+1:]
    else:
        print lastline
辞慾 2024-10-26 13:50:02

John Machin,当你证明我的代码无法正常工作时,你是认真的吗?

由于我不太了解 XML 格式,所以我去了那里:

W3C's XML validator

结论是您的文本样本格式不正确。因此 :

« XML 文档的定义
排除包含以下内容的文本
违反格式良好规则;
它们根本就不是 XML。 »

http://en.wikipedia.org/wiki/XML

真正的邪恶就在这里。

你的意思是我应该编写了能够检测非 XML 文件中的树根标记的代码吗?我不知道我要满足这个过高的要求。

下面的代码比仅使用字符串方法的代码稍微吓人一点。我之前没有给出它,因为我会收到通知“..whisp...你不得使用正则表达式来分析 XML 文本...whisp Whisp”

import re
from os.path import getsize

with open('I:\\uuu.txt') as f:

    L = getsize('I:\\uuu.txt')
    print 'L==',L

    f.seek( -min(600,L) , 2)
    for mat in re.finditer('^</[^>]+>',f.read(),re.MULTILINE):
        pass
    print mat.group()

它可以做同样的事情在 XML 的更嘈杂的开头。

事实上,我更喜欢 John Machin 给出的解决方案,使用 ElementTreeiterparse() 函数,就是这样!

编辑

毕竟,我想知道为什么这还不够......

import re

with open('I:\\uuu.txt') as f:
    print re.search('^<(?![?!])[^>]+>',f.read(),re.MULTILINE).group()

Are you serious, John Machin , when you show that my code wouldn't work correctly ?

Since I don't know well the XML format, I went there:

W3C's XML validator

Conclusion is that your text samples are not well-formed. Hence :

« The definition of an XML document
excludes texts which contain
violations of well-formedness rules;
they are simply not XML. »

http://en.wikipedia.org/wiki/XML

The real evil is here.

Did you mean that I was supposed to have written a code able to detect tree root's tag in non-XML files ? I didn't know I had this over-requirement to fulfill.

.

Here's a code that frightens a little less than the one using only string methods. I didn't give it before because I would have received notifications that "..whisp...you MUST not employ regexes to analyse an XML text... whisp whisp"

import re
from os.path import getsize

with open('I:\\uuu.txt') as f:

    L = getsize('I:\\uuu.txt')
    print 'L==',L

    f.seek( -min(600,L) , 2)
    for mat in re.finditer('^</[^>]+>',f.read(),re.MULTILINE):
        pass
    print mat.group()

It coud be done the same in the more noisy beginning of the XML.

In fact, I prefer the solution given by John Machin, with the iterparse() function of ElementTree, and that's it !

.

EDIT

After all, I wonder why this wouldn't be enough....

import re

with open('I:\\uuu.txt') as f:
    print re.search('^<(?![?!])[^>]+>',f.read(),re.MULTILINE).group()
〆凄凉。 2024-10-26 13:50:02

最终编辑:
感谢John Machin,我将根据他的答案(这是我选择的正确答案)使用以下代码(这是草稿)。

我还要感谢 eyquem 的回复和他坚持捍卫自己的代码,我真的学到了很多:)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from cStringIO import StringIO

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

class reportRecognizer(object):

    def __init__(self, xml_report_path):
        self.report_type = ""

        root_tag = self.get_root_tag_from_xml_file(xml_report_path)

        if root_tag:
            self.report_type = self.rType(root_tag)


    def get_root_tag_from_xml_file(self, xml_file_path):
        result = f = None
        try:
            f = open(xml_file_path, 'rb')
            for event, elem in ET.iterparse(f, ('start', )):
                result = elem.tag
                break
        except IOError, err:
            print "Error while opening file.\n%s. %s" % (err, filepath)
        finally:
            if f: f.close()
        return result

    def rType(self, tag):
        if "arachni_report" == tag:
            return "Arachni XML Report"
        elif "nmaprun" == tag:
            return "Nmap XML Report"
        else:
            return "Unrecognized XML file, sorry."

report = reportRecognizer('report.xml')
print report.report_type

Final edit:
Thanks to John Machin I'll be using this following code (this is a draft) based on his answer (which is the one I selected as correct).

I'd also like to thank eyquem for his responses and his persistence on defending his codes, I really learned a lot :)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from cStringIO import StringIO

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

class reportRecognizer(object):

    def __init__(self, xml_report_path):
        self.report_type = ""

        root_tag = self.get_root_tag_from_xml_file(xml_report_path)

        if root_tag:
            self.report_type = self.rType(root_tag)


    def get_root_tag_from_xml_file(self, xml_file_path):
        result = f = None
        try:
            f = open(xml_file_path, 'rb')
            for event, elem in ET.iterparse(f, ('start', )):
                result = elem.tag
                break
        except IOError, err:
            print "Error while opening file.\n%s. %s" % (err, filepath)
        finally:
            if f: f.close()
        return result

    def rType(self, tag):
        if "arachni_report" == tag:
            return "Arachni XML Report"
        elif "nmaprun" == tag:
            return "Nmap XML Report"
        else:
            return "Unrecognized XML file, sorry."

report = reportRecognizer('report.xml')
print report.report_type
孤独患者 2024-10-26 13:50:02

约翰·梅钦,这就是您要找的东西:我们的连续剧的续集。我验证了这次我的大脑在正确的位置,我继续思考这个问题。

这样您就扩展了演示代码。
现在,通过您的几个示例文本,我很清楚字符串方法还远远不够,我明白为什么。我非常有兴趣了解流程的底层并了解肯定的具体原因。

然后,我比以前更深入地研究了 XML 规范,并使用 W3c 的验证器进行了测试,以加深对 XML 文本结构细节的理解。这是一个相当严峻的职业,但也很有趣。我发现 XML 的格式是非常严格的规则和温文尔雅的自由的混合体。

从您在示例中使用的将我的代码分成几部分的技巧,我得出的结论是 XML 格式不需要将文本分成行。事实上,正如 W3c 的验证器向我展示的那样,字符 \n\r\t 可以可以出现在 XML 文本中的许多位置,前提是它们不违反结构规则。

例如,它们的授权不受标签之间的任何限制:因此,一个元素可能会占用多行。此外,即使标签也可以分成几行,或者分成几个表格 \t,前提是它们出现在一个标签的名称之后。 XML 文本的行不需要缩进,就像我一直看到的那样:我现在明白,这只是为了便于阅读和逻辑理解而选择的个人方便。

好吧,你比我更了解这一切,约翰·梅钦。感谢您,我现在意识到 XML 格式的复杂性,并且我更好地理解了通过专用解析器以外的其他方式进行解析不切实际的原因。我顺便想知道普通编码人员是否意识到 XML 格式的这种尴尬:XML 文本中可能会出现 \n 字符。

不管怎样,因为我已经在这个概念沸腾的锅里呆了一段时间了,所以我继续为你的 wac_moles 寻找解决方案,John Machin,作为一个有启发性的游戏。

字符串方法不再适用,我完善了我的正则表达式。

我知道,我知道:您会说即使使用正则表达式也无法分析 XML 文本。现在我更清楚为什么了,我同意。但我不会假装解析 XML 文本:我的正则表达式不会提取 XML 树的任何部分,它只会搜索一小部分文本。对于OP提出的问题,我认为使用正则表达式不是异端。

从一开始,我就认为搜索根的结束标签更容易和自然,因为结束标签没有属性,并且它周围的“噪音”比根的开始标签少。

所以我现在的解决方案是:

~~打开XML文件

~~将文件指针移到距离末尾-200的位置

~~读取文件的最后200个字符

~~这里有两个策略:

  1. 仅删除注释,然后使用正则表达式搜索标记,并考虑字符 \n、\r、\t
  2. 或者在搜索之前删除注释和所有字符\n、\r、\t
    具有更简单正则表达式的标签

越大,与使用 parse 或 iterparse 相比,该算法的速度越快。我编写并检查了以下代码的所有结果。第一个策略是更快的策略。

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re,urllib

xml5 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root\t
\r\t\r \r
><foo

>bar</foo\t \r></root
>
"""

xml6 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo
>bar</foo\n\t   \t></root \t
\r>
<!--  \r   \t
That's all, folks!

\t-->
"""

xml7 = '''<?xml version="1.0" ?>
<!-- <mole1> -->  
<root><foo

\t\t\r\r\t/></root  \t
>  
<!-- </mole2>\t \r
 \r-->
<!---->
'''

xml8 = '''<?xml version="1.0" ?><!-- \r<mole1> --><root>  \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->'''


sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
xml9 = sock.read()
sock.close()


def rp(x):
    return  '\\r' if x.group()=='\r' else '\\t'

for xml_text in (xml5, xml6, xml7, xml8, xml9):

    print '\\n\n'.join(re.sub('\r|\t',rp,xml_text).split('\n'))
    print '-----------------------------'

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions   # ^
    m  = re.search(RE11, xml_text_noc,re.DOTALL)
    print "***  eyquem 11: " + repr(m.group() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)'  # with group(1)   # ^
    m  = re.search(RE12, xml_text_noc,re.DOTALL)
    print "***  eyquem 12: " + repr(m.group(1) if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1)   # ^
    m  = re.search(RE13, xml_text_noc,re.DOTALL)
    print "***  eyquem 13: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")



    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions  # ^
    m  = re.search(RE14, xml_text_noc,re.DOTALL)
    print "***  eyquem 14: " + repr(m.group() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)'  # with group(1)   # <
    m  = re.search(RE15, xml_text_noc,re.DOTALL)
    print "***  eyquem 15: " + repr(m.group(1).rstrip() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1)   # <
    m  = re.search(RE16, xml_text_noc,re.DOTALL)
    print "***  eyquem 16: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")



    print
    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "***      parse:  " + tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "***  iterparse:  " + elem.tag
        break


    print '\n=============================================' 

结果

<?xml version="1.0" ?> \n
<!--  this is a comment --> \n
<root\t\n
\r\t\r \r\n
><foo\n
\n
>bar</foo\t \r></root\n
>\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?> \n
<!--  this is a comment --> \n
<root\n
><foo\n
>bar</foo\n
\t   \t></root \t\n
\r>\n
<!--  \r   \t\n
That's all, folks!\n
\n
\t-->\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?>\n
<!-- <mole1> -->  \n
<root><foo\n
\n
\t\t\r\r\t/></root  \t\n
>  \n
<!-- </mole2>\t\n
-->\n
<!---->\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?><!-- \r<mole1> --><root>  \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->
-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0"?>\r\n
<stylesheet\r\n
  xmlns="http://www.w3.org/XSL/Transform/1.0"\r\n
  xmlns:fo="http://www.w3.org/XSL/Format/1.0"\r\n
  result-ns="fo">\r\n
\r\n
  <template match="/">\r\n
    <fo:root xmlns:fo="http://www.w3.org/XSL/Format/1.0">\r\n
\r\n
      <fo:layout-master-set>\r\n
        <fo:simple-page-master page-master-name="only">\r\n
          <fo:region-body/>\r\n
        </fo:simple-page-master>\r\n
      </fo:layout-master-set>\r\n
\r\n
      <fo:page-sequence>\r\n
\r\n
       <fo:sequence-specification>\r\n
        <fo:sequence-specifier-single page-master-name="only"/>\r\n
       </fo:sequence-specification>\r\n
        \r\n
        <fo:flow>\r\n
          <apply-templates select="//ATOM"/>\r\n
        </fo:flow>\r\n
        \r\n
      </fo:page-sequence>\r\n
\r\n
    </fo:root>\r\n
  </template>\r\n
\r\n
  <template match="ATOM">\r\n
    <fo:block font-size="20pt" font-family="serif">\r\n
      <value-of select="NAME"/>\r\n
    </fo:block>\r\n
  </template>\r\n
\r\n
</stylesheet>\r\n

-----------------------------
***  eyquem 11: 'stylesheet'
***  eyquem 12: 'stylesheet'
***  eyquem 13: 'stylesheet'
***  eyquem 14: 'stylesheet'
***  eyquem 15: 'stylesheet'
***  eyquem 16: 'stylesheet'

***      parse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet
***  iterparse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet

=============================================

此代码现在测量执行时间:

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re
import urllib
from time import clock

sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
ch = sock.read()
sock.close()

# the following lines are intended to insert additional lines
# into the XML text before its recording in a file, in order to
# obtain a real file to use, containing an XML text 
# long enough to observe easily the timing's differences

li = ch.splitlines(True)[0:6] + 30*ch.splitlines(True)[6:-2] + ch.splitlines(True)[-2:]

with open('xml_example.xml','w') as f:
    f.write(''.join(li))

print 'length of XML text in a file : ',len(''.join(li)),'\n'



# timings

P,I,A,B,C,D,E,F = [],[],[],[],[],[],[],[],


n = 50

for cnt in xrange(50):

    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            tree = et.parse(filelike_obj)
            res_parse = tree.getroot().tag
    P.append( clock()-te)

    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
                res_iterparse = elem.tag
                break
    I.append(  clock()-te)


    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE11, xml_text_noc,re.DOTALL)
            res_eyq11 = m.group() if m else "FAIL"
    A.append(  clock()-te)


    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)'  # with group(1)   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE12, xml_text_noc,re.DOTALL)
            res_eyq12 = m.group(1) if m else "FAIL"
    B.append(  clock()-te)


    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1)   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE13, xml_text_noc,re.DOTALL)
            res_eyq13 = m.group()[2:-1] if m else "FAIL"
    C.append(  clock()-te)



    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions  # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE14, xml_text_noc,re.DOTALL)
            res_eyq14 = m.group() if m else "FAIL"
    D.append(  clock()-te)


    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)'  # with group(1)   # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE15, xml_text_noc,re.DOTALL)
            res_eyq15 = m.group(1) if m else "FAIL"
    E.append(  clock()-te)


    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1)   # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE16, xml_text_noc,re.DOTALL)
            res_eyq16 = m.group()[2:-1].rstrip() if m else "FAIL"
    F.append(  clock()-te)


print "***      parse:  " + res_parse, '  parse'
print "***  iterparse:  " + res_iterparse, '  iterparse'
print
print "***  eyquem 11:  " + repr(res_eyq11)
print "***  eyquem 12:  " + repr(res_eyq12)
print "***  eyquem 13:  " + repr(res_eyq13)
print "***  eyquem 14:  " + repr(res_eyq14)
print "***  eyquem 15:  " + repr(res_eyq15)
print "***  eyquem 16:  " + repr(res_eyq16)

print
print str(min(P))
print str(min(I))
print
print '\n'.join(str(u) for u in map(min,(A,B,C)))
print
print '\n'.join(str(u) for u in map(min,(D,E,F)))

结果:

length of XML text in a file :  22548 

***      parse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet   parse
***  iterparse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet   iterparse

***  eyquem 11:  'stylesheet'
***  eyquem 12:  'stylesheet'
***  eyquem 13:  'stylesheet'
***  eyquem 14:  'stylesheet'
***  eyquem 15:  'stylesheet'
***  eyquem 16:  'stylesheet'

0.220554691169
0.172240771802

0.0273236743636
0.0266525536625
0.0265308269626

0.0246300539733
0.0241203758299
0.0238024015203

考虑到您的简单需求,Aereal,我认为您不关心根的结束标记可能包含字符 \r \n \t 在其中,而不是单独的名称;因此,在我看来,最适合您的解决方案是:

def get_root_tag_from_xml_file(xml_file_path):
    with open(xml_file_path) as f:
        try:      f.seek(-200,2)
        except:   f.seek(0,0)
        finally:  xml_text_noc = re.sub('<!--.*?-->','', f.read(), flags= re.DOTALL)
        try:
            return re.search('</[^>]+>(?!.*</[^>]+>)' , xml_text_noc, re.DOTALL).group()
        except :
            return 'FAIL'

由于 John Machin 的专业知识,该解决方案比我以前的解决方案做得更可靠;此外,它完全满足了需求,正如所表达的那样:无需解析,因此是一种更快的方法,正如它隐含的目标一样。

John Machin,您是否会发现 XML 格式的一个新的棘手功能会使该解决方案失效?

Here is what you were looking for, John Machin: the sequel of our serial . I verified that this time my brain was in its correct place, and I continued to think about the problem.

So you have extended the demonstration code.
Now, with your several exemplifying texts, it is clear for me that the string methods are far to be sufficient, and I UNDERSTAND why. I am very interested to know the underneath of processes and to understand the concrete reasons of affirmations.

Then I studied more than I ever did the specifications of XML and performed tests with the W3c's validator to increase my understanding of details of the structure of a XML text. It's a rather severe occupation but interesting though. I saw that the format of an XML is a mix of very strict rules and of debonair liberties.

From the tricks you used in your exemples to tear my codes into pieces, I conclude that XML format doesn't require the text to be divided into lines. In fact, as the W3c's validator showed me, characters \n , \r and \t can be at many positions in a XML text, provided that they don't break a rule of structure.

For exemple they are authorized without any restriction between tags: as a consequence, an element may occupy several lines. Also, even tags can be splitted into several lines, or among several tabulations \t, provided that they occur after the name of one tag. There is nor requirement for the lines of a XML text to be indented as I always saw them: I understand now it's only a personal convenience choosen for ease of reading and logical comprehension.

Well, you know all that better than me, John Machin. Thanks to you, I am now alerted to the complexity of XML format and I better understand the reasons that make parsing unrealistic by other means than specialized parsers. I incidentally wonder if common coders are aware of this awkardness of XML format: the possibility of \n characters present here and there in an XML text.

.

Anyway, as I have been in this conceptual boiling pot for a while now, I continued to search for a solution for your whac_moles, John Machin, as an instructive play.

String methods being out of the game, I polished my regex.

I know, I know: you'll say me that analyzing an XML text can't be done even with a regex. Now that I know better why, I agree. But I don't pretend to parse an XML text: my regex won't extract any part of an XML tree, it will search only a little chunk of text. For the problem asked by OP, I consider the use of regex as non heretical.

.

From the beginning, I think that searching the end-tag of the root is more easy and natural, because an end-tag hasn't attributes and there is less "noise" around it than the start-tag of the root.

So my solution is now:

~~ open the XML file

~~ move the file's pointer to the position -200 from the end

~~ read the 200 last characters of the file

~~ here, two strategies:

  1. either remove only the comments and then searching the tag with a regex taking the characters \n, \r, \t in account
  2. or remove the comments and all the characters \n, \r, \t before searching
    the tag with a simpler regex

The bigger the file is, the speeder is this algorithm compared to the use of parse or iterparse. I wrote and examined all the results of the following codes. The first strategy is the faster one.

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re,urllib

xml5 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root\t
\r\t\r \r
><foo

>bar</foo\t \r></root
>
"""

xml6 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo
>bar</foo\n\t   \t></root \t
\r>
<!--  \r   \t
That's all, folks!

\t-->
"""

xml7 = '''<?xml version="1.0" ?>
<!-- <mole1> -->  
<root><foo

\t\t\r\r\t/></root  \t
>  
<!-- </mole2>\t \r
 \r-->
<!---->
'''

xml8 = '''<?xml version="1.0" ?><!-- \r<mole1> --><root>  \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->'''


sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
xml9 = sock.read()
sock.close()


def rp(x):
    return  '\\r' if x.group()=='\r' else '\\t'

for xml_text in (xml5, xml6, xml7, xml8, xml9):

    print '\\n\n'.join(re.sub('\r|\t',rp,xml_text).split('\n'))
    print '-----------------------------'

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions   # ^
    m  = re.search(RE11, xml_text_noc,re.DOTALL)
    print "***  eyquem 11: " + repr(m.group() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)'  # with group(1)   # ^
    m  = re.search(RE12, xml_text_noc,re.DOTALL)
    print "***  eyquem 12: " + repr(m.group(1) if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->|[\n\r\t]','', xml_text,flags=re.DOTALL)
    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1)   # ^
    m  = re.search(RE13, xml_text_noc,re.DOTALL)
    print "***  eyquem 13: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")



    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions  # ^
    m  = re.search(RE14, xml_text_noc,re.DOTALL)
    print "***  eyquem 14: " + repr(m.group() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)'  # with group(1)   # <
    m  = re.search(RE15, xml_text_noc,re.DOTALL)
    print "***  eyquem 15: " + repr(m.group(1).rstrip() if m else "FAIL")

    xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1)   # <
    m  = re.search(RE16, xml_text_noc,re.DOTALL)
    print "***  eyquem 16: " + repr(m.group()[2:-1].rstrip() if m else "FAIL")



    print
    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "***      parse:  " + tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "***  iterparse:  " + elem.tag
        break


    print '\n=============================================' 

Result

<?xml version="1.0" ?> \n
<!--  this is a comment --> \n
<root\t\n
\r\t\r \r\n
><foo\n
\n
>bar</foo\t \r></root\n
>\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?> \n
<!--  this is a comment --> \n
<root\n
><foo\n
>bar</foo\n
\t   \t></root \t\n
\r>\n
<!--  \r   \t\n
That's all, folks!\n
\n
\t-->\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?>\n
<!-- <mole1> -->  \n
<root><foo\n
\n
\t\t\r\r\t/></root  \t\n
>  \n
<!-- </mole2>\t\n
-->\n
<!---->\n

-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0" ?><!-- \r<mole1> --><root>  \t\t<foo \t\r\r/></root>\t<!-- </mole2> -->
-----------------------------
***  eyquem 11: 'root'
***  eyquem 12: 'root'
***  eyquem 13: 'root'
***  eyquem 14: 'root'
***  eyquem 15: 'root'
***  eyquem 16: 'root'

***      parse:  root
***  iterparse:  root

=============================================
<?xml version="1.0"?>\r\n
<stylesheet\r\n
  xmlns="http://www.w3.org/XSL/Transform/1.0"\r\n
  xmlns:fo="http://www.w3.org/XSL/Format/1.0"\r\n
  result-ns="fo">\r\n
\r\n
  <template match="/">\r\n
    <fo:root xmlns:fo="http://www.w3.org/XSL/Format/1.0">\r\n
\r\n
      <fo:layout-master-set>\r\n
        <fo:simple-page-master page-master-name="only">\r\n
          <fo:region-body/>\r\n
        </fo:simple-page-master>\r\n
      </fo:layout-master-set>\r\n
\r\n
      <fo:page-sequence>\r\n
\r\n
       <fo:sequence-specification>\r\n
        <fo:sequence-specifier-single page-master-name="only"/>\r\n
       </fo:sequence-specification>\r\n
        \r\n
        <fo:flow>\r\n
          <apply-templates select="//ATOM"/>\r\n
        </fo:flow>\r\n
        \r\n
      </fo:page-sequence>\r\n
\r\n
    </fo:root>\r\n
  </template>\r\n
\r\n
  <template match="ATOM">\r\n
    <fo:block font-size="20pt" font-family="serif">\r\n
      <value-of select="NAME"/>\r\n
    </fo:block>\r\n
  </template>\r\n
\r\n
</stylesheet>\r\n

-----------------------------
***  eyquem 11: 'stylesheet'
***  eyquem 12: 'stylesheet'
***  eyquem 13: 'stylesheet'
***  eyquem 14: 'stylesheet'
***  eyquem 15: 'stylesheet'
***  eyquem 16: 'stylesheet'

***      parse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet
***  iterparse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet

=============================================

This code now measures the execution's times:

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re
import urllib
from time import clock

sock = urllib.urlopen('http://www.cafeconleche.org/books/bible/examples/18/18-4.xsl')
ch = sock.read()
sock.close()

# the following lines are intended to insert additional lines
# into the XML text before its recording in a file, in order to
# obtain a real file to use, containing an XML text 
# long enough to observe easily the timing's differences

li = ch.splitlines(True)[0:6] + 30*ch.splitlines(True)[6:-2] + ch.splitlines(True)[-2:]

with open('xml_example.xml','w') as f:
    f.write(''.join(li))

print 'length of XML text in a file : ',len(''.join(li)),'\n'



# timings

P,I,A,B,C,D,E,F = [],[],[],[],[],[],[],[],


n = 50

for cnt in xrange(50):

    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            tree = et.parse(filelike_obj)
            res_parse = tree.getroot().tag
    P.append( clock()-te)

    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as filelike_obj:
            for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
                res_iterparse = elem.tag
                break
    I.append(  clock()-te)


    RE11 = '(?<=</)[^ >]+(?= *>)(?!.*</[^>]+>)' # with assertions   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE11, xml_text_noc,re.DOTALL)
            res_eyq11 = m.group() if m else "FAIL"
    A.append(  clock()-te)


    RE12 = '</([^ >]+) *>(?!.*</[^>]+>)'  # with group(1)   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE12, xml_text_noc,re.DOTALL)
            res_eyq12 = m.group(1) if m else "FAIL"
    B.append(  clock()-te)


    RE13 = '</[^ >]+ *>(?!.*</[^>]+>)' # without group(1)   # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('(<!--.*?-->|[\n\r\t])','', xml_text,flags=re.DOTALL)
            m  = re.search(RE13, xml_text_noc,re.DOTALL)
            res_eyq13 = m.group()[2:-1] if m else "FAIL"
    C.append(  clock()-te)



    RE14 = '(?<=</)[^ \n\r\t>]+(?=[ \n\r\t]*>)(?!.*</[^>]+>)' # with assertions  # ^
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE14, xml_text_noc,re.DOTALL)
            res_eyq14 = m.group() if m else "FAIL"
    D.append(  clock()-te)


    RE15 = '</([^ \n\r\t>]+)[ \n\r\t]*>(?!.*</[^>]+>)'  # with group(1)   # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE15, xml_text_noc,re.DOTALL)
            res_eyq15 = m.group(1) if m else "FAIL"
    E.append(  clock()-te)


    RE16 = '</[^ \n\r\t>]+[ \n\r\t]*>(?!.*</[^>]+>)' # without group(1)   # <
    te = clock()
    for i in xrange (n):
        with open('xml_example.xml') as f:
            f.seek(-200,2)
            xml_text = f.read()
            xml_text_noc = re.sub('<!--.*?-->','', xml_text,flags=re.DOTALL)
            m  = re.search(RE16, xml_text_noc,re.DOTALL)
            res_eyq16 = m.group()[2:-1].rstrip() if m else "FAIL"
    F.append(  clock()-te)


print "***      parse:  " + res_parse, '  parse'
print "***  iterparse:  " + res_iterparse, '  iterparse'
print
print "***  eyquem 11:  " + repr(res_eyq11)
print "***  eyquem 12:  " + repr(res_eyq12)
print "***  eyquem 13:  " + repr(res_eyq13)
print "***  eyquem 14:  " + repr(res_eyq14)
print "***  eyquem 15:  " + repr(res_eyq15)
print "***  eyquem 16:  " + repr(res_eyq16)

print
print str(min(P))
print str(min(I))
print
print '\n'.join(str(u) for u in map(min,(A,B,C)))
print
print '\n'.join(str(u) for u in map(min,(D,E,F)))

Result:

length of XML text in a file :  22548 

***      parse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet   parse
***  iterparse:  {http://www.w3.org/XSL/Transform/1.0}stylesheet   iterparse

***  eyquem 11:  'stylesheet'
***  eyquem 12:  'stylesheet'
***  eyquem 13:  'stylesheet'
***  eyquem 14:  'stylesheet'
***  eyquem 15:  'stylesheet'
***  eyquem 16:  'stylesheet'

0.220554691169
0.172240771802

0.0273236743636
0.0266525536625
0.0265308269626

0.0246300539733
0.0241203758299
0.0238024015203

.

.

Considering your unsophisticated need, Aereal, I think that you don't care to have an end-tag of the root with possible characters \r \n \t in it, instead of its name alone; So the best solution for you is, in my opinion:

def get_root_tag_from_xml_file(xml_file_path):
    with open(xml_file_path) as f:
        try:      f.seek(-200,2)
        except:   f.seek(0,0)
        finally:  xml_text_noc = re.sub('<!--.*?-->','', f.read(), flags= re.DOTALL)
        try:
            return re.search('</[^>]+>(?!.*</[^>]+>)' , xml_text_noc, re.DOTALL).group()
        except :
            return 'FAIL'

Thanks to the expertise of John Machin, this solution do a more reliable job than my previous one; and in addition it answers exactly to the demand, as it was expressed: no parsing, hence a faster method, as it was implicitly aimed at.

.

John Machin, will you find a new tricky feature of XML format that will invalidate this solution ?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文