Is there an elegant way to count tag elements in an XML file using lxml in Python?

Posted on 2024-11-16 23:58:33

I could read the content of the XML file into a string and use string operations to achieve this, but I guess there is a more elegant way to do it. Since I did not find a clue in the docs, I am asking here:

Given an XML file (see below), what is the most elegant way to count XML tags, such as the author tags in the example below? We assume that each author appears exactly once.

<root>
    <author>Tim</author>
    <author>Eva</author>
    <author>Martin</author>
    etc.
</root>

This XML file is trivial, but the authors may not always be listed one after another; there may be other tags between them.


Comments (3)

吃素的狼 2024-11-23 23:58:33

If you want to count all author tags:

import lxml.etree
doc = lxml.etree.parse(xml)  # xml: path to the XML file, or an open file object
count = doc.xpath('count(//author)')  # XPath count() returns a float, e.g. 3.0
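
For readers who want an integer result, or who want to see that tags sitting between the authors do not affect the count, here is a minimal sketch. The sample data (including the `<editor>` element) is hypothetical and only modelled on the question's example. The fallback import of the standard library's ElementTree is an assumption made so the sketch runs without lxml; `xpath()` is only available on lxml elements, so that call is guarded.

```python
try:
    from lxml import etree  # the library used in the answer above
except ImportError:  # assumption: fall back to the stdlib API if lxml is missing
    import xml.etree.ElementTree as etree

# Hypothetical sample data: the question's example plus an <editor>
# element placed between the authors.
XML = b"""<root>
    <author>Tim</author>
    <editor>Ann</editor>
    <author>Eva</author>
    <author>Martin</author>
</root>"""

root = etree.fromstring(XML)

# iter() walks the whole tree, so nesting depth and intervening
# tags do not matter.
count_iter = sum(1 for _ in root.iter("author"))

# findall() with './/' also searches all descendants and returns a list.
count_findall = len(root.findall(".//author"))

print(count_iter, count_findall)

# xpath() exists only on lxml elements; count() yields a float,
# so convert it when an integer is wanted.
if hasattr(root, "xpath"):
    count_xpath = int(root.xpath("count(//author)"))
    print(count_xpath)
```

Both `iter()` and `findall()` return plain Python objects, so the count is an `int` without any conversion.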
伴我心暖 2024-11-23 23:58:33

Use an XPath with count.

伤痕我心 2024-11-23 23:58:33

One must be careful when using the re module on SGML/XML/HTML text, because not every treatment of such files can be performed with regular expressions (regexes cannot parse SGML/HTML/XML text in general).

But in this particular problem it seems to me that it is possible (re.DOTALL is mandatory because an element may extend over more than one line; apart from that, I can't imagine any other possible pitfall).

from time import perf_counter  # time.clock was removed in Python 3.8
n = 10000
print('n ==', n, '\n')

# Time n XPath counts on a document that is parsed once, up front
import lxml.etree
doc = lxml.etree.parse('xml.txt')

te = perf_counter()
for i in range(n):
    countlxml = doc.xpath('count(//author)')
tf = perf_counter()
print('lxml\ncount:', countlxml, '\n', tf - te, 'seconds')

# Time n regex counts on the raw text of the same file
import re
with open('xml.txt') as f:
    ch = f.read()

regx = re.compile('<author>.*?</author>', re.DOTALL)
te = perf_counter()
for i in range(n):
    countre = sum(1 for mat in regx.finditer(ch))
tf = perf_counter()
print('\nre\ncount:', countre, '\n', tf - te, 'seconds')

result

n == 10000 

lxml
count: 3.0 
2.84083032899 seconds

re
count: 3 
0.141663256084 seconds
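
The regex used in the benchmark above matches only the literal text `<author>...</author>`. As a hedged illustration of where that could diverge from a parser-based count, the sketch below uses hypothetical input (not the `xml.txt` from the benchmark) containing an attribute, a self-closing tag and a commented-out element. The fallback import of the standard library's ElementTree is an assumption made so the sketch runs without lxml.

```python
import re

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is missing
    import xml.etree.ElementTree as etree

# Hypothetical input chosen to exercise cases the literal pattern
# does not anticipate.
XML = """<root>
    <author>Tim</author>
    <author id="2">Eva</author>
    <author/>
    <!-- <author>Old</author> -->
</root>"""

# Same pattern as in the benchmark above.
regx = re.compile(r"<author>.*?</author>", re.DOTALL)
count_re = sum(1 for _ in regx.finditer(XML))

# A parser sees three <author> elements and ignores the comment.
root = etree.fromstring(XML)
count_parser = sum(1 for _ in root.iter("author"))

print("regex:", count_re)
print("parser:", count_parser)
```

On this input the regex skips the element with an attribute and the self-closing element, yet counts the one inside the comment, so the two results differ. Whether that matters depends on how uniform the real files are.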