Is there an elegant way to count tag elements in an XML file using lxml in Python?

Posted on 2024-11-16 23:58:33

I could read the content of the XML file into a string and use string operations to achieve this, but I guess there is a more elegant way to do it. Since I did not find a clue in the docs, I am asking here:

Given an XML file (see below), what is the most elegant way to count XML tags, such as the author tags in the example below? Assume that each author appears exactly once.

<root>
    <author>Tim</author>
    <author>Eva</author>
    <author>Martin</author>
    etc.
</root>

This XML file is trivial, but the authors may not always be listed one after another; there may be other tags between them.

吃素的狼 2024-11-23 23:58:33

If you want to count all author tags:

import lxml.etree
doc = lxml.etree.parse(xml)  # xml: path to the XML file, or an open file object
count = doc.xpath('count(//author)')  # XPath count() returns a float, e.g. 3.0
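
As a minimal sketch of the same idea in self-contained form, the snippet below parses an in-memory sample instead of a file and compares three ways of counting. The sample document (with `title` and `editor` tags between the authors) and the helper name `count_tag` are assumptions made for illustration; they are not part of the original answer.

```python
from lxml import etree

# Hypothetical sample: authors interleaved with other tags.
XML = b"""<root>
    <author>Tim</author>
    <title>Some book</title>
    <author>Eva</author>
    <editor>Joe</editor>
    <author>Martin</author>
</root>"""


def count_tag(root, tag):
    """Count elements named `tag` at any depth, including `root` itself."""
    # iter() walks the tree lazily in document order, filtering by tag name.
    return sum(1 for _ in root.iter(tag))


root = etree.fromstring(XML)

xpath_count = int(root.xpath("count(//author)"))  # count() yields a float
findall_count = len(root.findall(".//author"))    # builds a list of matches
iter_count = count_tag(root, "author")            # no intermediate list

print(xpath_count, findall_count, iter_count)
```

All three approaches should agree on this sample. The XPath version is the most compact, while `iter()` avoids building a list when the document is large.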

伴我心暖 2024-11-23 23:58:33

Use XPath with the count function.
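
A short sketch of what that can look like with lxml, assuming a hypothetical sample in which one `author` is nested inside a `book` element. It contrasts an any-depth count with a direct-children count, and shows a precompiled expression via `etree.XPath`; the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical sample: one <author> is nested inside a <book> element.
root = etree.fromstring(
    b"<root><author>Tim</author>"
    b"<book><author>Eva</author></book>"
    b"<author>Martin</author></root>"
)

# //author matches at any depth; /root/author matches only direct
# children of the <root> element.
all_authors = int(root.xpath("count(//author)"))
top_level_authors = int(root.xpath("count(/root/author)"))

# A precompiled expression avoids re-parsing the XPath string on each call.
count_authors = etree.XPath("count(//author)")
compiled_result = int(count_authors(root))

print(all_authors, top_level_authors, compiled_result)
```

Which path to use depends on whether nested authors should be included in the total.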

伤痕我心 2024-11-23 23:58:33

One must be careful when using the re module to process SGML/XML/HTML text, because not all processing of such files can be performed with regexes (regexes cannot parse SGML/HTML/XML text in general).

But in this particular problem, it seems to me that it is possible (re.DOTALL is mandatory because an element may extend over more than one line; apart from that, I can't imagine any other possible pitfall).

from time import perf_counter  # time.clock() was removed in Python 3.8
import re
import lxml.etree

n = 10000
print('n ==', n, '\n')

# lxml: the file is parsed once, outside the timed loop.
doc = lxml.etree.parse('xml.txt')

te = perf_counter()
for i in range(n):
    countlxml = doc.xpath('count(//author)')
tf = perf_counter()
print('lxml\ncount:', countlxml, '\n', tf - te, 'seconds')

# re: the file is read once, outside the timed loop.
with open('xml.txt') as f:
    ch = f.read()

regx = re.compile('<author>.*?</author>', re.DOTALL)
te = perf_counter()
for i in range(n):
    countre = sum(1 for mat in regx.finditer(ch))
tf = perf_counter()
print('\nre\ncount:', countre, '\n', tf - te, 'seconds')

Result

n == 10000 

lxml
count: 3.0 
2.84083032899 seconds

re
count: 3 
0.141663256084 seconds
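
The regex shortcut relies on every author element being written literally as `<author>...</author>`. As a hedged illustration of where that assumption can break, the sketch below uses a hypothetical input (not the sample from the question) containing an attribute, a self-closing tag, and a commented-out element, and compares the two counts.

```python
import re
from lxml import etree

# Hypothetical input with three real <author> elements and one that is
# commented out. None of these variations appear in the question's sample.
XML = """<root>
    <author id="1">Tim</author>
    <author/>
    <!-- <author>Old</author> -->
    <author>Eva</author>
</root>"""

regx = re.compile('<author>.*?</author>', re.DOTALL)
regex_count = sum(1 for _ in regx.finditer(XML))

root = etree.fromstring(XML)
parser_count = int(root.xpath('count(//author)'))

# The regex misses the element with an attribute and the self-closing one,
# and it counts the element inside the comment.
print('regex:', regex_count, 'parser:', parser_count)
```

On input like this the two counts differ, so the speed advantage of the regex only holds when the file layout is known to be as simple as in the question.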

About the author

断舍离
