Multiple tag names in lxml's iterparse?
Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal.
Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML page for both <td> and <div> tags.
Comments (2)
I know I'm late to the game, but maybe someone else needs help with the same issue.
This code will generate events for both Tag1 and Tag2 tags:
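A minimal sketch of what such a call might look like, assuming lxml 3.0 or later, where the tag argument of iterparse accepts a sequence of names. The sample document and the helper name iter_selected are hypothetical and only serve the illustration.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample document, used only for illustration.
SAMPLE = b"<root><Tag1>a</Tag1><Other>x</Other><Tag2>b</Tag2><Tag1>c</Tag1></root>"


def iter_selected(source, tags=("Tag1", "Tag2")):
    """Yield (tag, text) for every element whose name is listed in `tags`."""
    # Assumption: lxml >= 3.0, where `tag` may be a sequence of tag names.
    # Only "end" events are requested, so each element is complete when seen.
    for _event, elem in etree.iterparse(source, events=("end",), tag=tags):
        yield elem.tag, elem.text
        # Drop the element's content once handled to keep memory use low.
        elem.clear()


selected = list(iter_selected(BytesIO(SAMPLE)))
for tag, text in selected:
    print(tag, text)
```

The parser still reads the whole input, but only the listed tags produce events, so the loop body never sees the Other element.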
I'm not 100% sure what you mean here by "getting all tags", but perhaps this is what you're looking for:
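A minimal sketch of the filtering approach, assuming the idea is to request every event and skip the tags that are not of interest. The names WANTED and iter_wanted and the sample document are hypothetical. The fallback to the standard library's xml.etree.ElementTree is an addition for portability, since the manual filter does not depend on any lxml-specific argument.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module offers the same iterparse call shape.
    import xml.etree.ElementTree as etree

# Hypothetical sample document and tag set, used only for illustration.
SAMPLE = b"<root><Tag1>a</Tag1><Other>x</Other><Tag2>b</Tag2></root>"
WANTED = {"Tag1", "Tag2"}


def iter_wanted(source, wanted=WANTED):
    """Yield (tag, text) for wanted elements, ignoring all other tags."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag not in wanted:
            continue  # not interested in this tag; it is still parsed
        yield elem.tag, elem.text
        elem.clear()  # release the handled element's content


wanted_items = list(iter_wanted(BytesIO(SAMPLE)))
print(wanted_items)
```

Skipped elements are still read and built by the parser; the filter only decides which ones the calling code handles.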
iterparse generates events on the fly during parsing, so you only read as much data as is required. However, there is no way to skip reading elements during parsing, because the parser cannot know how far to skip. In the above, we just ignore tags that we're not interested in.
As you may already know: don't use XML parsers for HTML. Edit: It turns out that lxml supports HTML parsing, but you should check the docs to see to what extent.
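For the HTML case from the question, a hedged sketch follows. It assumes a reasonably recent lxml in which iterparse accepts the html=True keyword alongside a sequence for tag; check the documentation for your installed version before relying on it. The sample page and the helper name iter_html_tags are hypothetical.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample page, used only for illustration.
PAGE = b"<html><body><div>a</div><table><tr><td>b</td></tr></table></body></html>"


def iter_html_tags(source, tags=("td", "div")):
    """Yield (tag, text) for the listed tags, parsing the input as HTML."""
    # Assumption: this lxml version supports html=True on iterparse.
    for _event, elem in etree.iterparse(
        source, events=("end",), tag=tags, html=True
    ):
        yield elem.tag, elem.text


cells = list(iter_html_tags(BytesIO(PAGE)))
print(cells)
```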