Multiple tag names in lxml's iterparse?

Posted 2024-09-15 07:37:40

Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal.

Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML page for both <td> and <div> tags.

Comments (2)

不气馁 2024-09-22 07:37:40

I know I'm late to the game, but maybe someone else needs help with the same issue.
This code will generate events for both Tag1 and Tag2 tags:

etree.iterparse(io.BytesIO(xml), events=('end',), tag=('Tag1', 'Tag2'))
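
For anyone who wants to try the one-liner above end to end, here is a minimal sketch. The sample document and the `collect` helper are made up for illustration; it also assumes a reasonably recent lxml, since as far as I know older releases only accepted a single tag name for `tag`.

```python
import io
from lxml import etree

# Hypothetical sample document; Tag1/Tag2 mirror the names used above.
xml = b"<root><Tag1>a</Tag1><Other>x</Other><Tag2>b</Tag2><Tag1>c</Tag1></root>"

def collect(data, tags):
    """Return (tag, text) pairs for elements whose tag is listed in `tags`."""
    found = []
    # `tag` takes a sequence here, so only matching elements produce events.
    for _event, elem in etree.iterparse(io.BytesIO(data), events=("end",), tag=tags):
        found.append((elem.tag, elem.text))
    return found

matches = collect(xml, ("Tag1", "Tag2"))
print(matches)
```

The `Other` element is still parsed, but no event is generated for it, so the loop body only ever sees `Tag1` and `Tag2`.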

丢了幸福的猪 2024-09-22 07:37:40

I'm not 100% sure what you mean here by "getting all tags", but perhaps this is what you're looking for:

from lxml.etree import iterparse

# file_like_object and exit_condition are placeholders for your own input and stop test
for event, elem in iterparse(file_like_object):
    if elem.tag in ('td', 'div'):
        # reached the end of an interesting tag
        print('found:', elem.tag)
        # possibly quit early to prevent further parsing
        if exit_condition:
            break

iterparse generates events on the fly during parsing, so you're only reading as much data as is required. However, there's no way you can skip reading elements during parsing, as you wouldn't know how far to skip. In the above, we just ignore tags that we're not interested in.
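
As a self-contained variant of the same idea, the sketch below filters by hand and stops once enough matches have been seen. The `first_matches` helper, its `limit` parameter, and the sample input are hypothetical, and the `elem.clear()` call is an optional memory-saving habit rather than something the approach requires.

```python
import io
from lxml import etree

def first_matches(source, wanted, limit):
    """Collect up to `limit` tag names from `wanted`, then stop iterating."""
    found = []
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag in wanted:
            found.append(elem.tag)
            if len(found) >= limit:
                break  # stop requesting further events from the parser
        # Optional: discard text and children of finished elements to save memory.
        elem.clear()
    return found

# Hypothetical input; any file-like object opened in binary mode would do.
doc = io.BytesIO(b"<root><td>1</td><p>skip</p><div>2</div><td>3</td></root>")
result = first_matches(doc, {"td", "div"}, limit=2)
print(result)
```

Note that the parser reads its input in chunks, so breaking out of the loop stops the parsing work but does not guarantee that no bytes past the last match were read.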

As you may already know: don't use XML parsers for HTML. Edit: it turns out that lxml supports HTML parsing, but you should check the docs to see to what extent.
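
For the `<td>`/`<div>` case from the question, one option that may work is the `html` keyword of `lxml.etree.iterparse` combined with the `tag` filter. This is only a sketch on a small, well-formed sample page that I made up; how gracefully it copes with messy real-world HTML is something to verify against the lxml docs.

```python
import io
from lxml import etree

# Hypothetical sample page, kept well-formed on purpose.
html = (
    b"<html><body><table><tr><td>cell</td></tr></table>"
    b"<div>block</div><p>skip</p></body></html>"
)

# html=True asks iterparse to use the HTML parser instead of the XML one.
html_tags = [
    elem.tag
    for _event, elem in etree.iterparse(
        io.BytesIO(html), events=("end",), tag=("td", "div"), html=True
    )
]
print(html_tags)
```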
