Using lxml, how do I find and collect all elements between tags of a specific type?

Posted 2024-12-11 13:26:09

I have an html document with internal links (i.e., <a name="blah"></a> tags) at the start of certain sections.

I want to visit each internal link, and grab all the text contained in all the elements recursively in between.

For example, in between these 2 links:

<A name='G27866101'>
<DIV align="left" style="margin-left: 0%; margin-right: 0%; font-size: 10pt; font-family: Arial, Helvetica; color: #000000; background: transparent">
    <B><FONT style="font-family: 'Times New Roman', Times">About This Section</FONT></B>
</DIV>
</A>

<DIV align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 3%; font-size: 10pt; font-family: 'Times New Roman', Times; color: #000000; background: transparent">
    This section is part of a registration that we filed with the proper authorities ... blah ... for more information.
</DIV>

<A name='G27866102'>

I want to retrieve:

About This Section

This section is part of a registration that we filed with the proper authorities ... blah ... for more information.

And since the elements between links can have nested elements, I want to get all that text as well (i.e., recurse through each child element and get that text).

For example, from this:

<A name='G27866102'>
<DIV align="left" style="margin-left: 0%; margin-right: 0%; font-size: 10pt; font-family: Arial, Helvetica; color: #000000; background: transparent">
    <B><FONT style="font-family: 'Times New Roman', Times">Additional Information</FONT></B>
</DIV>
</A>

<DIV align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 3%; font-size: 10pt; font-family: 'Times New Roman', Times; color: #000000; background: transparent">
    As permitted by house rules, this section is ... 
    <DIV><FONT style="font-family: 'Times New Roman', Times">There's nested text here</FONT></DIV>
    ... blah ... the actual document.
</DIV>

I'd like to get:

Additional Information

As permitted by house rules, this section is ... There's nested text here ... blah ... the actual document.

I know about using findall('//a') and checking the attrib hash for a 'name' key, but that just gets me the <a name="blah"></a> tag elements.

Ideally, I'd like to be able to define a recursive get_all_nodes_in_between() function that would work like this:

anchors = html.findall('//a')
# Stop before the last anchor so that anchors[i + 1] always exists.
for i, anchor in enumerate(anchors[:-1]):
    if 'name' in anchor.attrib:
        all_elements = get_all_nodes_in_between(anchor, anchors[i + 1])

How can this be done?

Comments (1)

是你 2024-12-18 13:26:09

You can do this in BeautifulSoup by finding the DIV tags and grabbing all the text inside them:

Edited code from comments:

    # Beautiful Soup 4 (the bs4 package) replaces the old BeautifulSoup 3 import.
    from bs4 import BeautifulSoup

    html ='''
        <A name='G27866101'>
<DIV align="left" style="margin-left: 0%; margin-right: 0%; font-size: 10pt; font-family: Arial, Helvetica; color: #000000; background: transparent">
    <B><FONT style="font-family: 'Times New Roman', Times">About This Section</FONT></B>
</DIV>
</A>

<DIV align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 3%; font-size: 10pt; font-family: 'Times New Roman', Times; color: #000000; background: transparent">
    This section is part of a registration that we filed with the proper authorities ... blah ... for more information.
</DIV>

<A name='G27866102'>
    '''
    # Name the parser explicitly; html.parser ships with Python.
    soup = BeautifulSoup(html, 'html.parser')

    # Print the text of every DIV, including text in nested tags.
    for item in soup.find_all('div'):
        print(item.get_text())

outputs:

About This Section


This section is part of a registration that we filed with the proper authorities ... blah ... for more information.
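
The loop above prints the text of every DIV but does not group it by the anchor that starts each section, which is what the question asks for. Below is a minimal sketch of that grouping. It is an assumption-laden alternative, not the lxml solution: it uses the standard library's `html.parser` rather than lxml or BeautifulSoup, and the names `AnchorSectionParser` and `text_by_anchor` are hypothetical. The idea is to walk the document in order, open a new section at each `<a name="...">`, and attach every piece of text to the most recently opened section, however deeply it is nested.

```python
from html.parser import HTMLParser


class AnchorSectionParser(HTMLParser):
    """Collects text chunks under the most recent <a name="..."> anchor."""

    def __init__(self):
        super().__init__()
        self.sections = {}    # anchor name -> list of text chunks
        self._current = None  # name of the section currently being filled

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A NAME=...> matches.
        if tag == "a":
            name = dict(attrs).get("name")
            if name:
                self._current = name
                self.sections.setdefault(name, [])

    def handle_data(self, data):
        # Text before the first named anchor is ignored.
        if self._current is not None and data.strip():
            # Collapse runs of whitespace inside each chunk.
            self.sections[self._current].append(" ".join(data.split()))


def text_by_anchor(markup):
    """Return {anchor name: all text found up to the next named anchor}."""
    parser = AnchorSectionParser()
    parser.feed(markup)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.sections.items()}


# Simplified version of the markup from the question (style attributes dropped).
sample = """
<A name='G27866102'>
<DIV><B><FONT>Additional Information</FONT></B></DIV>
</A>
<DIV>
    As permitted by house rules, this section is ...
    <DIV><FONT>There's nested text here</FONT></DIV>
    ... blah ... the actual document.
</DIV>
<A name='G27866103'>
"""

for name, text in text_by_anchor(sample).items():
    print(name, "->", text)
```

Because the parser reacts to start tags and text only, it does not depend on the tags being properly closed, so a stray or unclosed tag inside a section should not change which anchor the text is filed under. The heading inside the anchor and the body after it end up in the same section, joined by single spaces.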