Using lxml, how do I find and collect all the elements between tags of a particular type?
I have an HTML document with internal links (i.e., <a name="blah"></a> tags) at the start of certain sections.
I want to visit each internal link, and grab all the text contained in all the elements recursively in between.
For example, in between these 2 links:
<A name='G27866101'>
<DIV align="left" style="margin-left: 0%; margin-right: 0%; font-size: 10pt; font-family: Arial, Helvetica; color: #000000; background: transparent">
<B><FONT style="font-family: 'Times New Roman', Times">About This Section</FONT></B>
</DIV>
</A>
<DIV align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 3%; font-size: 10pt; font-family: 'Times New Roman', Times; color: #000000; background: transparent">
This section is part of a registration that we filed with the proper authorities ... blah ... for more information.
</DIV>
<A name='G27866102'>
I want to retrieve:
About This Section
This section is part of a registration that we filed with the proper authorities ... blah ... for more information.
And since the elements between links can have nested elements, I want to get all that text as well (i.e., recurse through each child element and get that text).
For example, from this:
<A name='G27866102'>
<DIV align="left" style="margin-left: 0%; margin-right: 0%; font-size: 10pt; font-family: Arial, Helvetica; color: #000000; background: transparent">
<B><FONT style="font-family: 'Times New Roman', Times">Additional Information</FONT></B>
</DIV>
</A>
<DIV align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 3%; font-size: 10pt; font-family: 'Times New Roman', Times; color: #000000; background: transparent">
As permitted by house rules, this section is ...
<DIV><FONT style="font-family: 'Times New Roman', Times">There's nested text here</FONT></DIV>
... blah ... the actual document.
</DIV>
I'd like to get:
Additional Information
As permitted by house rules, this section is ... There's nested text here ... blah ... the actual document.
I know about using findall('//a') and checking the attrib hash for a 'name' key, but that just gets me the <a name="blah"></a> tag elements.
Ideally, I'd like to be able to define a recursive get_all_nodes_in_between() function that would work like this:
anchors = html.findall('//a')
for i, anchor in enumerate(anchors):
    # only named anchors mark the start of a section
    if 'name' in anchor.attrib:
        all_elements = get_all_nodes_in_between(anchor, anchors[i + 1])
How can this be done?
Answers (1)
You can do this in BeautifulSoup by checking for the DIV tags and grabbing all the data within them:
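A minimal sketch of that idea, not the answerer's original code: it assumes the `bs4` package with the built-in `html.parser` backend, uses an abbreviated copy of the sample from the question, and the helper name `div_texts` is made up for illustration. It collects the whitespace-normalized text of every outermost DIV, so nested DIVs are folded into their parent rather than listed twice.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the sample markup from the question (style attributes dropped).
html = """
<A name='G27866101'>
<DIV align="left"><B><FONT>About This Section</FONT></B></DIV>
</A>
<DIV align="left">
This section is part of a registration ... for more information.
</DIV>
<A name='G27866102'>
<DIV align="left"><B><FONT>Additional Information</FONT></B></DIV>
</A>
<DIV align="left">
As permitted by house rules, this section is ...
<DIV><FONT>There's nested text here</FONT></DIV>
... blah ... the actual document.
</DIV>
"""

soup = BeautifulSoup(html, "html.parser")


def div_texts(soup):
    """Return the text of each outermost <div>, in document order."""
    texts = []
    for div in soup.find_all("div"):
        # A div nested inside another div is already covered by its parent.
        if div.find_parent("div") is not None:
            continue
        # get_text(" ") joins all descendant strings; split/join collapses whitespace.
        texts.append(" ".join(div.get_text(" ").split()))
    return texts


for text in div_texts(soup):
    print(text)
```

Note that this variant does not group the text by anchor; it only relies on the DIVs appearing in section order.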
Edited code from comments:
outputs:
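For comparison, the grouping asked for in the question can also be sketched with lxml itself. This is only one possible approach under stated assumptions: the helper names `walk` and `text_by_anchor` are hypothetical, the sample is an abbreviated copy of the question's markup, and each section's heading and body text are joined into a single string. The tree is walked in document order, and every piece of text is filed under the most recently seen named anchor.

```python
from lxml import html as lxml_html

# Abbreviated copy of the sample markup from the question (style attributes dropped).
SAMPLE = """
<A name='G27866101'>
<DIV align="left"><B><FONT>About This Section</FONT></B></DIV>
</A>
<DIV align="left">
This section is part of a registration ... for more information.
</DIV>
<A name='G27866102'>
<DIV align="left"><B><FONT>Additional Information</FONT></B></DIV>
</A>
<DIV align="left">
As permitted by house rules, this section is ...
<DIV><FONT>There's nested text here</FONT></DIV>
... blah ... the actual document.
</DIV>
"""


def walk(el):
    """Yield ('anchor', name) and ('text', string) events in document order."""
    if not isinstance(el.tag, str):
        return  # comments and processing instructions carry no section text
    if el.tag == "a" and el.get("name"):
        yield ("anchor", el.get("name"))
    if el.text:
        yield ("text", el.text)
    for child in el:
        yield from walk(child)
        if child.tail:
            # Text that follows the child's closing tag belongs to the parent.
            yield ("text", child.tail)


def text_by_anchor(root):
    """Map each named anchor to the text found before the next named anchor."""
    sections = {}
    current = None
    for kind, value in walk(root):
        if kind == "anchor":
            current = value
            sections[current] = []
        elif current is not None:
            sections[current].append(value)
    # Collapse runs of whitespace so each section becomes one clean string.
    return {name: " ".join(" ".join(parts).split()) for name, parts in sections.items()}


root = lxml_html.document_fromstring(SAMPLE)
for name, text in text_by_anchor(root).items():
    print(name, "->", text)
```

Because the walk tracks text rather than sibling elements, it works whether the content sits inside the anchor, after it, or in nested elements.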