Get all of the soup above a certain div
I have a soup of this format:
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between them is not constant, so I can't just take the first three paragraphs (there could be anywhere from 1 to 5).
How do I go about dividing this soup to get the paragraphs? Regex seemed decent at first, but it didn't work for me because I would still need a soup object afterwards for further extraction.
Thanks a ton
Answers (2)
You could select your element, iterate over its siblings, and break as soon as a sibling is not a p. Or go the other way around, which is closer to your initial question: select the <div class = 'bar'> and call find_previous_siblings('p') on it.
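The answer's original example did not survive on this page, so here is a minimal sketch of both approaches. The sample HTML, the paragraph texts, and the variable names are assumptions made for illustration. Both results are lists of Tag objects, so further extraction still works on them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question's structure.
html = """
<div class="foo">
  <table></table>
  <p>one</p>
  <p>two</p>
  <p>three</p>
  <div class="bar">
    <p>inside bar</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: walk forward from the table and stop at the first non-<p> tag.
table = soup.select_one("div.foo > table")
forward = []
for sibling in table.find_next_siblings(True):  # True matches tags only, not whitespace strings
    if sibling.name != "p":
        break
    forward.append(sibling)

# Approach 2: start at div.bar and look backwards.
bar = soup.select_one("div.foo > div.bar")
# find_previous_siblings returns the nearest sibling first, so reverse for document order.
backward = bar.find_previous_siblings("p")[::-1]

print([p.get_text() for p in forward])   # ['one', 'two', 'three']
print([p.get_text() for p in backward])  # ['one', 'two', 'three']
```

Note that the second approach would also pick up any p that sits before the table, so the first one is the stricter match for "between the table and the bar div".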
If the HTML is as shown, just use the CSS :not pseudo-class to filter out the later sibling p tags.
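The selector itself is missing from this page, so the following is only a sketch of what a :not based filter could look like. The selector string, the sample HTML, and the extra paragraph after the bar div are assumptions. It relies on the soupsieve package that Beautiful Soup uses behind select().

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the trailing <p> is added to show what :not removes.
html = """
<div class="foo">
  <table></table>
  <p>one</p>
  <p>two</p>
  <div class="bar"><p>inside bar</p></div>
  <p>after bar</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Take every <p> that follows the table as a sibling,
# except those that also follow div.bar as a sibling.
paragraphs = soup.select("div.foo > table ~ p:not(div.bar ~ p)")

print([p.get_text() for p in paragraphs])
```

select() returns Tag objects as well, so each paragraph can be searched further in the usual way.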