Get all of the soup above a certain div

Posted on 2025-02-10 22:49:58 · word count 388 · views 1 · comments 0

I have a soup of this format:

<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>

I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between them is not constant, so I can't just take the first three paragraphs (there could be anywhere from 1 to 5).

How do I go about dividing this soup to get the paragraphs? Regex seemed decent at first, but it didn't work for me because I still need a soup object afterwards for further extraction.

Thanks a ton

Comments (2)

爱你是孤单的心事 2025-02-17 22:49:58

You could select the table, iterate over its next siblings, and break at the first one that is not a p:

for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)

Or, the other way around and closer to your initial question, select the <div class = 'bar'> and call find_previous_siblings('p'):

for t in soup.select_one('.bar').find_previous_siblings('p'):
    print(t)
Example
from bs4 import BeautifulSoup

html='''
<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
Output
<p> </p>
<p> </p>
<p> </p>
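
A note on the find_previous_siblings('p') variant above: it walks backwards from the bar div, so the paragraphs come back nearest-first, and it collects every earlier p sibling (including any that might sit before the table). The following is a minimal sketch, assuming the bar div is properly closed and using made-up paragraph text ("one", "two", "three") so the ordering is visible; it restores document order and shows that each result is still a Tag that supports further extraction.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: same shape as the question, but with closed tags and
# placeholder paragraph text so the order of the results can be seen.
html = """
<div class="foo">
  <table></table>
  <p>one</p>
  <p>two</p>
  <p>three</p>
  <div class="bar">
    <p>inside bar</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

bar = soup.select_one(".bar")

# find_previous_siblings returns the nearest sibling first,
# so reverse the result to get the paragraphs in document order.
paragraphs = list(reversed(bar.find_previous_siblings("p")))

# Each item is a bs4 Tag, so the usual extraction methods still work.
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```

Because the items are Tag objects rather than strings, calls such as p.find(...) or p.select(...) remain available on each paragraph, which is what the regex approach could not offer.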
独闯女儿国 2025-02-17 22:49:58

If the HTML is as shown, just use :not to filter out the p tags that are later siblings of the bar div:

from bs4 import BeautifulSoup

html='''
<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

soup.select('.foo > table ~ p:not(.bar ~ p)')
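
To see what the :not(.bar ~ p) part of that selector actually excludes, here is a small sketch under a few assumptions: the bar div is closed, an extra paragraph ("after bar") follows it as a sibling, and the paragraph text is invented for illustration. None of that text comes from the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: a closed .bar div followed by one more sibling <p>,
# which is the case the :not(...) clause is meant to filter out.
html = """
<div class="foo">
  <table></table>
  <p>one</p>
  <p>two</p>
  <p>three</p>
  <div class="bar">
    <p>inside bar</p>
  </div>
  <p>after bar</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Without :not, every <p> sibling that follows the table is matched,
# including the one that comes after .bar.
all_after_table = soup.select(".foo > table ~ p")

# With :not(.bar ~ p), paragraphs that follow .bar are dropped.
between = soup.select(".foo > table ~ p:not(.bar ~ p)")

all_texts = [p.get_text() for p in all_after_table]
between_texts = [p.get_text() for p in between]
print(all_texts)
print(between_texts)
```

The paragraph nested inside the bar div is never matched by either selector, because it is not a sibling of the table. select returns Tag objects in document order, so the results can be passed straight on to further extraction.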