Get all of the soup above a certain div

Posted on 2025-02-10 22:49:58

I have a soup of this format:

<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>

I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between them is not constant, so I can't just take the first three paragraphs (there could be anywhere from 1 to 5).

How do I go about dividing this soup to get the paragraphs? Regex seemed decent at first, but it didn't work for me, because I still need a soup object afterwards for further extraction.

Thanks a ton

爱你是孤单的心事 2025-02-17 22:49:58

You could select the table, iterate over its following siblings, and break at the first one that is not a p:

for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)

Or go the other way around, which is closer to your initial question: select the <div class = 'bar'> and call find_previous_siblings('p'):

for t in soup.select_one('.bar').find_previous_siblings('p'):
    print(t)

Example

from bs4 import BeautifulSoup

html='''
<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)

Output

<p> </p>
<p> </p>
<p> </p>
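
The second variant, find_previous_siblings('p'), can be tried the same way. The snippet below is only a sketch: the markup is a tidied-up copy of the question's HTML, and the paragraph texts ("one", "two", "three", "inside bar") are made up for illustration. One thing worth knowing is that find_previous_siblings walks backwards from the starting element, so the nearest paragraph comes first; reversing the result restores document order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the paragraph texts are invented.
html = """
<div class="foo">
  <table></table>
  <p>one</p>
  <p>two</p>
  <p>three</p>
  <div class="bar">
    <p>inside bar</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

bar = soup.select_one(".bar")

# Siblings are returned nearest-first (three, two, one) ...
paragraphs = bar.find_previous_siblings("p")
# ... so reverse the list in place to get document order.
paragraphs.reverse()

# Each item is still a Tag, so further extraction keeps working.
texts = [p.get_text() for p in paragraphs]
print(texts)
```

Note that this variant collects every p sibling before the bar div, so it only matches "between the table and bar" when no paragraphs precede the table.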

独闯女儿国 2025-02-17 22:49:58

If the HTML is as shown, just use :not to filter out the p tags that are later siblings of .bar:

from bs4 import BeautifulSoup

html='''
<div class = 'foo'>
  <table> </table>
  <p> </p>
  <p> </p>
  <p> </p>
  <div class = 'bar'>
  <p> </p>
  .
  .
</div>
'''
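soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

# p siblings that follow the table, minus any p siblings that follow .bar
print(soup.select('.foo > table ~ p:not(.bar ~ p)'))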
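
To see what the :not(.bar ~ p) part actually excludes, here is a small sketch on hypothetical markup: it is the question's structure with the tags closed, invented paragraph texts, and one extra p placed after the bar div (an assumption, not part of the original HTML). The selector keeps the paragraphs that follow the table and drops the one that follows .bar, and the results are ordinary Tag objects, so they can be used for further extraction.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the trailing <p>after bar</p> is added for illustration.
html = """
<div class="foo">
  <table></table>
  <p>one</p>
  <p>two</p>
  <p>three</p>
  <div class="bar">
    <p>inside bar</p>
  </div>
  <p>after bar</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "table ~ p" matches every p sibling after the table;
# ":not(.bar ~ p)" removes the p siblings that come after .bar.
matches = soup.select(".foo > table ~ p:not(.bar ~ p)")

texts = [p.get_text() for p in matches]
print(texts)
```

The paragraph nested inside the bar div is never a candidate, because it is a child of .bar rather than a sibling of the table.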
