Question... BeautifulSoup parsing

Posted on 2024-12-01 12:29:35

<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>

I would like to extract the text that runs from "Mr. Paul" to "bla bla".
Some pages have a <p> in front of "Mr. Paul", so I can use findNext('p').
However, some pages do not have the <p>, as in the example above.

This is my code for the case where there is a <p>:

background = bs2.find(text=re.compile("BACKGROUND"))
bb = background.findNext('p').contents

When there is no <p>, how can I extract the information?


Comments (2)

唠甜嗑 2024-12-08 12:29:35

It's hard to tell from the example you have given us, but it looks to me as though you could just take the next node after the h2. In this example, Lewis Carroll has a <p> (paragraph) tag, while your friend Paul has only a closing span tag:

>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ',  p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
>  Mr. Lewis Carroll has bla bla
>  Mr. Paul J. Fribourg has bla bla

Following up on the comments:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
... 
>  Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]

You may, of course, wish to check copyright notices, et cetera...
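
The sessions above are written for Python 2 and BeautifulSoup 3. For readers on Python 3, here is a minimal sketch of the same idea, assuming the bs4 package (Beautiful Soup 4) is installed and the built-in "html.parser" backend is used. The function name background_texts is hypothetical and not part of the answer above. Rather than checking for a <p> first, it walks the siblings that follow each heading and takes the first one with non-blank text, whatever kind of node it is:

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
'''


def background_texts(markup):
    """Return the first non-blank text that follows each BACKGROUND heading."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for heading in soup.find_all("h2", string="BACKGROUND"):
        for node in heading.next_siblings:
            # A sibling is either a bare string or a tag such as <p>.
            if isinstance(node, NavigableString):
                text = str(node)
            else:
                text = node.get_text()
            text = text.strip()
            if text:  # skip whitespace-only nodes between the tags
                results.append(text)
                break
    return results


for text in background_texts(html):
    print('> ', text)
```

Other parsers (lxml, html5lib) may repair the stray closing span differently, so the tree, and therefore the result, can differ from what "html.parser" produces.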

懒猫 2024-12-08 12:29:35

"Some webpage has <p> in front of Mr. Paul, so I could use FindNext('p'). However, some webpages do not have <p> like the example above."

You're not giving enough information about how your string can be recognized. Some options:

  • a fixed node structure, e.g. getChildren()[1].getChildren()[0].text
  • if it's preceded by the magic string 'BACKGROUND' as per your code, then your approach of finding the next node seems good; just don't build in an assumption that the tag name is 'p'
  • a regex (e.g. "(Mr\.|Ms\.) ...")

Can you show us an HTML example where there is no <p> in front of the name?
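
As a rough illustration of the regex option listed above, here is a sketch that uses only Python's standard re module. The pattern and the names BACKGROUND_RE, extract_backgrounds and SAMPLE are hypothetical. It assumes the wanted text directly follows the BACKGROUND heading, optionally after an opening <p>, and contains no tags of its own. Regexes over HTML are brittle, so treat it as a fallback rather than a general solution:

```python
import re

# Hypothetical pattern: match the BACKGROUND heading, skip an optional
# opening <p>, then capture everything up to the next "<".
BACKGROUND_RE = re.compile(
    r'<h2[^>]*>\s*BACKGROUND\s*</h2>\s*(?:<p[^>]*>\s*)?([^<]+)',
    re.IGNORECASE,
)

SAMPLE = '''
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
'''


def extract_backgrounds(markup):
    """Return the text captured after each BACKGROUND heading, trimmed."""
    return [match.strip() for match in BACKGROUND_RE.findall(markup)]


for text in extract_backgrounds(SAMPLE):
    print(text)
```

Because the capture stops at the first "<", any inline markup inside the biography (a link or a <b>, say) would cut the result short. That is one reason the next-node approach is usually the safer choice.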
