Python script to scrape a web page using BeautifulSoup
I am trying to scrape the contents of the following page using BeautifulSoup:
<div data-referrer="pagelet_123" id="pagelet_123">
<div id="1" class="p1">
<div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection">
<div class="clearfix uiHeaderTop">
<div>
<h4 class="uiHeaderTitle">info - 1</h4>
</div></div></div><div class="phs">
<table class="uicontenttable">
<tbody>
<tr>
<th class="label">Other</th>
<td class="data"><div id="ua94ty_3" class="uiCollapsedList uiCollapsedListHidden uiCollapsedListNoSeparate pagesListData">
<span class="visible">
<a href="http://abc.com/Federer">info-2</a>,
<a href="http://abc.com/pages/Ian-Wright-Out-of-Bounds/117602014955747">info-3</a>,
<a href="http://abc.com/JuniperNetworks">info-4</a>,
<a href="http://abc.com/pages/Join-Diaspora/118635234836351">info-5</a>
</span>
</div>
</td>
<td class="rightCol">
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div data-referrer="pagelet_ent" id="pagelet_ent">
<div id="2" class="section2">
<div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection">
<div class="clearfix uiHeaderTop">
<div>
<h4 class="uiHeaderTitle">info-6</h4>
</div></div></div>
<div class="phs"><table class="uiInfoTable mtm profileInfoTable">
<tbody>
<tr>
<th class="label">info - 7</th><td class="data">
<div class="mediaRowWrapper ">
<ul class="uiList uiListHorizontal clearfix pbl mediaRow">
<li class="uiListItem uiListHorizontalItemBorder uiListHorizontalItem">
<a href="URL - 1">
<div class="mediaPortrait">
<div style="height: 75px; width: 75px;" class="fbProfileScalableThumb photo">
<img width="87.00090480941" style="margin: -6px 0 0 -6px;" title="Hans Zimmer" alt="" src="http://profile.ak.fbcdn.net/hprofile-ak-snc4/203614_7170054127_6578457_s.jpg" class="img"></div><div class="mediaPageName">info - 8</div></div></a></li><li class="pls uiListItem uiListHorizontalItemBorder uiListHorizontalItem">
<a href="URL - 2">
<div class="mediaPortrait"><div style="height: 75px; width: 75px;" class="fbProfileScalableThumb photo"><img width="87.00090480941" style="margin: -6px 0 0 -6px;" title="test" alt="" src="http://external.ak.fbcdn.net/safe_image.php?d=AQCVRllyopjA_z5F&w=100&h=300&url=http%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F5%2F59%2F-2.jpg&fallback=hub_music&prefix=s" class="img"></div><div class="mediaPageName">test</div></div></a>
</div>
<div class="mediaPageName">info - 8
</div>
</div>
</a>
This page contains multiple nested divs and tables. I need help using BeautifulSoup to
parse only info - 1, info-2 ... info-6, plus URL - 1 and URL - 2.
I read BeautifulSoup's documentation, but it was not very helpful. Please also suggest some BeautifulSoup reference docs or books on parsing complex web pages.
Thanks for your help, appreciated!
sat
Comments (1)
Their documentation doesn't serve your purposes?
http://www.crummy.com/software/BeautifulSoup/documentation.html
It looks to me like you're going to want something like:
That code isn't tested, but it shows the general idea of how to use BeautifulSoup.
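For illustration, here is a minimal sketch of that general idea. It is not the answerer's own snippet. It assumes the `bs4` package (BeautifulSoup 4) with Python's built-in `html.parser`, which is newer than the documentation linked above. `SAMPLE_HTML` is a trimmed, tidied copy of the markup in the question, and `extract` is a hypothetical helper name. The real page may need different selectors.

```python
from bs4 import BeautifulSoup

# Trimmed, well-formed copy of the question's markup (an assumption for this sketch).
SAMPLE_HTML = """
<div id="pagelet_123">
  <h4 class="uiHeaderTitle">info - 1</h4>
  <table class="uicontenttable"><tbody><tr>
    <th class="label">Other</th>
    <td class="data"><span class="visible">
      <a href="http://abc.com/Federer">info-2</a>,
      <a href="http://abc.com/pages/Ian-Wright-Out-of-Bounds/117602014955747">info-3</a>,
      <a href="http://abc.com/JuniperNetworks">info-4</a>,
      <a href="http://abc.com/pages/Join-Diaspora/118635234836351">info-5</a>
    </span></td>
  </tr></tbody></table>
</div>
<div id="pagelet_ent">
  <h4 class="uiHeaderTitle">info-6</h4>
  <ul class="uiList uiListHorizontal clearfix pbl mediaRow">
    <li class="uiListItem"><a href="URL - 1"><div class="mediaPageName">info - 8</div></a></li>
    <li class="pls uiListItem"><a href="URL - 2"><div class="mediaPageName">test</div></a></li>
  </ul>
</div>
"""


def extract(html):
    """Return (section titles, (text, href) link pairs, media-row hrefs)."""
    soup = BeautifulSoup(html, "html.parser")

    # Section headers: every <h4> carrying the uiHeaderTitle class.
    titles = [h4.get_text(strip=True)
              for h4 in soup.find_all("h4", class_="uiHeaderTitle")]

    # Links inside <span class="visible">: keep both the text and the href.
    links = [(a.get_text(strip=True), a["href"])
             for span in soup.find_all("span", class_="visible")
             for a in span.find_all("a", href=True)]

    # hrefs of the anchors inside the media row list.
    # class_ matches one class even when the attribute lists several.
    media_urls = [a["href"]
                  for ul in soup.find_all("ul", class_="mediaRow")
                  for a in ul.find_all("a", href=True)]

    return titles, links, media_urls


if __name__ == "__main__":
    titles, links, media_urls = extract(SAMPLE_HTML)
    print(titles)
    print(links)
    print(media_urls)
```

Searching by class name rather than walking the nested divs and tables level by level keeps the code short, and it should keep working if the wrapper elements change, as long as the class names stay the same.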