Python web scraping with Beautiful Soup
This was covered in this post: Python web scraping involving HTML tags with attributes
But I haven't been able to do something similar for this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland?
I'm trying to scrape the values of:
<td class="price city-2">
NZ$15.62
<span style="white-space:nowrap;">(AU$12.10)</span>
</td>
<td class="price city-1">
AU$15.82
</td>
Basically, I want the values in the price city-2 and price city-1 cells (NZ$15.62 and AU$15.82).
This is what I currently have:
# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland?"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

# Find every price cell for each city by its exact class attribute
price2 = soup.findAll('td', attrs={'class': 'price city-2'})
price1 = soup.findAll('td', attrs={'class': 'price city-1'})

# Prints each matching <td> element in full, markup included
for price in price2:
    print price
for price in price1:
    print price
Ideally, I'd also like comma-separated output that includes the category and the item name. From
<th colspan="3" class="clickable">Food</th>
I'd extract 'Food', and from
<td class="item-name">Daily menu in the business district</td>
I'd extract 'Daily menu in the business district',
followed by the values for price city-2 and price city-1.
So the printout would be:
Food, Daily menu in the business district, NZ$15.62, AU$15.82
Thanks!
Comments (3)
I find BeautifulSoup awkward to use. Here is a version based on the webscraping module:
Output:
If you load the target page's HTML into a variable htmlsource, this pyparsing webscraper will format your data in the given CSV format:
prints:
Is this as intended?
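For reference, here is a minimal sketch of one way to produce the requested line. It assumes Python 3 and the bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used in the question. SAMPLE_HTML is a hypothetical fragment modeled on the snippets in the question, not the live page, and the row layout (a header row followed by item rows) is an assumption. The helper names first_text and extract_rows are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the question's snippets; the real page
# layout may differ (assumption: one <th> header row, then item rows).
SAMPLE_HTML = """
<table>
  <tr><th colspan="3" class="clickable">Food</th></tr>
  <tr>
    <td class="item-name">Daily menu in the business district</td>
    <td class="price city-1">
      AU$15.82
    </td>
    <td class="price city-2">
      NZ$15.62
      <span style="white-space:nowrap;">(AU$12.10)</span>
    </td>
  </tr>
</table>
"""


def first_text(cell):
    """Return the first non-empty text in the cell, i.e. the price that
    appears before any nested <span> with the converted amount."""
    return next(cell.stripped_strings)


def extract_rows(html):
    """Return [category, item name, city-2 price, city-1 price] per item row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    category = None
    for tr in soup.find_all("tr"):
        header = tr.find("th", class_="clickable")
        if header is not None:
            # A header row starts a new category such as 'Food'.
            category = header.get_text(strip=True)
            continue
        name = tr.find("td", class_="item-name")
        price2 = tr.select_one("td.price.city-2")
        price1 = tr.select_one("td.price.city-1")
        if name is not None and price2 is not None and price1 is not None:
            rows.append([
                category,
                name.get_text(strip=True),
                first_text(price2),
                first_text(price1),
            ])
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(", ".join(row))
```

Taking only the first text node of each price cell is what drops the "(AU$12.10)" conversion inside the span. If an item name can itself contain a comma, writing the rows with the standard csv module would be safer than joining with ", ".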