Python web scraping; Beautiful Soup

Posted on 2024-12-11 20:55:00

This was covered in this post: Python web scraping involving HTML tags with attributes

But I haven't been able to do something similar for this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland?

I'm trying to scrape the values of:

  <td class="price city-2">
                                                      NZ$15.62
                                      <span style="white-space:nowrap;">(AU$12.10)</span>
                                                  </td>
  <td class="price city-1">
                                                      AU$15.82
                              </td>

Basically price city-2 and price city-1 (NZ$15.62 and AU$15.82)

Currently have:

import urllib2

from BeautifulSoup import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland?"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)

price2 = soup.findAll('td', attrs = {'class':'price city-2'})
price1 = soup.findAll('td', attrs = {'class':'price city-1'})

for price in price2:
    print price

for price in price1:
    print price

Ideally, I'd also like to have comma separated values for:

<th colspan="3" class="clickable">Food</th>,

Extracting 'Food',

<td class="item-name">Daily menu in the business district</td>

Extracting 'Daily menu in the business district'

and then the values for price city-2 and price city-1

So the printout would be:

Food, Daily menu in the business district, NZ$15.62, AU$15.82

Thanks!

水波映月 2024-12-18 20:55:00

I find BeautifulSoup awkward to use. Here is a version based on the webscraping module:

from webscraping import common, download, xpath

# download html
D = download.Download()
html = D.get('http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland')

# extract data
items = xpath.search(html, '//td[@class="item-name"]')
city1_prices = xpath.search(html, '//td[@class="price city-1"]')
city2_prices = xpath.search(html, '//td[@class="price city-2"]')

# display and format
for item, city1_price, city2_price in zip(items, city1_prices, city2_prices):
    print item.strip(), city1_price.strip(), common.remove_tags(city2_price, False).strip()

Output:

Daily menu in the business district AU$15.82 NZ$15.62
Combo meal in fast food restaurant (Big Mac Meal or similar) AU$7.40 NZ$8.16
1/2 Kg (1 lb.) of chicken breast AU$6.07 NZ$10.25
...

早茶月光 2024-12-18 20:55:00

If you load the target page's HTML into a variable htmlsource, this pyparsing webscraper will format your data in the given CSV format:

from pyparsing import *

th,thEnd = makeHTMLTags("th")
thCategory = th.setParseAction(withAttribute(**{'class':'clickable', 'colspan':'3'}))
category = thCategory.suppress() + SkipTo(thEnd)('category') + thEnd

# set up tag recognizers, with specialized patterns based on class attribute
td, tdEnd = makeHTMLTags("td")
tdWithClass = lambda cls : td.copy().setParseAction(withAttribute(**{'class':cls}))
itemTd = tdWithClass('item-name')
price1Td = tdWithClass('price city-1')
price2Td = tdWithClass('price city-2')

# define some currencies
currency = oneOf("NZ$ AU$ US$ SG$").setName("currency")

# define a currency amount as a real number
amount = Regex(r'\d+,\d{3}|\d+(\.\d+)?').setParseAction(lambda t:float(t[0].replace(',','')))

# define the format of a city value
cityval = Group((price1Td | price2Td) + currency("currency") + amount("amt") + SkipTo(tdEnd) + tdEnd)

# define a comparison item, including item name and item cost in city1 and city2
comparison = Group(itemTd + SkipTo(tdEnd)("item") + tdEnd + (cityval*2)("valuedata"))

# attach a parse action to clean up automated token naming
def assignPriceTags(t):
    for v in t[0].valuedata:
        if v['class'] == 'price city-1':
            t[0]['price1'] = v
        else:
            t[0]['price2'] = v

    # remove extraneous results names created by makeHTMLTags
    for tg in 'class tag startTd endTd empty'.split():
        del t[0][tg]
        for v in t[0].valuedata:
            del v[tg]
    del t[0]['valuedata']
comparison.setParseAction(assignPriceTags)


currentcategory = ''
for compdata in (category|comparison).searchString(htmlsource):
    if 'category' in compdata:
        currentcategory = compdata.category
        continue
    compdata = compdata[0]
    # print(compdata.dump())
    print("%s, %s, %s%s, %s%s" % (currentcategory, compdata.item,
                 compdata.price1.currency, compdata.price1.amt,
                 compdata.price2.currency, compdata.price2.amt))

prints:

Food, Daily menu in the business district, AU$15.82, NZ$15.62
Food, Combo meal in fast food restaurant (Big Mac Meal or similar), AU$7.4, NZ$7.91
Food, 1/2 Kg (1 lb.) of chicken breast, AU$6.07, NZ$10.25
Food, 1 liter (1 qt.) of whole fat milk, AU$1.8, NZ$2.65
Food, 500 gr (16 oz.) of local cheese, AU$5.99, NZ$7.2
Food, 1 kg (2 lb.) of apples, AU$4.29, NZ$3.46
Food, 2 kg (4,5 lb.) of potatoes, AU$4.31, NZ$5.29
Food, 0.5 l (16 oz) beer in the supermarket, AU$4.12, NZ$4.36
Food, 2 liters of Coca-Cola, AU$3.07, NZ$2.64
Food, bread for 2 people for 1 day, AU$2.32, NZ$1.93
Housing, monthly rent for a 85 m2 (900 Sqft) furnished apartment in expensive area of the city, AU$1766.0, NZ$2034.0
Housing, Internet 8MB (1 month), AU$49.0, NZ$61.0
Housing, 40” flat screen TV, AU$865.0, NZ$1041.0
Housing, utilities 1 month (heating, electricity, gas ...), AU$211.0, NZ$170.0
Clothes, 1 pair of Levis 501, AU$119.0, NZ$123.0
Clothes, 1 summer dress in a chain store (Zara, H&M, ...), AU$63.0, NZ$50.0
Clothes, 1 pair of Adidas trainers, AU$142.0, NZ$166.0
Clothes, 1 pair of average business shoes, AU$130.0, NZ$133.0
Transportation, Volkswagen Golf 2.0 TDI 140 CV 6 vel. (or equivalent), with no extras, new, AU$28321.0, NZ$45574.0
Transportation, 1 liter (1/4 gallon) of gas, AU$1.43, NZ$2.13
Transportation, monthly ticket public transport, AU$110.0, NZ$138.0
Personal Care, medicine against cold for 6 days (Frenadol, Coldrex, ...), AU$14.27, NZ$17.85
Personal Care, 1 box of 32 tampons (Tampax, OB, ...), AU$5.51, NZ$7.71
Personal Care, 4 rolls of toilet paper, AU$3.57, NZ$3.07
Personal Care, Tube of toothpaste, AU$3.37, NZ$3.39
Personal Care, Standard men's haircut in expat area of the city, AU$27.0, NZ$27.0
Entertainment, 2 tickets to the movies, AU$33.0, NZ$30.0
Entertainment, 2 tickets to the theater (best available seats), AU$163.0, NZ$139.0
Entertainment, dinner out for two in Italian restaurant with wine and dessert, AU$100.0, NZ$100.0
Entertainment, basic dinner out for two in neighborhood pub, AU$46.0, NZ$46.0
Entertainment, 1 cocktail drink in downtown club, AU$14.31, NZ$14.38
Entertainment, 1 beer in neighbourhood pub, AU$4.69, NZ$6.72
Entertainment, iPod nano 8GB (6th generation), AU$176.0, NZ$252.0
Entertainment, 1 min. of prepaid mobile tariff (no discounts or plans), AU$1.14, NZ$0.84
Entertainment, 1 month of gym in business district, AU$90.0, NZ$91.0
Entertainment, 1 package of Marlboro cigarretes, AU$15.97, NZ$14.47
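
A side note on the output above: because the amounts are converted to float, trailing zeros are lost (AU$7.4 rather than AU$7.40, AU$1766.0 rather than AU$1,766). The following is a minimal standard-library sketch of the same amount pattern with two-decimal formatting. The helper names parse_amount and format_price are made up for illustration and are not part of pyparsing, and it uses re.search rather than pyparsing's positional matching.

```python
import re

# Same pattern as in the answer: either a thousands-separated integer
# such as "1,766" or a plain number with an optional decimal part.
AMOUNT_RE = re.compile(r'\d+,\d{3}|\d+(\.\d+)?')


def parse_amount(text):
    """Return the first amount found in text as a float, or None.

    Hypothetical helper for illustration only.
    """
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(',', ''))


def format_price(currency, amount):
    """Render a price with exactly two decimal places (illustrative helper)."""
    return "%s%.2f" % (currency, amount)


print(format_price("AU$", parse_amount("AU$7.40")))   # AU$7.40
print(format_price("AU$", parse_amount("AU$1,766")))  # AU$1766.00
```

Whether whole-number prices should get the extra ".00" is a formatting choice; swapping "%.2f" for "%g" would print them without decimals instead.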

无敌元气妹 2024-12-18 20:55:00

Is this as intended?

import re

import requests
from bs4 import BeautifulSoup as bs

response = requests.get('https://www.expatistan.com/cost-of-living/comparison/melbourne/auckland')
soup = bs(response.content, 'lxml')
table = soup.select_one('table.comparison')

header = ''  # most recent category heading, e.g. "Food"
for row in table.select('tr:has(.clickable, .item-name, .price)'):
    # category rows carry a th.clickable cell
    if row.select_one('.clickable'):
        header = row.select_one('th').text
    # item rows carry the item name and the price cells
    if row.select_one('.item-name, td.price'):
        cells = [re.sub(r'\s+',' ',re.sub(r'\n','',item.text.strip())) for item in row.select('.item-name, td.price')]
        cells.insert(0, header)
        print(cells)
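
If you want exactly the line asked for in the question (category, item, city-2 price, city-1 price, without the converted "(AU$12.10)" value), here is a rough offline sketch that uses only the standard library's html.parser instead of BeautifulSoup. SAMPLE_HTML is a hand-made fragment assembled from the snippets in the question, not the live page, and the class name PriceTableParser is made up for illustration. It assumes each class attribute matches exactly (e.g. "price city-2") and that each item row has all three cells.

```python
from html.parser import HTMLParser

# Hand-made fragment based on the snippets in the question (assumption:
# the real page nests these cells in rows roughly like this).
SAMPLE_HTML = """
<table class="comparison">
  <tr><th colspan="3" class="clickable">Food</th></tr>
  <tr>
    <td class="item-name">Daily menu in the business district</td>
    <td class="price city-2">
        NZ$15.62
        <span style="white-space:nowrap;">(AU$12.10)</span>
    </td>
    <td class="price city-1">
        AU$15.82
    </td>
  </tr>
</table>
"""

CELL_CLASSES = ("item-name", "price city-1", "price city-2")
VOID_TAGS = ("br", "img", "hr", "input")  # tags that never get a closing tag


class PriceTableParser(HTMLParser):
    """Collects (category, item, city-2 price, city-1 price) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.category = ""
        self._field = None  # cell we are currently inside, if any
        self._depth = 0     # nesting depth of child tags inside that cell
        self._text = []     # text pieces found directly inside the cell
        self._row = {}      # cells collected for the current item row

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1  # nested tag, such as the <span>
            return
        cls = dict(attrs).get("class") or ""
        if tag == "th" and cls == "clickable":
            self._field = "category"
        elif tag == "td" and cls in CELL_CLASSES:
            self._field = cls
        self._text = []

    def handle_data(self, data):
        # keep only text directly inside the cell, not inside nested tags
        if self._field is not None and self._depth == 0:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._field is None or tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        value = " ".join("".join(self._text).split())  # collapse whitespace
        if self._field == "category":
            self.category = value
        else:
            self._row[self._field] = value
            if len(self._row) == len(CELL_CLASSES):
                self.rows.append((self.category, self._row["item-name"],
                                  self._row["price city-2"],
                                  self._row["price city-1"]))
                self._row = {}
        self._field = None


parser = PriceTableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(", ".join(row))
# prints: Food, Daily menu in the business district, NZ$15.62, AU$15.82
```

Some item names on the page contain commas themselves (the Volkswagen Golf row, for example), so for a real CSV file the standard csv module is a safer choice than joining with ", ", since it quotes such fields.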
